Understanding Embeddings: The Semantic Backbone of LLMs

What Are Embeddings?

When you prompt an LLM to help debug your code, your words don't stay as text for long. They're instantly transformed into vectors—lists of numbers that capture meaning in a form machines can process. This transformation is called embedding, and it's the gateway between human language and machine understanding.

Think of embeddings as a translation layer. Just as a traveler needs a phrasebook to communicate in a foreign country, language models need embeddings to understand text. But instead of translating English to French, embeddings translate words into coordinates in a mathematical space where semantic relationships become geometric relationships.

In this space, words with similar meanings cluster together. "Cat" and "dog" end up near each other, while "strawberry" sits far away. This isn't magic—it's mathematics carefully designed to preserve meaning.

The Evolution of Embeddings

Embeddings didn't start with modern AI. They evolved over decades, growing from simple statistical tricks to sophisticated neural networks:

  • Traditional Methods (1990s-2000s): Count-based approaches like TF-IDF and co-occurrence matrices
  • Static Word Embeddings (2013-2015): Word2Vec, GloVe, FastText
  • Contextualized Embeddings (2018-present): ELMo, BERT, GPT, modern LLMs

The key shift happened when we moved from static to contextual embeddings. Early methods gave every word a single, fixed vector. The word "bank" always had the same representation, whether you meant a river bank or a financial institution. Modern approaches generate different vectors based on context, understanding that "bank" in "river bank" means something completely different from "bank" in "bank robbery."

Note: In technical literature, "embeddings" can refer to both the initial token vectors and the contextual representations produced by deeper layers. This can be confusing, but both are embeddings—just at different stages of processing.

What Makes a Good Embedding?

Before diving into techniques, let's understand what we're optimizing for:

1. Semantic Representation

Good embeddings capture meaning. Words with similar meanings should have similar vectors. This allows a model to understand that "happy," "joyful," and "pleased" are related, even if it has never seen them used together.

2. Dimensionality

Embeddings are vectors, and vectors have size. Should an embedding be 50 dimensions? 300? 1,536?

  • Smaller dimensions (50-100): Fast and memory-efficient, but may miss subtle relationships
  • Larger dimensions (300-1000+): Capture more nuance but require more compute and risk overfitting

For reference: GPT-2 uses 768 dimensions, while DeepSeek-V3 uses 7,168. The choice depends on model size and task complexity.

Traditional Embedding Techniques

Early embedding methods used a clever intuition: words that appear together probably have related meanings. If "coffee" and "hot" frequently appear near each other in text, they're likely connected semantically.

TF-IDF: The Word Importance Score

TF-IDF (Term Frequency-Inverse Document Frequency) is one of the simplest embedding approaches. Instead of complex neural networks, it uses basic statistics to answer: "How important is this word to this document?"

Here's the intuition:

Step 1: How often does the word appear here?
If "cat" appears 5 times in a 100-word document, it's clearly important to that document. This is Term Frequency.

Step 2: Is this word common everywhere?
Words like "the," "a," and "is" appear in almost every document. They're not special. But if "quantum" only appears in 2 out of 100 documents, it's probably meaningful. This is Inverse Document Frequency—rarer words get higher scores.

Step 3: Combine both insights
Multiply the two scores together. Words that are both frequent in this document and rare across all documents get the highest importance scores.

A Simple Example

Imagine you have 100 documents and you're analyzing the word "machine":

  • It appears 8 times in your current document (which has 200 words total)
  • It appears in only 10 of the 100 documents

Term Frequency: 8/200 = 4% of the document
Inverse Document Frequency: The word appears in 10% of documents, so it's relatively rare
TF-IDF Score: Combine these to get an importance score

The math behind this is simple multiplication and logarithms, but you don't need to memorize formulas. The key insight is: frequent here, rare everywhere else = important.
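As a rough sketch, here is that calculation in plain Python, using the natural logarithm (real implementations such as scikit-learn add smoothing, so exact values differ):

import math

# Counts from the example above
term_count = 8         # "machine" appears 8 times in this document
doc_length = 200       # the document has 200 words
docs_with_term = 10    # "machine" appears in 10 of the documents
total_docs = 100       # corpus size

tf = term_count / doc_length                  # 0.04
idf = math.log(total_docs / docs_with_term)   # ln(10) ≈ 2.30
print(f"TF-IDF score for 'machine': {tf * idf:.4f}")  # ≈ 0.0921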

What TF-IDF Misses

When we visualize TF-IDF embeddings, we see a problem:

  1. Most words cluster in one area—the embeddings lack diversity
  2. Semantic relationships are missing—"king" and "queen" aren't necessarily close together

TF-IDF works for simple tasks like keyword extraction or document search, but it can't capture meaning. That's where neural approaches come in.

Word2Vec: Learning Meaning from Context

Word2Vec, introduced in 2013, was revolutionary. Instead of counting words, it used a neural network to learn embeddings by predicting context.

The Core Idea

Word2Vec trains a simple neural network on a fake task: predict missing words from their neighbors.

For example, given: "The cat sat on the ___"
The model learns to predict: "mat"

The trick? We don't actually care about the predictions. We care about the internal representations the network builds while learning to predict. Those internal representations become our embeddings.

Two Approaches

CBOW (Continuous Bag of Words): Given surrounding words, predict the middle word

  • Input: ["the", "cat", "on", "the"] → Predict: "sat"

Skip-gram: Given a word, predict its surrounding words

  • Input: "sat" → Predict: ["the", "cat", "on", "the"]

Why This Works

By training on billions of sentences, the network learns that words used in similar contexts must have similar meanings. "Cat" and "dog" often appear with words like "pet," "animal," "feed," so their embeddings naturally cluster together.

When visualized, Word2Vec embeddings show clear semantic structure. Related words form neighborhoods. You can even do vector arithmetic:

king - man + woman ≈ queen

This works because the "gender direction" in the embedding space is consistent across similar relationships.

Performance Note: Word2Vec is fast and effective. You can train it on moderate hardware and still get meaningful results. The pretrained Google News Word2Vec model contains 3 million words with 300-dimensional embeddings.
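For a hands-on feel, here is a minimal sketch using the gensim library (an assumption; any Word2Vec implementation works similarly). The corpus here is a toy; with billions of real sentences, the commented analogy query approximates the king/queen example above:

from gensim.models import Word2Vec

# Toy corpus: in practice you would train on millions of sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "are", "loyal", "pets"],
    ["cats", "and", "dogs", "are", "popular", "pets"],
]

# sg=1 selects Skip-gram; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(model.wv["cat"][:5])            # first values of the "cat" vector
print(model.wv.most_similar("cat"))   # nearest neighbors in embedding space
# With a large corpus, analogies work too:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])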

BERT: The Context Revolution

BERT (Bidirectional Encoder Representations from Transformers) changed everything in 2018. Instead of one fixed vector per word, BERT generates different vectors based on context.

The Architecture

BERT has four main components:

  1. Tokenizer: Breaks text into tokens (subwords)
  2. Embedding Layer: Converts tokens to initial vectors
  3. Transformer Encoders: Multiple layers that process and update representations
  4. Task Heads: Specialized outputs for different tasks

What Makes BERT Special

Bidirectional Context: BERT reads both left and right context simultaneously. When processing the word "bank," it sees both "river" before it and "eroded" after it, understanding it's about geography, not finance.

Training Tasks:

  1. Masked Language Modeling: Predict hidden words (see the sketch after this list)
    "I [MASK] this book" → predict "read"

  2. Next Sentence Prediction: Determine if sentence B follows sentence A
    Helps understand relationships between sentences
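As a quick illustration of masked language modeling, here is a sketch using the Hugging Face transformers fill-mask pipeline (assuming the library and the bert-base-uncased weights are available):

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT proposes plausible words for the [MASK] position, with confidence scores
for prediction in fill("I [MASK] this book"):
    print(prediction["token_str"], round(prediction["score"], 3))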

Dynamic Embeddings

Here's the key insight: BERT's embeddings change based on context. The token "bank" starts with one vector, but as it flows through 12+ layers of transformers, each layer updates it based on surrounding words.

By the final layer, "bank" in "river bank" has a completely different representation than "bank" in "savings bank," even though they started with the same token embedding.
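A sketch of this effect, using the Hugging Face transformers library and bert-base-uncased (an assumption; any BERT checkpoint behaves similarly). The two "bank" vectors come out noticeably different:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]  # final-layer vector for "bank"

v1 = bank_vector("The river bank was eroded by the storm")
v2 = bank_vector("She opened a savings account at the bank")

cos = torch.nn.functional.cosine_similarity(v1, v2, dim=0)
print(f"Cosine similarity of the two 'bank' vectors: {cos.item():.3f}")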

This is why BERT and similar models are called contextual or dynamic embedding models.

Modern LLM Embeddings

Today's large language models like GPT-4, Claude, and DeepSeek use sophisticated embedding layers that are trained end-to-end with the entire model.

How LLM Embeddings Work

1. Token Embeddings
Each token gets a learnable vector. These are stored in an embedding matrix: Vocabulary Size × Embedding Dimension

2. Positional Embeddings
Since transformers process all tokens simultaneously, we add position information so the model knows where each token sits in the sequence. Some models learn these position vectors; others compute them (for example, sinusoidal or rotary encodings).

3. Combined Representation
Token embedding + Positional embedding = Initial representation fed to the first transformer layer

4. Contextual Transformations
As representations flow through transformer layers, self-attention and feed-forward networks continuously update them based on context.

The Lookup Table Concept

The embedding layer works like a lookup table:

import torch
import torch.nn as nn

# Create embedding layer: 50,000 vocabulary, 768 dimensions
embedding = nn.Embedding(50000, 768)

# Token IDs: [123, 456, 789]
token_ids = torch.tensor([123, 456, 789])

# Get embeddings: returns 3 vectors of 768 dimensions each
embeddings = embedding(token_ids)

Each token ID simply selects the corresponding row from the embedding matrix. No complex computation—just array indexing.
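Building on that lookup, here is a minimal sketch of steps 2 and 3 above (token plus positional embeddings), with hypothetical sizes and learned position vectors:

import torch
import torch.nn as nn

vocab_size, max_len, dim = 50000, 1024, 768   # hypothetical sizes
token_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)

token_ids = torch.tensor([[123, 456, 789]])               # (batch, seq_len)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2]]

# Token embedding + positional embedding = input to the first transformer layer
x = token_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 3, 768])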

Training Embeddings

Unlike Word2Vec, which is trained separately, LLM embeddings are trained as part of the entire model. They're optimized specifically for the model's task, making them more powerful but also more specialized.

Reference: DeepSeek-R1-Distill-Qwen-1.5B uses 1,536-dimensional embeddings. Larger models like DeepSeek-V3 use 7,168 dimensions.

Creating Embeddings with Gemini

Let's see how to create embeddings in practice using Google's Gemini API. This is one of the simplest ways to get started with embeddings.

Setup

First, install the required package and set up your API key:

pip install google-generativeai

Then configure the client in Python:

import google.generativeai as genai

# Configure with your API key
genai.configure(api_key="YOUR_API_KEY")

Generate Embeddings

# Choose the embedding model
model = 'models/text-embedding-004'

# Text to embed
texts = [
    "The cat sat on the mat",
    "Dogs are loyal companions",
    "Python is a programming language",
    "Machine learning models process data"
]

# Generate embeddings
embeddings = []
for text in texts:
    result = genai.embed_content(
        model=model,
        content=text,
        task_type="retrieval_document"
    )
    embeddings.append(result['embedding'])

print(f"Embedding dimension: {len(embeddings[0])}")
print(f"First few values: {embeddings[0][:5]}")

Finding Similarity

Once you have embeddings, you can compute similarity between texts:

import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

# Compare first two sentences
similarity = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity between 'cat' and 'dog' sentences: {similarity:.4f}")

# Compare cat/dog vs cat/programming
similarity2 = cosine_similarity(embeddings[0], embeddings[2])
print(f"Similarity between 'cat' and 'Python' sentences: {similarity2:.4f}")

Expected output (exact values will vary):

Embedding dimension: 768
Similarity between 'cat' and 'dog' sentences: 0.7234
Similarity between 'cat' and 'Python' sentences: 0.3421

The first pair (cat and dog sentences) has higher similarity because both discuss animals, while the cat and programming sentences are semantically distant.

Task-Specific Embeddings

Gemini allows you to specify the task type for optimized embeddings:

# For search queries
query_embedding = genai.embed_content(
    model=model,
    content="What are the best practices for training neural networks?",
    task_type="retrieval_query"
)

# For documents to be searched
doc_embedding = genai.embed_content(
    model=model,
    content="Neural network training requires careful hyperparameter tuning...",
    task_type="retrieval_document"
)

# For semantic similarity tasks
similarity_embedding = genai.embed_content(
    model=model,
    content="This is a sample sentence",
    task_type="semantic_similarity"
)

Different task types optimize the embeddings for specific use cases like search, classification, or clustering.

Visualizing Embeddings as Networks

One powerful way to understand embeddings is viewing them as networks:

  • Nodes: Tokens
  • Edges: Connect tokens with similar embeddings (high cosine similarity)

For the sentence "AI agents will be the most hot topic of artificial intelligence in 2025", we can:

  1. Get embeddings for each token
  2. Find the 20 most similar tokens to each
  3. Draw connections between related tokens

The resulting graph reveals semantic clusters:

  • "AI," "artificial," "intelligence" form one cluster
  • "topic," "hot" form another
  • Time-related terms cluster together

These visualizations make the abstract concept of embedding space tangible.
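A sketch of how such a graph could be built, assuming networkx and numpy are installed and that embeddings is a dict mapping tokens to vectors (for example, produced with the Gemini calls shown earlier). Here a simple similarity threshold stands in for the top-20 neighbor rule:

import networkx as nx
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_graph(embeddings, threshold=0.5):
    # Nodes are tokens; edges connect pairs whose cosine similarity exceeds the threshold
    graph = nx.Graph()
    tokens = list(embeddings)
    graph.add_nodes_from(tokens)
    for i, t1 in enumerate(tokens):
        for t2 in tokens[i + 1:]:
            score = cosine(embeddings[t1], embeddings[t2])
            if score > threshold:
                graph.add_edge(t1, t2, weight=score)
    return graph

# Example: graph = similarity_graph(token_embeddings); nx.draw(graph, with_labels=True)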

Token Variations

Embeddings capture subtle variations. The word "list" might appear as:

  • "list" (lowercase)
  • "List" (capitalized)
  • "_list" (with prefix)
  • "lists" (plural)

Each gets its own embedding, but they cluster tightly together in embedding space because they're semantically similar.
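To see that these variants really are distinct tokens, here is a quick sketch with the GPT-2 tokenizer from Hugging Face (an assumption; any subword tokenizer shows the same effect):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Each variant maps to its own token ID(s), and therefore its own embedding row(s)
for variant in ["list", "List", " list", "lists"]:
    print(f"{variant!r:>8} -> token IDs {tokenizer.encode(variant)}")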

Practical Implications

Understanding embeddings helps you:

1. Debug Model Behavior

If your model confuses two concepts, check if their embeddings are too similar. This reveals training data biases or vocabulary issues.

2. Improve Retrieval Systems

Semantic search relies on finding documents with similar embeddings to your query. Better embeddings = better search results.

3. Detect Bias

Embeddings encode societal biases present in training data. Analyzing embedding geometry can reveal problematic associations (e.g., gender stereotypes).

4. Fine-tune Efficiently

When fine-tuning models, you can sometimes freeze most layers and only update embeddings, drastically reducing compute requirements.

5. Build Better Prompts

Understanding that similar words have similar embeddings helps you craft prompts that guide the model toward your intended meaning.

Key Takeaways

Embeddings are representations, not magic
They're learned patterns from data, encoded as numbers. Understanding this demystifies how LLMs work.

Context matters
Modern embeddings are dynamic. The same word means different things in different contexts, and embeddings reflect this.

Bigger isn't always better
Larger embedding dimensions capture more nuance but require more data and compute. There's a sweet spot for each use case.

Embeddings encode biases
Whatever patterns exist in training data—including harmful biases—get encoded in embeddings. This is both a feature and a challenge.

They're trainable
Unlike fixed lookup tables, embedding layers learn during training, optimizing for the specific task at hand.

Conclusion

Embeddings bridge the gap between discrete symbols (words, tokens) and continuous mathematics (neural networks). They're the reason LLMs can understand nuance, context, and meaning.

From simple statistical methods like TF-IDF to sophisticated contextual representations in modern transformers, embeddings have evolved to become increasingly powerful. Today's models don't just map words to vectors—they create rich, context-aware representations that capture the subtleties of language.

The next time you interact with an AI, remember: your words are immediately transformed into high-dimensional vectors, where semantic relationships become geometric ones. That transformation—that embedding—is where the magic begins.
