This chapter covers text embeddings—the most mature and widely used embedding type. We explore what text embeddings are, when to use them, and the practical applications they enable. An optional advanced section explains how the underlying models learn to create these representations.
4.1 What Are Text Embeddings?
Text embeddings convert words, sentences, or documents into dense numerical vectors that capture semantic meaning. Unlike simple approaches like bag-of-words or TF-IDF, embeddings understand that “happy” and “joyful” are related, even though they share no letters.
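To make the contrast concrete, here is a minimal sketch (assuming scikit-learn and the all-MiniLM-L6-v2 model used throughout this chapter) showing that TF-IDF sees no overlap between "happy" and "joyful", while an embedding model places them close together:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

texts = ["happy", "joyful"]

# TF-IDF: the two words share no tokens, so their vectors are orthogonal
tfidf = TfidfVectorizer().fit_transform(texts)
print(f"TF-IDF similarity:    {cosine_similarity(tfidf[0], tfidf[1])[0][0]:.3f}")  # 0.000

# Embeddings: semantic relatedness shows up as a high cosine similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode(texts)
print(f"Embedding similarity: {cosine_similarity([emb[0]], [emb[1]])[0][0]:.3f}")  # typically well above 0
```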
The key insight: text that appears in similar contexts should have similar embeddings. This emerges from training on massive text corpora where the model learns to predict words from their surrounding context.
4.2 Word Embeddings
The foundation of modern NLP, word embeddings map individual words to vectors:
"""Word Embeddings: From Words to VectorsWord embeddings capture semantic relationships between individual words.Words with similar meanings cluster together in the embedding space."""import osos.environ["TOKENIZERS_PARALLELISM"] ="false"from sentence_transformers import SentenceTransformerfrom sklearn.metrics.pairwise import cosine_similarityimport numpy as np# Use a sentence model to embed individual wordsmodel = SentenceTransformer('all-MiniLM-L6-v2')# Embed words across different categorieswords = {'animals': ['cat', 'dog', 'elephant', 'whale'],'vehicles': ['car', 'truck', 'airplane', 'boat'],'colors': ['red', 'blue', 'green', 'yellow'],}all_words = [w for group in words.values() for w in group]embeddings = model.encode(all_words)# Show that words cluster by categoryprint("Word similarities (same category = higher similarity):\n")print("Within categories:")print(f" cat ↔ dog: {cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]:.3f}")print(f" car ↔ truck: {cosine_similarity([embeddings[4]], [embeddings[5]])[0][0]:.3f}")print(f" red ↔ blue: {cosine_similarity([embeddings[8]], [embeddings[9]])[0][0]:.3f}")print("\nAcross categories:")print(f" cat ↔ car: {cosine_similarity([embeddings[0]], [embeddings[4]])[0][0]:.3f}")print(f" dog ↔ red: {cosine_similarity([embeddings[1]], [embeddings[8]])[0][0]:.3f}")
Word similarities (same category = higher similarity):
Within categories:
cat ↔ dog: 0.661
car ↔ truck: 0.689
red ↔ blue: 0.729
Across categories:
cat ↔ car: 0.463
dog ↔ red: 0.377
Key characteristics:
One vector per word (static, context-independent in classic models)
Typically 100-300 dimensions
Captures synonyms, analogies, and semantic relationships
4.3 Sentence and Document Embeddings
Modern applications need to embed entire sentences or documents:
"""Sentence Embeddings: Capturing Complete ThoughtsSentence embeddings represent the meaning of entire sentences,enabling semantic search and similarity comparison."""from sentence_transformers import SentenceTransformerfrom sklearn.metrics.pairwise import cosine_similaritymodel = SentenceTransformer('all-MiniLM-L6-v2')# Sentences with similar meaning but different wordssentences = ["The quick brown fox jumps over the lazy dog","A fast auburn fox leaps above a sleepy canine","Machine learning models require lots of training data","AI systems need substantial amounts of examples to learn",]embeddings = model.encode(sentences)print("Sentence similarities:\n")print("Similar meaning (paraphrases):")print(f" Sentence 1 ↔ 2: {cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]:.3f}")print(f" Sentence 3 ↔ 4: {cosine_similarity([embeddings[2]], [embeddings[3]])[0][0]:.3f}")print("\nDifferent topics:")print(f" Sentence 1 ↔ 3: {cosine_similarity([embeddings[0]], [embeddings[2]])[0][0]:.3f}")
Sentence similarities:
Similar meaning (paraphrases):
Sentence 1 ↔ 2: 0.733
Sentence 3 ↔ 4: 0.525
Different topics:
Sentence 1 ↔ 3: -0.024
4.4 When to Use Text Embeddings
Text embeddings are the right choice for:
Text classification, clustering, and sentiment analysis (see Section 4.6)
Question answering and RAG systems (see Chapter 11)
Chatbots and conversational AI—intent matching, response selection (see Section 11.6)
Summarization—finding representative sentences (see Section 11.7)
Semantic search—finding documents by meaning (see Chapter 12)
Recommendation systems—content-based filtering (see Chapter 13)
Customer support—ticket routing, finding similar issues (see Chapter 26)
Content moderation—detecting similar problematic content (see Section 26.7)
Duplicate and near-duplicate detection (see Chapter 28)
Entity resolution—matching names and descriptions (see Chapter 28)
Machine translation—cross-lingual embeddings (see Chapter 35)
4.6 Classification, Clustering, and Sentiment Analysis
Once you have text embeddings, three foundational tasks become straightforward: classification (assigning labels), clustering (discovering groups), and sentiment analysis (a special case of classification). All three leverage the same principle—similar texts have similar embeddings.
Classification with embeddings:
Train a simple classifier on top of frozen embeddings, or use nearest-neighbor approaches. The k-NN (k-nearest neighbors) method shown below works as follows: during training, embed each text and store the embedding alongside its label. To predict a new text’s label, embed it, find the k training embeddings most similar to it (using cosine similarity), and return the most common label among those neighbors.
```python
import numpy as np
from collections import Counter

class EmbeddingClassifier:
    """Simple k-NN classifier using embeddings."""

    def __init__(self, encoder, k: int = 5):
        self.encoder = encoder
        self.k = k
        self.embeddings = []
        self.labels = []

    def fit(self, texts: list, labels: list):
        """Embed texts and store with their labels."""
        self.embeddings = [self.encoder.encode(text) for text in texts]
        self.labels = labels

    def predict(self, text: str) -> str:
        """Predict label using k-NN."""
        query_emb = self.encoder.encode(text)

        # Cosine similarity: (A · B) / (||A|| × ||B||)
        distances = []
        for i, emb in enumerate(self.embeddings):
            dist = np.dot(query_emb, emb) / (np.linalg.norm(query_emb) * np.linalg.norm(emb))
            distances.append((dist, self.labels[i]))

        # Get k nearest neighbors
        distances.sort(reverse=True)
        k_nearest = [label for _, label in distances[: self.k]]

        # Return most common label
        return Counter(k_nearest).most_common(1)[0][0]

# Example: Sentiment classification
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
classifier = EmbeddingClassifier(encoder, k=3)
classifier.fit(
    texts=["Great product!", "Loved it", "Terrible", "Waste of money", "Amazing quality"],
    labels=["positive", "positive", "negative", "negative", "positive"],
)
print(f"Prediction for 'This is wonderful!': {classifier.predict('This is wonderful!')}")
```
Prediction for 'This is wonderful!': positive
Tip: Classification Best Practices
Few-shot is often enough: With good embeddings, 10-50 examples per class often suffice (see Chapter 16 for few-shot techniques)
k-NN for simplicity: No training required, just store examples
Logistic regression for speed: Train a simple linear classifier on embeddings (see the sketch after this list)
Fine-tune for best quality: When you have thousands of examples, fine-tune the embedding model itself (see Chapter 14)
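As a sketch of the logistic-regression option above—a minimal example assuming scikit-learn is installed and reusing the `encoder` from the k-NN example—training on frozen embeddings looks like this:

```python
from sklearn.linear_model import LogisticRegression

texts = ["Great product!", "Loved it", "Terrible", "Waste of money", "Amazing quality"]
labels = ["positive", "positive", "negative", "negative", "positive"]

# Embed once, then train a fast linear classifier on the frozen embeddings
X = encoder.encode(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

print(clf.predict(encoder.encode(["This is wonderful!"])))  # likely: ['positive']
```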
Clustering with embeddings:
Clustering discovers natural groups in your data without predefined labels. Since similar texts have similar embeddings, texts on the same topic will cluster together in embedding space.
K-means is a popular clustering algorithm. You specify k (the number of clusters), and the algorithm finds k groups by positioning a centroid (center point) for each cluster. Each text belongs to the cluster whose centroid is closest to its embedding.
The algorithm works as follows: first, embed all texts. Then pick k random embeddings as initial centroids—the algorithm needs starting points before it can begin refining. Next, iterate until convergence: (1) assign each embedding to its nearest centroid (measured by Euclidean distance), and (2) update each centroid to be the mean of its assigned embeddings. The algorithm converges when assignments stop changing.
Figure 4.1: K-means clustering in 2D embedding space. Points are colored by cluster assignment, with centroids marked as X.
```python
import numpy as np
from typing import List, Dict

class EmbeddingClusterer:
    """K-means clustering on text embeddings."""

    def __init__(self, encoder, n_clusters: int = 3):
        self.encoder = encoder
        self.n_clusters = n_clusters
        self.centroids = None

    def fit(self, texts: List[str], max_iters: int = 100):
        """Cluster texts and return assignments."""
        embeddings = np.array([self.encoder.encode(text) for text in texts])

        # Initialize centroids by picking k random embeddings as starting points
        indices = np.random.choice(len(embeddings), self.n_clusters, replace=False)
        self.centroids = embeddings[indices].copy()

        for _ in range(max_iters):
            # Assign points to nearest centroid
            assignments = []
            for emb in embeddings:
                distances = [np.linalg.norm(emb - c) for c in self.centroids]
                assignments.append(np.argmin(distances))

            # Update centroids
            new_centroids = []
            for i in range(self.n_clusters):
                cluster_points = embeddings[np.array(assignments) == i]
                if len(cluster_points) > 0:
                    new_centroids.append(cluster_points.mean(axis=0))
                else:
                    new_centroids.append(self.centroids[i])
            self.centroids = np.array(new_centroids)

        return assignments

    def get_cluster_examples(self, texts: List[str], assignments: List[int]) -> Dict[int, List[str]]:
        """Group texts by cluster."""
        clusters = {i: [] for i in range(self.n_clusters)}
        for text, cluster_id in zip(texts, assignments):
            clusters[cluster_id].append(text)
        return clusters

# Example: Topic discovery
texts = [
    # Cooking
    "Chop the onions and garlic finely",
    "Simmer the sauce for twenty minutes",
    "Season with salt and pepper to taste",
    "Preheat the oven to 350 degrees",
    # Space
    "The telescope discovered a new exoplanet",
    "Astronauts completed their spacewalk today",
    "The Mars rover collected soil samples",
    "A new comet is visible this month",
    # Weather
    "Heavy rain expected throughout the weekend",
    "Temperatures will drop below freezing tonight",
    "A warm front is moving in from the south",
    "Clear skies and sunshine forecast for Monday",
]

np.random.seed(42)  # For reproducible results
clusterer = EmbeddingClusterer(encoder, n_clusters=3)
assignments = clusterer.fit(texts)
clusters = clusterer.get_cluster_examples(texts, assignments)

for cluster_id, examples in clusters.items():
    print(f"\nCluster {cluster_id}:")
    for text in examples:
        print(f"  - {text}")
```
Cluster 0:
- The Mars rover collected soil samples
- A new comet is visible this month
- A warm front is moving in from the south
Cluster 1:
- Preheat the oven to 350 degrees
- Astronauts completed their spacewalk today
- Heavy rain expected throughout the weekend
- Temperatures will drop below freezing tonight
- Clear skies and sunshine forecast for Monday
Cluster 2:
- Chop the onions and garlic finely
- Simmer the sauce for twenty minutes
- Season with salt and pepper to taste
- The telescope discovered a new exoplanet
Notice that some items may appear in unexpected clusters. Embeddings capture semantic similarity that doesn’t always match our intuitive topic categories—“Preheat the oven to 350 degrees” mentions temperature, which may pull it toward weather texts, while “A warm front is moving in” uses directional language similar to space descriptions. This is a feature, not a bug: embeddings capture meaning patterns that humans might overlook.
Tip: Clustering Best Practices
Choose k carefully: Use the elbow method or silhouette scores to find the optimal cluster count (see the sketch after this list)
HDBSCAN for unknown k: Unlike k-means, HDBSCAN doesn’t require specifying cluster count upfront—it discovers clusters based on density and labels sparse points as outliers rather than forcing them into clusters
Reduce dimensions first: For visualization, use UMAP or t-SNE on embeddings
Label clusters post-hoc: Examine cluster members to assign meaningful names
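To make "choose k carefully" concrete, here is a minimal sketch—assuming scikit-learn's KMeans and silhouette_score, and reusing `encoder` and the `texts` list from the clustering example above—that compares candidate cluster counts by silhouette score (higher is better):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Reuse `encoder` and `texts` from the clustering example above
embeddings = np.array([encoder.encode(t) for t in texts])

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(embeddings)
    print(f"k={k}: silhouette score = {silhouette_score(embeddings, labels):.3f}")
```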
Sentiment analysis:
Sentiment analysis determines whether text expresses positive, negative, or neutral opinions. While you could treat this as classification (train on labeled examples), an elegant alternative uses anchor texts—words or phrases with known sentiment.
The approach works as follows: embed a set of clearly positive words (“excellent”, “amazing”, “love it”) and compute their centroid. Do the same for negative words. To analyze new text, embed it and measure which centroid it’s closer to. The difference in distances gives both a label and a confidence score—text much closer to the positive centroid is strongly positive.
Figure 4.2: Sentiment analysis using anchor texts. Positive and negative anchor words form centroids. New text is classified by which centroid it’s closer to.
```python
import numpy as np
from typing import Tuple

class SentimentAnalyzer:
    """Embedding-based sentiment analysis using anchor texts."""

    def __init__(self, encoder):
        self.encoder = encoder

        # Anchor texts define the sentiment space
        positive_anchors = ["excellent", "amazing", "wonderful", "fantastic", "love it"]
        negative_anchors = ["terrible", "awful", "horrible", "hate it", "worst ever"]

        # Compute anchor centroids: average the embeddings of all anchor words
        # to find the "center" of positive/negative regions in embedding space.
        # Using multiple anchors makes the centroid more robust than any single word.
        self.positive_centroid = np.mean(
            [encoder.encode(t) for t in positive_anchors], axis=0
        )
        self.negative_centroid = np.mean(
            [encoder.encode(t) for t in negative_anchors], axis=0
        )

    def analyze(self, text: str) -> Tuple[str, float]:
        """
        Return sentiment label and confidence score.
        Score ranges from -1 (negative) to +1 (positive).
        """
        emb = self.encoder.encode(text)

        # Cosine similarity: (A · B) / (||A|| × ||B||)
        pos_sim = np.dot(emb, self.positive_centroid) / (
            np.linalg.norm(emb) * np.linalg.norm(self.positive_centroid)
        )
        neg_sim = np.dot(emb, self.negative_centroid) / (
            np.linalg.norm(emb) * np.linalg.norm(self.negative_centroid)
        )

        # Score: positive if closer to positive centroid
        score = pos_sim - neg_sim
        label = "positive" if score > 0 else "negative"
        confidence = abs(score)

        return label, confidence

# Example usage
analyzer = SentimentAnalyzer(encoder)
for text in ["This product exceeded expectations!", "Complete waste of money"]:
    label, conf = analyzer.analyze(text)
    print(f"'{text[:30]}...' -> {label} ({conf:.2f})")
```
The number in parentheses is the confidence score—the difference between cosine similarities to the positive and negative centroids. These values appear low because general-purpose embeddings capture broad semantics, not just sentiment. The key insight is that the relative scores still correctly distinguish positive from negative text, even when absolute differences are small. For production sentiment analysis, you’d typically fine-tune embeddings on sentiment-labeled data (see Chapter 14).
Tip: Sentiment Analysis Best Practices
Domain matters: Financial sentiment differs from product reviews—use domain-specific anchors (see Chapter 29 for financial sentiment)
Beyond binary: Instead of just positive/negative centroids, create centroids for multiple emotions (joy, anger, sadness, fear, surprise). Measure distance to each and return the closest emotion, or return a distribution across all emotions for nuanced analysis (see the sketch after this list).
Aspect-based: Reviews often mix sentiment across topics (“great battery, terrible screen”). First extract aspects (product features, service elements), then run sentiment analysis on each aspect separately to understand what users love and hate.
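As an illustration of the "beyond binary" idea, here is a minimal sketch that reuses `encoder` and the centroid pattern from the SentimentAnalyzer; the anchor word lists are hypothetical and would be curated per domain in practice:

```python
import numpy as np

# Hypothetical anchor words per emotion; curate these for your domain
emotion_anchors = {
    "joy": ["delighted", "thrilled", "so happy"],
    "anger": ["furious", "outraged", "this is infuriating"],
    "sadness": ["heartbroken", "disappointed", "feeling down"],
    "fear": ["terrified", "worried", "anxious about this"],
}

centroids = {
    emotion: np.mean([encoder.encode(w) for w in words], axis=0)
    for emotion, words in emotion_anchors.items()
}

def emotion_scores(text: str) -> dict:
    """Cosine similarity of the text to each emotion centroid."""
    emb = encoder.encode(text)
    return {
        e: float(np.dot(emb, c) / (np.linalg.norm(emb) * np.linalg.norm(c)))
        for e, c in centroids.items()
    }

scores = emotion_scores("I can't believe they cancelled my order again")
print(max(scores, key=scores.get))  # closest emotion
print(scores)                       # full distribution for nuanced analysis
```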
4.7 Advanced: How Text Embedding Models Learn
Note: Optional Section
This section explains how text embedding models actually learn. Understanding these fundamentals helps you choose the right model and diagnose issues. Skip this if you just need to use embeddings.
4.7.1 Word2Vec: Static Word Embeddings
Word2Vec (Mikolov et al. 2013) revolutionized NLP by showing that simple neural networks could learn rich semantic representations from raw text. The key insight: words appearing in similar contexts should have similar embeddings.
Word2Vec uses a technique called skip-gram: given a target word, predict the words that typically appear nearby. For example, given “cat”, predict that “furry”, “pet”, and “meow” often appear in the same sentences. By training on millions of such predictions, the model learns that “cat” and “dog” should have similar embeddings (both appear near “pet”, “feed”, “vet”) while “cat” and “algebra” should be far apart.
Tip: Key Resources
Word2Vec paper (Mikolov et al., 2013) — introduced Word2Vec and skip-gram
Gensim library — popular Python library for training Word2Vec
Figure 4.3: Skip-gram: given a target word, predict context words within a window. Here, ‘sat’ predicts ‘the’, ‘cat’, ‘on’, ‘the’.
The implementation below shows the core training loop: for each target-context pair from real text, push their embeddings closer together; for random “negative” pairs, push them apart. After enough iterations, similar words cluster in embedding space.
Figure 4.4: Training pushes context words closer to target (green arrows) and negative samples farther away (red arrows).
"""Word2Vec Skip-Gram: Simplified Implementation"""import numpy as np# Training data: sentences we learn word relationships fromcorpus = [["the", "cat", "sat", "on", "the", "mat"], ["the", "dog", "sat", "on", "the", "rug"]]# Vocabulary: unique words extracted from corpusvocab =list(set(word for sentence in corpus for word in sentence))vocab_size =len(vocab)embedding_dim =4# Small for demo; production uses 100-300 dimensionsword_to_idx = {w: i for i, w inenumerate(vocab)}# Initialize two embedding matrices with small random valuesnp.random.seed(42)W_target = np.random.randn(vocab_size, embedding_dim) *0.1W_context = np.random.randn(vocab_size, embedding_dim) *0.1def sigmoid(x):"""Squash any value to range (0, 1). Clip prevents overflow."""return1/ (1+ np.exp(-np.clip(x, -500, 500)))def train_skipgram_pair(target_word, context_word, negative_words, lr=0.1):"""Train on one (target, context) pair with negative sampling. Example: in "the cat sat", target="cat", context="the" or "sat", negatives=["dog", "rug"] (random words not in context). """global W_target, W_context t_idx = word_to_idx[target_word] c_idx = word_to_idx[context_word] target_emb = W_target[t_idx] context_emb = W_context[c_idx]# Positive example: target and context should be similar score = np.dot(target_emb, context_emb) pred = sigmoid(score) W_target[t_idx] -= lr * (pred -1) * context_emb W_context[c_idx] -= lr * (pred -1) * target_emb# Negative examples: push apart words that don't appear together.# Without this, model would collapse all embeddings to the same point.for neg_word in negative_words: n_idx = word_to_idx[neg_word] neg_emb = W_context[n_idx] score = np.dot(target_emb, neg_emb) pred = sigmoid(score) W_target[t_idx] -= lr * pred * neg_emb W_context[n_idx] -= lr * pred * target_emb# Each epoch = one pass through corpus. Multiple passes needed because# each example only nudges embeddings slightly; repetition reinforces patterns.for epoch inrange(50):for sentence in corpus:for i, target inenumerate(sentence):# Context = words within 2 positions of target start =max(0, i -2) end =min(len(sentence), i +3) context_words = [sentence[j] for j inrange(start, end) if j != i]# Negatives = random words not appearing near target negatives = [w for w in vocab if w notin context_words and w != target][:2]for context in context_words: train_skipgram_pair(target, context, negatives)# cosine_similarity = (A · B) / (||A|| × ||B||)def cosine_similarity(w1, w2): v1, v2 = W_target[word_to_idx[w1]], W_target[word_to_idx[w2]]return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))print("Learned similarities:")print(f" cat ↔ dog: {cosine_similarity('cat', 'dog'):.3f}")print(f" cat ↔ mat: {cosine_similarity('cat', 'mat'):.3f}")
Why is cat↔︎mat higher than cat↔︎dog? In our tiny corpus, “cat” and “mat” appear in the same sentence, sharing more context words. Word2Vec learns from co-occurrence patterns in the training data—with millions of sentences, “cat” and “dog” would cluster together as animals, but our two-sentence corpus doesn’t capture that relationship.
4.7.2 BERT: Contextual Token Embeddings
The transformer architecture (Vaswani et al. 2017) and BERT (Devlin et al. 2018) introduced contextual embeddings—the same word gets different representations based on context.
The key innovation is the attention mechanism: when processing a word, the model can “look at” all other words in the sentence to determine meaning. This is fundamentally different from Word2Vec:
Word2Vec: Context matters during training (learning from nearby words), but produces static embeddings. Once trained, “bank” always maps to the same vector—the model can’t tell which meaning you intend.
Transformers: Context matters at inference time. Each time you encode a sentence, the model examines all surrounding words to compute a contextual embedding. “Bank” gets different vectors depending on whether the sentence mentions money or rivers.
Figure 4.5: Attention mechanism: when encoding ‘bank’, the model attends to other words to determine meaning. ‘Money’ and ‘deposited’ signal financial meaning.
Unlike Word2Vec (where we showed training from scratch), we won’t implement transformer training here. Why? Scale. Word2Vec trains in minutes on a laptop with a few sentences. BERT was trained on 3.3 billion words using 64 TPUs for 4 days. The architecture is also far more complex—multi-head attention, layer normalization, positional encodings, and masked language modeling objectives.
Instead, we use pre-trained models. The code below loads a model that’s already learned contextual representations from massive text corpora:
"""Contextual Embeddings: Same Word, Different Meanings"""from sentence_transformers import SentenceTransformerfrom sklearn.metrics.pairwise import cosine_similarity# Load pre-trained transformer modelmodel = SentenceTransformer('all-MiniLM-L6-v2')# Same word "bank" in different contextssentences = ["I deposited money at the bank", # financial-1"The bank approved my loan", # financial-2"We had a picnic on the river bank", # river-1"Fish swim near the bank", # river-2]# Each sentence gets a single embedding capturing its full meaningembeddings = model.encode(sentences)print("Sentence similarity (same meaning = higher score):\n")labels = ["financial-1", "financial-2", "river-1", "river-2"]for i inrange(len(sentences)):for j inrange(i +1, len(sentences)): sim = cosine_similarity([embeddings[i]], [embeddings[j]])[0][0]print(f" {labels[i]:12s} ↔ {labels[j]:12s}: {sim:.3f}")
The financial sentences (0.486) are more similar to each other than to river sentences, and vice versa—but the differences aren’t dramatic. Is this a limitation? It depends on your use case:
For sentence-level tasks (semantic search, document clustering), this is fine. The embeddings correctly distinguish the sentence meanings, just not by a huge margin.
For word-sense disambiguation (detecting that “bank” means different things), you need token-level embeddings instead of sentence embeddings. BERT produces distinct embeddings for “bank” in each context—the pooling step that creates a single sentence embedding averages this information away.
To get the contextual embedding for a specific word, use the underlying BERT model directly and extract the token’s hidden state, rather than using a Sentence Transformer that pools all tokens together.
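A minimal sketch of that approach, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint: encode each sentence, locate the "bank" token, and compare its hidden states across contexts. Exact similarity values will vary; the financial-versus-river gap is the point.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def token_embedding(sentence: str, word: str):
    """Return the contextual hidden state of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index(word)  # assumes `word` maps to a single WordPiece token
    return hidden[idx].numpy()

money_bank = token_embedding("I deposited money at the bank", "bank")
loan_bank = token_embedding("The bank approved my loan", "bank")
river_bank = token_embedding("We had a picnic on the river bank", "bank")

print(f"financial ↔ financial: {cosine_similarity([money_bank], [loan_bank])[0][0]:.3f}")
print(f"financial ↔ river:     {cosine_similarity([money_bank], [river_bank])[0][0]:.3f}")
```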
Why transformers dominate:
Parallelization: All words processed simultaneously. Earlier models (RNNs—Recurrent Neural Networks) processed words one at a time, left to right. This was slow and made training on large datasets impractical.
Long-range dependencies: Attention connects distant words directly. In “The cat that I saw yesterday at the park was sleeping”, attention links “cat” to “sleeping” without processing all words in between.
Transfer learning: Pre-trained models work across many tasks without retraining from scratch.
Scalability: Performance improves predictably with more data and compute.
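4.7.3 Sentence Transformers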
The previous section covered BERT—a powerful architecture that produces token-level embeddings. But throughout this chapter, we’ve been using SentenceTransformer to get sentence-level embeddings. What’s the connection?
Sentence Transformers (Reimers and Gurevych 2019) are BERT models fine-tuned specifically to produce useful sentence embeddings. The problem: raw BERT produces one embedding per token, not per sentence, and the naive fix—averaging all token embeddings—works poorly because raw BERT was never trained to make that average reflect sentence-level meaning; the Sentence-BERT authors found such averages can underperform even static GloVe vectors on similarity tasks.
Sentence Transformers solve this with a siamese architecture: the same BERT encoder (shared weights) processes both sentences in a pair, and a pooling layer (typically mean pooling, sometimes the [CLS] token) turns each sentence's token embeddings into a single vector. Training uses contrastive learning—the model learns to produce similar embeddings for related sentences (paraphrases, question-answer pairs) and dissimilar embeddings for unrelated ones.
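To see what the pooling step does, here is a minimal sketch—assuming Hugging Face transformers and the sentence-transformers/all-MiniLM-L6-v2 checkpoint—of mean pooling token embeddings into one sentence vector, which is essentially what SentenceTransformer.encode does for this model (aside from a final normalization step):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def mean_pooled_embedding(sentence: str) -> torch.Tensor:
    """Average token embeddings, weighted by the attention mask, into one sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        token_embs = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)       # (1, seq_len, 1)
    return (token_embs * mask).sum(dim=1) / mask.sum(dim=1)

emb = mean_pooled_embedding("The quick brown fox jumps over the lazy dog")
print(emb.shape)  # torch.Size([1, 384])
```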
Tip: Key Resources
Sentence-BERT paper (Reimers & Gurevych, 2019) — the original Sentence Transformers paper
Figure 4.6: Sentence Transformer training: a siamese network processes sentence pairs through identical BERT encoders. Paraphrases are pushed together; unrelated sentences are pushed apart.
The key insight: by training on millions of sentence pairs (from datasets like NLI, paraphrase corpora, and question-answer pairs), the model learns to capture semantic similarity—not just surface-level word overlap. This is why Sentence Transformers produce excellent embeddings out of the box, unlike raw BERT which needs task-specific fine-tuning.
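For a flavor of what that pair-based training looks like in code, here is a minimal sketch using the sentence-transformers training utilities (InputExample and MultipleNegativesRankingLoss, which treats the other texts in a batch as negatives); the pairs are toy examples, and Chapter 14 covers training properly:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')

# Toy positive pairs; real training uses millions of these
train_examples = [
    InputExample(texts=["How do I reset my password?", "Steps to change your account password"]),
    InputExample(texts=["The quick brown fox jumps over the lazy dog", "A fast auburn fox leaps above a sleepy canine"]),
    InputExample(texts=["Heavy rain expected this weekend", "Forecast calls for a wet Saturday and Sunday"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=3)

# In-batch negatives: each pair is pulled together, other texts in the batch pushed apart
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)
```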
For a deeper dive into contrastive learning techniques, see Chapter 15. For training your own sentence embeddings, see Chapter 14.
4.8 Key Takeaways
Text embeddings convert words, sentences, or documents into vectors capturing semantic meaning
Similar text → similar vectors: This enables semantic search, clustering, and classification without explicit rules
Popular models range from fast (MiniLM) to high-quality (OpenAI text-embedding-3) depending on your needs
Word2Vec learns from word co-occurrence patterns—words in similar contexts get similar embeddings
Transformers (BERT) create contextual embeddings where the same word gets different vectors based on surrounding context
Sentence Transformers adapt these for producing single embeddings for entire sentences
4.9 Looking Ahead
Now that you understand text embeddings, Chapter 5 explores how similar principles apply to visual data—images and video.
4.10 Further Reading
Mikolov, T., et al. (2013). “Efficient Estimation of Word Representations in Vector Space.” arXiv:1301.3781
Vaswani, A., et al. (2017). “Attention Is All You Need.” NeurIPS
Devlin, J., et al. (2018). “BERT: Pre-training of Deep Bidirectional Transformers.” arXiv:1810.04805
Reimers, N. & Gurevych, I. (2019). “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” arXiv:1908.10084