Measuring embedding quality is deceptively difficult. Unlike classification accuracy or regression error, embedding quality is multidimensional: retrieval performance, representation quality, downstream task accuracy, and user satisfaction all matter but may not correlate. This chapter provides a comprehensive framework for evaluating embeddings across intrinsic quality metrics (isotropy, uniformity, alignment), retrieval metrics (Recall@K, MAP, NDCG, MRR), human evaluation protocols, domain-specific metrics, and statistical rigor. We cover evaluation at trillion-row scale including sampling strategies, efficient computation, and continuous monitoring—enabling you to measure what matters and catch degradation before users notice.
Embedding evaluation differs fundamentally from traditional ML evaluation. A classifier has a clear target: predict the correct label. An embedding has no single correct answer—quality depends on how well the embedding supports downstream tasks, which may not be known at training time. This creates a challenging evaluation landscape requiring multiple complementary metrics, careful experimental design, and statistical rigor.
21.1 Intrinsic Quality Metrics
Intrinsic metrics measure embedding quality without reference to a specific downstream task. They capture properties of the embedding space itself—how well-distributed vectors are, how much of the space is utilized, and whether semantic relationships are preserved. These metrics detect problems even without labeled evaluation data.
21.1.1 Isotropy: Are Embeddings Well-Distributed?
Isotropy measures how uniformly embeddings are distributed across the vector space. Perfectly isotropic embeddings have equal variance in all directions—no dimension dominates, and vectors aren’t clustered in a narrow cone.
Why isotropy matters:
Low isotropy means embeddings cluster in a small region, reducing discriminative power
Highly anisotropic embeddings waste dimensions on directions with little variance
Similarity search becomes unreliable when all vectors are similar to each other
Isotropy measurement implementation:

import torch
import torch.nn.functional as F
import numpy as np


def compute_isotropy(embeddings: torch.Tensor) -> dict:
    """
    Compute isotropy metrics for a set of embeddings.

    Isotropy measures how uniformly embeddings are distributed in the space.
    Perfect isotropy = 1.0, all vectors identical = 0.0

    Args:
        embeddings: Tensor of shape (n_samples, embedding_dim)

    Returns:
        Dictionary with isotropy metrics
    """
    # Center embeddings
    centered = embeddings - embeddings.mean(dim=0)

    # Compute covariance matrix
    n = embeddings.shape[0]
    cov = (centered.T @ centered) / (n - 1)

    # Eigenvalue decomposition
    eigenvalues = torch.linalg.eigvalsh(cov)
    eigenvalues = eigenvalues.clamp(min=1e-10)  # Numerical stability

    # Sort descending
    eigenvalues = eigenvalues.flip(0)

    # Isotropy metrics
    # 1. Partition function isotropy (Mu et al., 2018)
    #    Measures how much the eigenvalue distribution deviates from uniform
    Z = eigenvalues.sum()
    partition_isotropy = (eigenvalues.min() * len(eigenvalues)) / Z

    # 2. Effective dimensionality (participation ratio)
    #    How many dimensions are "active"
    effective_dim = (eigenvalues.sum() ** 2) / (eigenvalues ** 2).sum()

    # 3. Explained variance ratio
    #    What fraction of variance is in top-k dimensions
    total_var = eigenvalues.sum()
    top_10_var = eigenvalues[:10].sum() / total_var
    top_50_var = eigenvalues[:50].sum() / total_var

    # 4. Average cosine similarity (should be ~0 for isotropic)
    normalized = F.normalize(embeddings, dim=1)
    cos_sim_matrix = normalized @ normalized.T
    # Exclude diagonal
    mask = ~torch.eye(n, dtype=torch.bool, device=embeddings.device)
    avg_cos_sim = cos_sim_matrix[mask].mean()

    return {
        "partition_isotropy": partition_isotropy.item(),
        "effective_dimensionality": effective_dim.item(),
        "effective_dim_ratio": effective_dim.item() / embeddings.shape[1],
        "top_10_variance_ratio": top_10_var.item(),
        "top_50_variance_ratio": top_50_var.item(),
        "avg_cosine_similarity": avg_cos_sim.item(),
        "embedding_dim": embeddings.shape[1],
    }


# Example: Compare isotropic vs anisotropic embeddings
torch.manual_seed(42)

# Well-distributed embeddings (more isotropic)
isotropic_emb = torch.randn(1000, 256)

# Poorly distributed (anisotropic - most variance in few dimensions)
anisotropic_emb = torch.randn(1000, 256)
anisotropic_emb[:, :10] *= 10  # First 10 dims dominate

print("Isotropic embeddings:")
iso_metrics = compute_isotropy(isotropic_emb)
print(f"  Partition isotropy: {iso_metrics['partition_isotropy']:.4f}")
print(f"  Effective dim ratio: {iso_metrics['effective_dim_ratio']:.2%}")
print(f"  Avg cosine similarity: {iso_metrics['avg_cosine_similarity']:.4f}")

print("\nAnisotropic embeddings:")
aniso_metrics = compute_isotropy(anisotropic_emb)
print(f"  Partition isotropy: {aniso_metrics['partition_isotropy']:.4f}")
print(f"  Effective dim ratio: {aniso_metrics['effective_dim_ratio']:.2%}")
print(f"  Avg cosine similarity: {aniso_metrics['avg_cosine_similarity']:.4f}")
Healthy ranges:
Partition isotropy > 0.5 (higher is better, max 1.0)
Effective dimensionality > 50% of embedding dimension
Average cosine similarity close to 0 (typically -0.01 to 0.05)
Top 10 dimensions explain < 20% of variance
Warning signs:
Average cosine similarity > 0.3 (vectors too similar)
Effective dimensionality < 20% of embedding dimension
Top 10 dimensions explain > 50% of variance
Note: Some anisotropy is expected and even desirable—it reflects the structure of your data. The key is ensuring useful dimensions aren’t wasted on noise.
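As a simple continuous check, the thresholds above can be wired into a monitoring helper. This is a minimal sketch; the function name and the exact threshold values are taken from the ranges listed above and should be tuned to your own data.

def isotropy_health_check(metrics: dict) -> list:
    """Flag warning signs from compute_isotropy() output (thresholds assumed from the ranges above)."""
    warnings = []
    if metrics["avg_cosine_similarity"] > 0.3:
        warnings.append("Average cosine similarity > 0.3: vectors are too similar to each other")
    if metrics["effective_dim_ratio"] < 0.2:
        warnings.append("Effective dimensionality < 20% of embedding dimension")
    if metrics["top_10_variance_ratio"] > 0.5:
        warnings.append("Top 10 dimensions explain > 50% of variance")
    return warnings


# Example: run against the anisotropic embeddings from the example above
for w in isotropy_health_check(aniso_metrics):
    print("WARNING:", w)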
21.1.2 Uniformity and Alignment
Wang & Isola (2020) introduced uniformity and alignment as complementary metrics for contrastive embeddings:
Alignment: Similar items should have similar embeddings (low distance between positives)
Uniformity: Embeddings should be uniformly distributed on the unit hypersphere (maximize use of space)
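Formally, Wang & Isola (2020) define these two properties as losses over a normalized encoder f mapping onto the unit hypersphere; the implementation below estimates both from sampled pairs, with alpha = 2 and t = 2 as commonly used defaults:
\[\mathcal{L}_{\text{align}} = \mathbb{E}_{(x,y)\sim p_{\text{pos}}}\big[\lVert f(x) - f(y)\rVert_2^{\alpha}\big]\]
\[\mathcal{L}_{\text{uniform}} = \log \, \mathbb{E}_{x,y \sim p_{\text{data}}}\big[e^{-t\,\lVert f(x)-f(y)\rVert_2^{2}}\big]\]
Lower is better for both: low alignment means positive pairs sit close together, and low uniformity means the embeddings spread out over the sphere.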
Uniformity and alignment metrics:

import torch
import torch.nn.functional as F


def compute_alignment(
    embeddings: torch.Tensor,
    positive_pairs: torch.Tensor,
    alpha: float = 2.0,
) -> float:
    """
    Compute alignment metric: expected distance between positive pairs.
    Lower is better - positive pairs should be close.

    Args:
        embeddings: (n_samples, dim) normalized embeddings
        positive_pairs: (n_pairs, 2) indices of positive pairs
        alpha: exponent for distance (default 2 = squared distance)
    """
    emb1 = embeddings[positive_pairs[:, 0]]
    emb2 = embeddings[positive_pairs[:, 1]]

    # Squared L2 distance for normalized vectors = 2 - 2*cos_sim
    distances = (emb1 - emb2).pow(2).sum(dim=1)
    alignment = distances.pow(alpha / 2).mean()
    return alignment.item()


def compute_uniformity(
    embeddings: torch.Tensor,
    t: float = 2.0,
    sample_size: int = 10000,
) -> float:
    """
    Compute uniformity metric: how uniformly distributed embeddings are.
    Lower is better - embeddings should spread across the hypersphere.
    Based on Wang & Isola (2020).

    Args:
        embeddings: (n_samples, dim) normalized embeddings
        t: temperature parameter (default 2)
        sample_size: number of pairs to sample for efficiency
    """
    n = embeddings.shape[0]

    if n * (n - 1) // 2 > sample_size:
        # Sample pairs for efficiency
        idx1 = torch.randint(0, n, (sample_size,))
        idx2 = torch.randint(0, n, (sample_size,))
        # Ensure different indices
        mask = idx1 != idx2
        idx1, idx2 = idx1[mask], idx2[mask]
        emb1, emb2 = embeddings[idx1], embeddings[idx2]
    else:
        # Compute all pairs
        emb1 = embeddings.unsqueeze(1)  # (n, 1, dim)
        emb2 = embeddings.unsqueeze(0)  # (1, n, dim)

    # Squared L2 distance
    sq_distances = (emb1 - emb2).pow(2).sum(dim=-1)

    # Uniformity loss (log of average Gaussian kernel)
    uniformity = torch.log(torch.exp(-t * sq_distances).mean())
    return uniformity.item()


def compute_alignment_uniformity(
    embeddings: torch.Tensor,
    positive_pairs: torch.Tensor = None,
    labels: torch.Tensor = None,
) -> dict:
    """
    Compute both alignment and uniformity metrics.

    Args:
        embeddings: (n_samples, dim) embeddings (will be normalized)
        positive_pairs: (n_pairs, 2) indices of positive pairs, OR
        labels: (n_samples,) class labels to generate positive pairs
    """
    # Normalize embeddings
    embeddings = F.normalize(embeddings, dim=1)

    # Generate positive pairs from labels if needed
    if positive_pairs is None and labels is not None:
        positive_pairs = []
        for label in labels.unique():
            indices = (labels == label).nonzero().squeeze()
            if len(indices) > 1:
                # Sample pairs within class
                for i in range(min(len(indices), 50)):
                    for j in range(i + 1, min(len(indices), 50)):
                        positive_pairs.append([indices[i].item(), indices[j].item()])
        positive_pairs = torch.tensor(positive_pairs)

    alignment = compute_alignment(embeddings, positive_pairs) if positive_pairs is not None else None
    uniformity = compute_uniformity(embeddings)

    return {
        "alignment": alignment,
        "uniformity": uniformity,
        "alignment_uniformity_sum": (alignment + uniformity) if alignment else None,
    }


# Example
torch.manual_seed(42)
embeddings = torch.randn(500, 128)
labels = torch.randint(0, 10, (500,))  # 10 classes

metrics = compute_alignment_uniformity(embeddings, labels=labels)
print(f"Alignment: {metrics['alignment']:.4f} (lower = positive pairs closer)")
print(f"Uniformity: {metrics['uniformity']:.4f} (lower = more spread out)")
21.2 Retrieval Metrics
Retrieval metrics measure how well embeddings support similarity search—the most common downstream task. Understanding when to use each metric and how they differ is crucial for meaningful evaluation.
21.2.1 Recall@K: Did We Find the Relevant Items?
Recall@K measures the fraction of relevant items found in the top K results. It answers: “Of all the things I should find, how many did I actually find?”
\[\text{Recall@K} = \frac{|\text{Relevant items in top K}|}{|\text{Total relevant items}|}\]
Recall@K implementation:

import torch
import numpy as np


def recall_at_k(
    query_embeddings: torch.Tensor,
    corpus_embeddings: torch.Tensor,
    relevance_labels: torch.Tensor,
    k_values: list = [1, 5, 10, 50, 100],
) -> dict:
    """
    Compute Recall@K for embedding retrieval.

    Args:
        query_embeddings: (n_queries, dim)
        corpus_embeddings: (n_corpus, dim)
        relevance_labels: (n_queries, n_corpus) binary relevance matrix
            or (n_queries,) with corpus index of single relevant item
        k_values: list of K values to compute

    Returns:
        Dictionary with Recall@K for each K
    """
    # Compute similarities
    query_norm = query_embeddings / query_embeddings.norm(dim=1, keepdim=True)
    corpus_norm = corpus_embeddings / corpus_embeddings.norm(dim=1, keepdim=True)
    similarities = query_norm @ corpus_norm.T  # (n_queries, n_corpus)

    # Get rankings
    rankings = similarities.argsort(dim=1, descending=True)

    results = {}

    # Handle single relevant item case
    if relevance_labels.dim() == 1:
        for k in k_values:
            top_k = rankings[:, :k]
            hits = (top_k == relevance_labels.unsqueeze(1)).any(dim=1)
            results[f"recall@{k}"] = hits.float().mean().item()
    else:
        # Multiple relevant items case
        for k in k_values:
            top_k = rankings[:, :k]
            recalls = []
            for i in range(len(query_embeddings)):
                relevant = relevance_labels[i].nonzero().squeeze(-1)
                if len(relevant) == 0:
                    continue
                found = (top_k[i].unsqueeze(1) == relevant.unsqueeze(0)).any(dim=1).sum()
                recalls.append(found.item() / len(relevant))
            results[f"recall@{k}"] = np.mean(recalls) if recalls else 0.0

    return results


# Example
torch.manual_seed(42)
n_queries, n_corpus, dim = 100, 10000, 256
queries = torch.randn(n_queries, dim)
corpus = torch.randn(n_corpus, dim)

# Each query has one relevant document
relevant_indices = torch.randint(0, n_corpus, (n_queries,))

results = recall_at_k(queries, corpus, relevant_indices)
for k, v in results.items():
    print(f"{k}: {v:.4f}")
Each retrieval metric fits a different use case (minimal per-query implementations of the metrics beyond Recall@K are sketched after this list):
Recall@K: best when coverage matters and missing a relevant item is costly. Limitation: ignores precision and treats all relevant items equally.
Precision@K: best when false positives are costly (e.g., legal, medical). Limitation: ignores items outside the top K.
MAP: best for comprehensive ranking quality and comparing systems. Limitation: assumes binary relevance.
MRR: best for navigational queries with a single correct answer. Limitation: only considers the first relevant item.
NDCG: best for graded relevance and nuanced quality assessment. Limitation: requires graded judgments and is harder to interpret.
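Only Recall@K is implemented above. As a reference for the other rows of the table, here is a minimal per-query sketch of Precision@K, reciprocal rank (averaged over queries for MRR), average precision (averaged over queries for MAP), and NDCG@K; the function names and the ranked-relevance-list input format are choices made for this illustration.

import numpy as np


def precision_at_k(ranked_relevance: list, k: int) -> float:
    """Fraction of the top-K results that are relevant (binary relevance in ranked order)."""
    return sum(ranked_relevance[:k]) / k


def reciprocal_rank(ranked_relevance: list) -> float:
    """1 / rank of the first relevant result, 0 if none; average over queries to get MRR."""
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_relevance: list, n_relevant: int) -> float:
    """Mean of precision values at each relevant position; average over queries to get MAP."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / n_relevant if n_relevant else 0.0


def ndcg_at_k(graded_relevance: list, k: int) -> float:
    """NDCG with linear gains: gain / log2(rank + 1), normalized by the ideal ordering."""
    gains = np.asarray(graded_relevance[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float((gains * discounts).sum())
    ideal = np.sort(np.asarray(graded_relevance, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[: len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0


# Example: one query with graded judgments (0-4 scale) in ranked order
graded = [3, 0, 2, 0, 1]
binary = [1 if g > 0 else 0 for g in graded]
print(f"Precision@3: {precision_at_k(binary, 3):.2f}")
print(f"Reciprocal rank: {reciprocal_rank(binary):.2f}")
print(f"Average precision: {average_precision(binary, n_relevant=sum(binary)):.2f}")
print(f"NDCG@3: {ndcg_at_k(graded, 3):.2f}")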
Tip: Metric Selection Guidelines
For product search: Use NDCG (users prefer more relevant products) + Recall@100 (coverage)
For document retrieval: Use MAP (comprehensive) + MRR (navigational queries)
For recommendations: Use NDCG@10 (top matters most) + Precision@10 (quality of shown items)
For fraud detection: Use Recall@K (can’t miss fraud) + Precision (avoid alert fatigue)
Always report multiple metrics to get a complete picture.
21.3 Human Evaluation Framework
Automated metrics have limitations. Human evaluation provides ground truth that algorithms can’t capture: subjective relevance, contextual appropriateness, and user satisfaction. This section covers how to collect high-quality human judgments at scale.
21.3.1 Designing Evaluation Tasks
Effective human evaluation requires clear task design:
Human evaluation task framework:

from dataclasses import dataclass
from typing import List, Optional
from enum import Enum


class RelevanceScale(Enum):
    """Standard graded relevance scale (TREC-style)."""
    NOT_RELEVANT = 0   # Completely irrelevant
    MARGINALLY = 1     # Marginally relevant
    FAIRLY = 2         # Fairly relevant
    HIGHLY = 3         # Highly relevant
    PERFECTLY = 4      # Perfect match


@dataclass
class EvaluationTask:
    """A single human evaluation task."""
    task_id: str
    query: str
    candidate: str
    context: Optional[str] = None
    instructions: str = ""

    def to_annotation_format(self) -> dict:
        return {
            "id": self.task_id,
            "query": self.query,
            "candidate": self.candidate,
            "context": self.context,
            "instructions": self.instructions,
            "scale": [s.name for s in RelevanceScale],
        }


@dataclass
class AnnotationGuidelines:
    """Guidelines for human annotators."""
    task_description: str
    relevance_definitions: dict
    examples: List[dict]
    edge_cases: List[str]

    @staticmethod
    def create_search_relevance_guidelines():
        return AnnotationGuidelines(
            task_description="""
            Rate how well each document answers the given query.
            Consider: Does it answer the question? Is the information accurate?
            Would a user be satisfied with this result?
            """,
            relevance_definitions={
                "NOT_RELEVANT": "Document has no useful information for the query",
                "MARGINALLY": "Document is tangentially related but doesn't answer the query",
                "FAIRLY": "Document partially answers the query or provides related info",
                "HIGHLY": "Document substantially answers the query",
                "PERFECTLY": "Document is an ideal answer to the query",
            },
            examples=[
                {
                    "query": "How to make sourdough bread",
                    "document": "Sourdough bread recipe: Mix flour, water, starter...",
                    "rating": "PERFECTLY",
                    "reason": "Direct recipe for the query",
                },
                {
                    "query": "How to make sourdough bread",
                    "document": "The history of bread dates back 10,000 years...",
                    "rating": "NOT_RELEVANT",
                    "reason": "About bread history, not how to make sourdough",
                },
            ],
            edge_cases=[
                "If document is relevant but outdated, rate MARGINALLY",
                "If document answers a related but different question, rate FAIRLY",
                "If unsure between two ratings, choose the lower one",
            ],
        )


class HumanEvaluationPipeline:
    """Pipeline for collecting and analyzing human judgments."""

    def __init__(self, guidelines: AnnotationGuidelines):
        self.guidelines = guidelines
        self.annotations = []
        self.annotator_stats = {}

    def create_task_batch(
        self,
        queries: List[str],
        candidates: List[List[str]],
        n_per_query: int = 10,
    ) -> List[EvaluationTask]:
        """Create a batch of evaluation tasks."""
        tasks = []
        for i, (query, cands) in enumerate(zip(queries, candidates)):
            for j, cand in enumerate(cands[:n_per_query]):
                tasks.append(EvaluationTask(
                    task_id=f"q{i}_c{j}",
                    query=query,
                    candidate=cand,
                    instructions=self.guidelines.task_description,
                ))
        return tasks

    def compute_inter_annotator_agreement(
        self,
        annotations: List[dict],
    ) -> dict:
        """
        Compute inter-annotator agreement metrics.
        Returns exact and pairwise agreement rates over tasks
        that have at least two annotations.
        """
        # Group by task
        task_annotations = {}
        for ann in annotations:
            tid = ann["task_id"]
            if tid not in task_annotations:
                task_annotations[tid] = []
            task_annotations[tid].append(ann["rating"])

        # Only tasks with multiple annotations
        multi = {k: v for k, v in task_annotations.items() if len(v) >= 2}
        if not multi:
            return {"error": "No tasks with multiple annotations"}

        # Simple agreement rate
        agreements = []
        for ratings in multi.values():
            # Check if all annotators agree
            agreements.append(1.0 if len(set(ratings)) == 1 else 0.0)

        # Pairwise agreement
        pairwise = []
        for ratings in multi.values():
            for i in range(len(ratings)):
                for j in range(i + 1, len(ratings)):
                    pairwise.append(1.0 if ratings[i] == ratings[j] else 0.0)

        return {
            "exact_agreement_rate": sum(agreements) / len(agreements),
            "pairwise_agreement_rate": sum(pairwise) / len(pairwise) if pairwise else 0,
            "tasks_with_multiple_annotations": len(multi),
        }


# Example usage
guidelines = AnnotationGuidelines.create_search_relevance_guidelines()
pipeline = HumanEvaluationPipeline(guidelines)

print("Task description:")
print(guidelines.task_description)

print("\nRelevance scale:")
for level, desc in guidelines.relevance_definitions.items():
    print(f"  {level}: {desc}")
Task description:
Rate how well each document answers the given query.
Consider: Does it answer the question? Is the information accurate?
Would a user be satisfied with this result?
Relevance scale:
NOT_RELEVANT: Document has no useful information for the query
MARGINALLY: Document is tangentially related but doesn't answer the query
FAIRLY: Document partially answers the query or provides related info
HIGHLY: Document substantially answers the query
PERFECTLY: Document is an ideal answer to the query
Speed-accuracy trade-off: fast annotators are often less accurate, so monitor time per task alongside annotation quality
Mitigation strategies (a sketch of the gold-standard check follows below):
Include 10-15% gold standard tasks for quality monitoring
Require minimum time per task (e.g., 10 seconds)
Use attention check questions
Collect 3+ annotations per task for agreement filtering
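A minimal sketch of the gold-standard check mentioned above: annotators whose accuracy on seeded gold tasks falls below a threshold are flagged for review. The annotation record layout and the 70% threshold are assumptions made for illustration, not a fixed standard.

def gold_task_accuracy(annotations: list, gold_labels: dict) -> dict:
    """
    Per-annotator accuracy on gold-standard tasks.

    annotations: list of {"annotator_id", "task_id", "rating"} dicts
    gold_labels: {task_id: expected_rating} for the seeded gold tasks
    """
    correct, total = {}, {}
    for ann in annotations:
        tid = ann["task_id"]
        if tid not in gold_labels:
            continue  # not a gold task
        aid = ann["annotator_id"]
        total[aid] = total.get(aid, 0) + 1
        if ann["rating"] == gold_labels[tid]:
            correct[aid] = correct.get(aid, 0) + 1
    return {aid: correct.get(aid, 0) / n for aid, n in total.items()}


def flag_low_quality_annotators(accuracies: dict, min_accuracy: float = 0.7) -> list:
    """Return annotators below the accuracy threshold (0.7 is an example value)."""
    return [aid for aid, acc in accuracies.items() if acc < min_accuracy]


# Example
gold = {"q0_c1": "HIGHLY", "q3_c2": "NOT_RELEVANT"}
annotations = [
    {"annotator_id": "a1", "task_id": "q0_c1", "rating": "HIGHLY"},
    {"annotator_id": "a1", "task_id": "q3_c2", "rating": "NOT_RELEVANT"},
    {"annotator_id": "a2", "task_id": "q0_c1", "rating": "MARGINALLY"},
    {"annotator_id": "a2", "task_id": "q3_c2", "rating": "NOT_RELEVANT"},
]
acc = gold_task_accuracy(annotations, gold)
print(acc)                               # {'a1': 1.0, 'a2': 0.5}
print(flag_low_quality_annotators(acc))  # ['a2']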
21.4 Domain-Specific Metrics
Different applications require specialized metrics. This section covers evaluation frameworks for common embedding use cases.
21.4.1 E-Commerce and Product Search
E-commerce metrics:

import numpy as np
from collections import Counter


class ECommerceMetrics:
    """Evaluation metrics specific to e-commerce search and recommendations."""

    @staticmethod
    def zero_result_rate(queries: list, results: list) -> float:
        """Fraction of queries returning no results."""
        zero_results = sum(1 for r in results if len(r) == 0)
        return zero_results / len(queries)

    @staticmethod
    def query_abandonment_rate(
        queries: list,
        results: list,
        clicks: list,
    ) -> float:
        """Fraction of queries where user didn't click any result."""
        abandoned = sum(
            1 for r, c in zip(results, clicks)
            if len(r) > 0 and len(c) == 0
        )
        queries_with_results = sum(1 for r in results if len(r) > 0)
        return abandoned / queries_with_results if queries_with_results > 0 else 0

    @staticmethod
    def catalog_coverage(
        recommended_items: list,
        total_catalog_size: int,
    ) -> float:
        """Fraction of catalog that appears in recommendations."""
        unique_recommended = len(set(item for items in recommended_items for item in items))
        return unique_recommended / total_catalog_size

    @staticmethod
    def diversity_at_k(results: list, item_categories: dict, k: int = 10) -> float:
        """
        Intra-list diversity: variety of categories in top-K results.
        Higher = more diverse recommendations.
        """
        diversities = []
        for result in results:
            top_k = result[:k]
            categories = [item_categories.get(item, "unknown") for item in top_k]
            unique_categories = len(set(categories))
            diversities.append(unique_categories / k if k > 0 else 0)
        return np.mean(diversities)

    @staticmethod
    def novelty(
        recommendations: list,
        item_popularity: dict,
        k: int = 10,
    ) -> float:
        """
        Novelty: tendency to recommend less popular (long-tail) items.
        Higher = recommending more novel items.
        """
        novelties = []
        max_pop = max(item_popularity.values()) if item_popularity else 1
        for rec in recommendations:
            top_k = rec[:k]
            # Novelty = -log(popularity), normalized
            item_novelties = []
            for item in top_k:
                pop = item_popularity.get(item, 1) / max_pop
                item_novelties.append(-np.log(pop + 1e-10))
            novelties.append(np.mean(item_novelties) if item_novelties else 0)
        return np.mean(novelties)

    @staticmethod
    def revenue_per_search(
        queries: list,
        clicks: list,
        purchases: list,
        item_prices: dict,
    ) -> float:
        """Average revenue generated per search query."""
        total_revenue = sum(
            item_prices.get(item, 0)
            for purchase_list in purchases
            for item in purchase_list
        )
        return total_revenue / len(queries) if queries else 0


# Example
metrics = ECommerceMetrics()

# Simulate search results
queries = ["wireless headphones", "running shoes", "laptop case"]
results = [
    ["item_1", "item_2", "item_3"],
    ["item_4", "item_5"],
    [],  # Zero result query
]

print(f"Zero result rate: {metrics.zero_result_rate(queries, results):.1%}")

# Catalog coverage
all_recommended = [["item_1", "item_2"], ["item_3", "item_4"], ["item_1", "item_5"]]
print(f"Catalog coverage: {metrics.catalog_coverage(all_recommended, 100):.1%}")

# Diversity
item_categories = {f"item_{i}": f"cat_{i % 3}" for i in range(10)}
print(f"Diversity@3: {metrics.diversity_at_k(results[:2], item_categories, k=3):.2f}")
Zero result rate: 33.3%
Catalog coverage: 5.0%
Diversity@3: 0.83
21.4.2 Recommendation Systems
Recommendation metrics:

import numpy as np
from collections import Counter  # needed by beyond_accuracy_report


class RecommendationMetrics:
    """Metrics for evaluating recommendation systems."""

    @staticmethod
    def hit_rate(
        recommendations: list,
        ground_truth: list,
        k: int = 10,
    ) -> float:
        """Fraction of users with at least one relevant item in top-K."""
        hits = 0
        for recs, truth in zip(recommendations, ground_truth):
            if set(recs[:k]) & set(truth):
                hits += 1
        return hits / len(recommendations)

    @staticmethod
    def serendipity(
        recommendations: list,
        user_history: list,
        item_similarity: dict,
        k: int = 10,
    ) -> float:
        """
        Serendipity: relevant recommendations that are unexpected.
        Balances relevance with surprise.
        """
        serendipities = []
        for recs, history in zip(recommendations, user_history):
            top_k = recs[:k]
            rec_serendipity = []
            for rec in top_k:
                # How different is this from user's history?
                min_similarity = min(
                    item_similarity.get((rec, h), item_similarity.get((h, rec), 0.5))
                    for h in history
                ) if history else 1.0
                # Serendipity = 1 - max_similarity (higher when more different)
                rec_serendipity.append(1 - min_similarity)
            serendipities.append(np.mean(rec_serendipity) if rec_serendipity else 0)
        return np.mean(serendipities)

    @staticmethod
    def gini_coefficient(item_recommendation_counts: list) -> float:
        """
        Gini coefficient of recommendation distribution.
        0 = perfect equality (all items recommended equally)
        1 = perfect inequality (one item gets all recommendations)
        Use to detect popularity bias.
        """
        counts = np.array(sorted(item_recommendation_counts))
        n = len(counts)
        index = np.arange(1, n + 1)
        return (2 * np.sum(index * counts) - (n + 1) * np.sum(counts)) / (n * np.sum(counts))

    @staticmethod
    def beyond_accuracy_report(
        recommendations: list,
        item_categories: dict,
        item_popularity: dict,
        k: int = 10,
    ) -> dict:
        """Comprehensive beyond-accuracy metrics report."""
        # Aggregate statistics
        all_recs = [item for rec in recommendations for item in rec[:k]]
        rec_counts = Counter(all_recs)

        # Coverage
        coverage = len(set(all_recs)) / len(item_popularity)

        # Gini (popularity concentration)
        popularity_counts = list(rec_counts.values())
        gini = RecommendationMetrics.gini_coefficient(popularity_counts)

        # Category coverage
        rec_categories = set(item_categories.get(item, "unk") for item in set(all_recs))
        category_coverage = len(rec_categories) / len(set(item_categories.values()))

        # Popularity bias
        avg_pop = np.mean([item_popularity.get(item, 0) for item in all_recs])
        overall_avg_pop = np.mean(list(item_popularity.values()))
        popularity_bias = avg_pop / overall_avg_pop

        return {
            "catalog_coverage": coverage,
            "gini_coefficient": gini,
            "category_coverage": category_coverage,
            "popularity_bias": popularity_bias,  # >1 means biased toward popular
            "unique_items_recommended": len(set(all_recs)),
        }


# Example
metrics = RecommendationMetrics()

# Simulate
recommendations = [
    ["item_1", "item_2", "item_3"],
    ["item_1", "item_4", "item_5"],
    ["item_1", "item_2", "item_6"],
]
ground_truth = [["item_2", "item_7"], ["item_4"], ["item_6", "item_8"]]

print(f"Hit rate@3: {metrics.hit_rate(recommendations, ground_truth, k=3):.2f}")

# Beyond accuracy
item_popularity = {f"item_{i}": 100 - i * 10 for i in range(1, 11)}
item_categories = {f"item_{i}": f"cat_{i % 3}" for i in range(1, 11)}
report = metrics.beyond_accuracy_report(recommendations, item_categories, item_popularity, k=3)
print(f"\nBeyond-accuracy metrics:")
for k, v in report.items():
    print(f"  {k}: {v:.2f}")
21.5 Statistical Rigor
Embedding evaluation requires statistical rigor to draw valid conclusions. This section covers sample size calculation, significance testing, and multiple comparison corrections.
21.5.1 Sample Size and Power Analysis
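The helper below implements the standard normal-approximation sample-size formula for comparing two groups, where the absolute effect δ is the baseline metric times the relative minimum detectable effect, σ² is the metric variance, α is the significance level, and 1 − β is the power:
\[n \;=\; \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\delta^{2}} \quad \text{samples per group}\]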
Power analysis for A/B testing:

import numpy as np
from scipy import stats


def sample_size_for_metric_change(
    baseline_metric: float,
    minimum_detectable_effect: float,
    metric_variance: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """
    Calculate required sample size for detecting a metric change.

    Args:
        baseline_metric: Current metric value (e.g., 0.15 for 15% CTR)
        minimum_detectable_effect: Relative change to detect (e.g., 0.05 for 5% improvement)
        metric_variance: Variance of the metric
        alpha: Significance level (Type I error rate)
        power: Statistical power (1 - Type II error rate)

    Returns:
        Required sample size per group
    """
    effect_size = baseline_metric * minimum_detectable_effect

    z_alpha = stats.norm.ppf(1 - alpha / 2)  # Two-tailed
    z_beta = stats.norm.ppf(power)

    # Sample size formula for two-sample t-test
    n = 2 * ((z_alpha + z_beta) ** 2) * metric_variance / (effect_size ** 2)
    return int(np.ceil(n))


def minimum_detectable_effect(
    sample_size: int,
    baseline_metric: float,
    metric_variance: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> float:
    """Calculate minimum detectable effect given sample size."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    effect = np.sqrt(2 * ((z_alpha + z_beta) ** 2) * metric_variance / sample_size)
    return effect / baseline_metric


# Example: CTR experiment
baseline_ctr = 0.15
ctr_variance = baseline_ctr * (1 - baseline_ctr)  # Bernoulli variance

print("Sample size requirements for CTR experiment:")
for mde in [0.01, 0.02, 0.05, 0.10]:
    n = sample_size_for_metric_change(baseline_ctr, mde, ctr_variance)
    print(f"  Detect {mde:.0%} change: {n:,} samples per group")

print("\nMinimum detectable effect for given sample sizes:")
for n in [1000, 10000, 100000]:
    mde = minimum_detectable_effect(n, baseline_ctr, ctr_variance)
    print(f"  n={n:,}: can detect {mde:.1%} change")
Sample size requirements for CTR experiment:
Detect 1% change: 889,540 samples per group
Detect 2% change: 222,385 samples per group
Detect 5% change: 35,582 samples per group
Detect 10% change: 8,896 samples per group
Minimum detectable effect for given sample sizes:
n=1,000: can detect 29.8% change
n=10,000: can detect 9.4% change
n=100,000: can detect 3.0% change
21.5.2 Confidence Intervals for Metrics
Bootstrap confidence intervals:

import numpy as np


def bootstrap_confidence_interval(
    metric_func,
    data: np.ndarray,
    n_bootstrap: int = 1000,
    confidence: float = 0.95,
) -> dict:
    """
    Compute bootstrap confidence interval for any metric.

    Args:
        metric_func: Function that computes metric from data
        data: Array of data points
        n_bootstrap: Number of bootstrap samples
        confidence: Confidence level
    """
    point_estimate = metric_func(data)

    bootstrap_estimates = []
    for _ in range(n_bootstrap):
        # Sample with replacement
        sample = np.random.choice(data, size=len(data), replace=True)
        bootstrap_estimates.append(metric_func(sample))

    bootstrap_estimates = np.array(bootstrap_estimates)

    alpha = 1 - confidence
    lower = np.percentile(bootstrap_estimates, 100 * alpha / 2)
    upper = np.percentile(bootstrap_estimates, 100 * (1 - alpha / 2))

    return {
        "point_estimate": point_estimate,
        "ci_lower": lower,
        "ci_upper": upper,
        "confidence": confidence,
        "std_error": bootstrap_estimates.std(),
    }


# Example: Confidence interval for recall@10
np.random.seed(42)

# Simulate recall values for 1000 queries
recall_values = np.random.beta(8, 2, 1000)  # Skewed distribution

result = bootstrap_confidence_interval(np.mean, recall_values)
print(f"Recall@10: {result['point_estimate']:.4f}")
print(f"95% CI: [{result['ci_lower']:.4f}, {result['ci_upper']:.4f}]")
print(f"Standard error: {result['std_error']:.4f}")
Recall@10: 0.7956
95% CI: [0.7880, 0.8023]
Standard error: 0.0036
21.5.3 Multiple Testing Correction
When evaluating multiple metrics, the chance of false positives increases. Apply corrections:
Multiple testing correction:

import numpy as np


def bonferroni_correction(p_values: list, alpha: float = 0.05) -> dict:
    """
    Bonferroni correction: most conservative.
    Adjusted alpha = alpha / n_tests
    """
    n_tests = len(p_values)
    adjusted_alpha = alpha / n_tests
    significant = [p < adjusted_alpha for p in p_values]
    return {
        "method": "bonferroni",
        "original_alpha": alpha,
        "adjusted_alpha": adjusted_alpha,
        "significant": significant,
        "n_significant": sum(significant),
    }


def benjamini_hochberg_correction(p_values: list, alpha: float = 0.05) -> dict:
    """
    Benjamini-Hochberg: controls False Discovery Rate.
    Less conservative than Bonferroni, more power.
    """
    n_tests = len(p_values)
    sorted_indices = np.argsort(p_values)
    sorted_p = np.array(p_values)[sorted_indices]

    # BH threshold: p_i <= (i/n) * alpha
    thresholds = [(i + 1) / n_tests * alpha for i in range(n_tests)]

    # Find largest k where p_k <= threshold_k
    significant_sorted = [False] * n_tests
    max_significant = -1
    for i in range(n_tests):
        if sorted_p[i] <= thresholds[i]:
            max_significant = i
    for i in range(max_significant + 1):
        significant_sorted[i] = True

    # Map back to original order
    significant = [False] * n_tests
    for i, orig_idx in enumerate(sorted_indices):
        significant[orig_idx] = significant_sorted[i]

    return {
        "method": "benjamini_hochberg",
        "original_alpha": alpha,
        "significant": significant,
        "n_significant": sum(significant),
    }


# Example: Testing multiple metrics
np.random.seed(42)
p_values = [0.001, 0.02, 0.03, 0.04, 0.06, 0.15, 0.25]
metric_names = ["NDCG@10", "Recall@10", "MRR", "Precision@10", "CTR", "Dwell", "Bounce"]

print("P-values and significance (alpha=0.05):")
print("-" * 50)

bonf = bonferroni_correction(p_values)
bh = benjamini_hochberg_correction(p_values)

for i, (name, p) in enumerate(zip(metric_names, p_values)):
    bonf_sig = "✓" if bonf["significant"][i] else "✗"
    bh_sig = "✓" if bh["significant"][i] else "✗"
    print(f"{name:15} p={p:.3f}  Bonferroni: {bonf_sig}  BH: {bh_sig}")

print("-" * 50)
print(f"Bonferroni significant: {bonf['n_significant']}/{len(p_values)}")
print(f"Benjamini-Hochberg significant: {bh['n_significant']}/{len(p_values)}")
Bonferroni: Use when false positives are very costly (medical, financial decisions). Very conservative—may miss real effects.
Benjamini-Hochberg: Use for exploratory analysis or when some false positives are acceptable. Controls False Discovery Rate rather than family-wise error rate.
No correction: Only when metrics are truly independent and you’re comfortable with inflated Type I error.
Rule of thumb: If you’re making decisions based on results, use correction. If exploring data for hypotheses to test later, correction may be optional.
21.6 Evaluation at Scale
Evaluating embeddings over trillions of items requires efficient sampling and computation strategies.
21.6.1 Stratified Sampling for Large Corpora
Stratified sampling strategy:

import numpy as np
from collections import defaultdict


class StratifiedEvaluationSampler:
    """Efficient stratified sampling for large-scale evaluation."""

    def __init__(self, corpus_size: int, strata_assignments: dict):
        """
        Args:
            corpus_size: Total number of items
            strata_assignments: {stratum_name: [item_indices]}
        """
        self.corpus_size = corpus_size
        self.strata = strata_assignments

    def sample_stratified(
        self,
        total_sample_size: int,
        allocation: str = "proportional",
        min_per_stratum: int = 100,
    ) -> dict:
        """
        Draw stratified sample.

        Args:
            total_sample_size: Total samples to draw
            allocation: 'proportional', 'equal', or 'neyman' (optimal)
            min_per_stratum: Minimum samples per stratum
        """
        stratum_sizes = {k: len(v) for k, v in self.strata.items()}
        n_strata = len(self.strata)

        # Determine allocation
        if allocation == "proportional":
            weights = {k: v / self.corpus_size for k, v in stratum_sizes.items()}
        elif allocation == "equal":
            weights = {k: 1 / n_strata for k in self.strata}
        else:
            raise ValueError(f"Unknown allocation: {allocation}")

        # Allocate samples
        samples_per_stratum = {}
        remaining = total_sample_size - min_per_stratum * n_strata
        for stratum in self.strata:
            base = min_per_stratum
            additional = int(remaining * weights[stratum])
            samples_per_stratum[stratum] = min(base + additional, stratum_sizes[stratum])

        # Draw samples
        sampled_indices = {}
        for stratum, indices in self.strata.items():
            n_sample = samples_per_stratum[stratum]
            sampled_indices[stratum] = np.random.choice(
                indices, size=n_sample, replace=False
            ).tolist()

        return {
            "samples_per_stratum": samples_per_stratum,
            "sampled_indices": sampled_indices,
            "total_sampled": sum(samples_per_stratum.values()),
        }

    def oversample_rare_strata(
        self,
        base_sample: dict,
        rare_strata: list,
        oversample_factor: float = 3.0,
    ) -> dict:
        """Oversample rare but important strata (e.g., tail queries)."""
        enhanced_indices = dict(base_sample["sampled_indices"])

        for stratum in rare_strata:
            if stratum in self.strata:
                current_n = len(enhanced_indices[stratum])
                target_n = min(
                    int(current_n * oversample_factor),
                    len(self.strata[stratum]),
                )
                enhanced_indices[stratum] = np.random.choice(
                    self.strata[stratum], size=target_n, replace=False
                ).tolist()

        return {
            "sampled_indices": enhanced_indices,
            "total_sampled": sum(len(v) for v in enhanced_indices.values()),
        }


# Example
np.random.seed(42)
corpus_size = 10_000_000

# Define strata based on item popularity
strata = {
    "head": list(range(0, 1000)),              # Top 1K items (0.01%)
    "torso": list(range(1000, 100000)),        # Next 99K (1%)
    "tail": list(range(100000, corpus_size)),  # Rest (99%)
}

sampler = StratifiedEvaluationSampler(corpus_size, strata)

# Draw sample
sample = sampler.sample_stratified(total_sample_size=10000)
print("Proportional stratified sample:")
for stratum, n in sample["samples_per_stratum"].items():
    print(f"  {stratum}: {n} samples")

# Oversample tail
enhanced = sampler.oversample_rare_strata(sample, rare_strata=["tail"])
print(f"\nAfter oversampling tail: {enhanced['total_sampled']} total samples")
Proportional stratified sample:
head: 100 samples
torso: 196 samples
tail: 9703 samples
After oversampling tail: 29405 total samples
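To report a corpus-level metric from such a sample, each stratum's estimate can be weighted by its share of the corpus and bootstrapped for a confidence interval, in the spirit of Section 21.5.2. The sketch below assumes a per-item metric (here, binary per-query recall@10 arrays) has already been computed on the sampled items; the helper name and the simulated values are illustrative only.

import numpy as np


def stratified_metric_estimate(
    per_stratum_values: dict,
    stratum_sizes: dict,
    corpus_size: int,
    n_bootstrap: int = 1000,
) -> dict:
    """Corpus-level estimate with a bootstrap CI, weighting each stratum by its population share."""
    weights = {s: n / corpus_size for s, n in stratum_sizes.items()}
    point = sum(weights[s] * np.mean(v) for s, v in per_stratum_values.items())

    estimates = []
    for _ in range(n_bootstrap):
        est = 0.0
        for s, values in per_stratum_values.items():
            # Resample within each stratum, then combine with population weights
            resample = np.random.choice(values, size=len(values), replace=True)
            est += weights[s] * resample.mean()
        estimates.append(est)

    lower, upper = np.percentile(estimates, [2.5, 97.5])
    return {"point_estimate": point, "ci_lower": lower, "ci_upper": upper}


# Example with simulated per-query recall@10 values on the sampled items (values are made up)
np.random.seed(0)
per_stratum = {
    "head": np.random.binomial(1, 0.9, 100).astype(float),
    "torso": np.random.binomial(1, 0.7, 196).astype(float),
    "tail": np.random.binomial(1, 0.4, 9703).astype(float),
}
sizes = {"head": 1_000, "torso": 99_000, "tail": 9_900_000}
print(stratified_metric_estimate(per_stratum, sizes, corpus_size=10_000_000))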
21.7 Key Takeaways
Intrinsic quality metrics (isotropy, uniformity, alignment) detect embedding problems without downstream tasks—monitor them continuously to catch degradation early
Choose retrieval metrics based on your use case: Recall@K for coverage, Precision@K when false positives are costly, NDCG for graded relevance, MRR for navigational queries, MAP for comprehensive ranking quality
Human evaluation provides ground truth that automated metrics cannot capture—design clear tasks, use quality controls, and measure inter-annotator agreement
Domain-specific metrics matter: E-commerce needs zero-result rate and catalog coverage; recommendations need diversity and novelty; fraud detection needs cost-sensitive evaluation
Statistical rigor is essential: Calculate required sample sizes, report confidence intervals, and apply multiple testing corrections when evaluating many metrics
Scale requires smart sampling: Use stratified sampling, oversample rare but important segments, and compute confidence intervals to quantify uncertainty
21.8 Looking Ahead
Chapter 22 shifts focus from evaluation to serving, exploring high-performance vector operations: optimized similarity search algorithms, approximate nearest neighbor (ANN) methods, GPU acceleration for vector operations, memory-mapped storage strategies, and parallel query processing that enables sub-millisecond similarity search across billion-vector indices.
21.9 Further Reading
21.9.1 Intrinsic Quality Metrics
Mu, Jiaqi, et al. (2018). “All-but-the-Top: Simple and Effective Postprocessing for Word Representations.” ICLR.
Wang, Tongzhou, and Phillip Isola (2020). “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere.” ICML.
Ethayarajh, Kawin (2019). “How Contextual are Contextualized Word Representations?” EMNLP.
21.9.2 Retrieval Evaluation
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008). “Introduction to Information Retrieval.” Cambridge University Press. Chapter 8.
Järvelin, Kalervo, and Jaana Kekäläinen (2002). “Cumulated Gain-Based Evaluation of IR Techniques.” ACM TOIS.
Craswell, Nick (2009). “Mean Reciprocal Rank.” Encyclopedia of Database Systems.
21.9.3 Human Evaluation
Voorhees, Ellen M. (2000). “Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness.” Information Processing & Management.
Carterette, Ben (2011). “System Effectiveness, User Models, and User Utility: A Conceptual Framework for Investigation.” SIGIR.
Alonso, Omar, and Stefano Mizzaro (2012). “Using Crowdsourcing for TREC Relevance Assessment.” Information Processing & Management.
21.9.4 Statistical Methods
Sakai, Tetsuya (2014). “Statistical Reform in Information Retrieval?” SIGIR Forum.
Carterette, Ben (2012). “Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments.” ACM TOIS.
Smucker, Mark D., James Allan, and Ben Carterette (2007). “A Comparison of Statistical Significance Tests for Information Retrieval Evaluation.” CIKM.
21.9.5 Beyond-Accuracy Evaluation
Ge, Mouzhi, Carla Delgado-Battenfeld, and Dietmar Jannach (2010). “Beyond Accuracy: Evaluating Recommender Systems by Coverage and Serendipity.” RecSys.
Kaminskas, Marius, and Derek Bridge (2016). “Diversity, Serendipity, Novelty, and Coverage: A Survey and Empirical Analysis of Beyond-Accuracy Objectives in Recommender Systems.” ACM TIST.
Castells, Pablo, Neil J. Hurley, and Saul Vargas (2015). “Novelty and Diversity in Recommender Systems.” Recommender Systems Handbook.