This chapter bridges strategic planning and implementation by answering a critical question: when should you build custom embeddings versus fine-tuning existing models? We explore domain-specific requirements, multi-objective design, dimensionality optimization, and cost-performance trade-offs that determine success at scale.
14.1 When to Build Custom Embeddings vs. Fine-Tune
The decision to build custom embeddings from scratch versus fine-tuning pre-trained models is one of the most consequential choices in your embedding strategy. Make the wrong choice and you’ll either waste months building unnecessary infrastructure or deploy suboptimal models that never reach competitive performance.
14.1.1 The Custom vs. Fine-Tune Spectrum
Most discussions frame this as a binary choice. In reality, it’s a spectrum with five distinct approaches:
Note
The following cost and quality estimates are rough guidelines based on typical projects. Actual results vary significantly based on domain, data quality, team expertise, and specific requirements.
Level 0: Use Pre-trained, Frozen
Description: Use off-the-shelf embeddings (OpenAI, Sentence-BERT) without modification
Effort: Hours
Cost: $0-$1K/month
Quality: 60-70% of optimal for your domain
Best for: Proof-of-concepts, generic use cases, rapid prototyping
Level 1: Prompt Engineering
Description: Optimize prompts for pre-trained models to better capture domain nuances
Effort: Days to weeks
Cost: $1K-$5K/month
Quality: 70-80% of optimal
Best for: Specific queries, instruction-based models, low-budget projects
Level 2: Fine-Tune Last Layers
Description: Fine-tune final layers of pre-trained model on your domain data
Effort: Weeks
Cost: $5K-$25K one-time + ongoing inference
Quality: 80-90% of optimal
Best for: Domain adaptation with limited data (10K-100K examples)
Level 3: Full Model Fine-Tuning
Description: Fine-tune entire pre-trained model on your data
Effort: 1-3 months
Cost: $25K-$150K one-time + ongoing
Quality: 85-95% of optimal
Best for: Substantial domain data (100K-10M examples), clear performance gaps
Level 4: Train From Scratch
Description: Design and train custom architecture for your specific requirements
Effort: 6-18 months
Cost: $500K-$5M+ one-time + ongoing
Quality: 95-100% optimal (when done right)
Best for: Highly specialized domains, massive data (10M+ examples), competitive moat
Tip: The 80/20 Rule
For most organizations, Level 3 (Full Model Fine-Tuning) delivers 95% of the benefit at 20% of the cost compared to training from scratch. Only pursue Level 4 if embeddings are core to your competitive advantage.
14.1.2 Decision Framework: When to Build Custom
Use this framework to determine your approach. For each factor, assess whether your situation favors fine-tuning an existing model or building custom embeddings from scratch:
| Factor | Favors Fine-Tuning | Favors Custom |
| --- | --- | --- |
| Training data | <1M labeled examples | >10M labeled examples |
| Domain gap | Low/medium (medical, financial) | High (genomics, specialized legal, non-text) |
| Performance requirement | "Good enough" for business needs | World-class, no compromises |
| Specialized requirements | Standard text/image | Multi-modal without pre-trained options, tiny models for edge, interpretability |
| Budget | <$150K | >$500K |
| Timeline | <6 months | >12 months |
| Team capability | Limited ML expertise | Published researchers, prior large model experience |
| Competitive advantage | Embeddings support product | Embeddings ARE the product/moat |
How to interpret: If most factors point toward fine-tuning, start with Level 2 or 3. If several factors strongly favor custom (especially domain gap and competitive advantage), consider Level 4.
The hybrid path: When factors are mixed, start with fine-tuning to establish a baseline and prove business value. This de-risks the investment before committing to custom development. Many successful systems follow this pattern—ship a fine-tuned model in months, then build custom after validating the opportunity.
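To make the framework concrete, here is a minimal scoring sketch. The thresholds, weighting, and factor keys are illustrative assumptions, not part of the framework itself:

def recommend_approach(factors):
    """factors maps each framework factor to 'fine_tune' or 'custom',
    judged against the table above. Thresholds here are illustrative."""
    custom_votes = sum(1 for v in factors.values() if v == "custom")
    # Domain gap and competitive advantage are the strongest signals for custom
    strongly_custom = (factors.get("domain_gap") == "custom"
                       and factors.get("competitive_advantage") == "custom")
    if strongly_custom and custom_votes > len(factors) / 2:
        return "Consider Level 4 (train from scratch)"
    if custom_votes >= len(factors) / 2:
        return "Hybrid path: fine-tune now (Level 2-3), revisit custom later"
    return "Start with Level 2 or 3 (fine-tuning)"

# Example: strong team, but every other factor favors fine-tuning
factors = {
    "training_data": "fine_tune", "domain_gap": "fine_tune",
    "performance": "fine_tune", "specialized": "fine_tune",
    "budget": "fine_tune", "timeline": "fine_tune",
    "team": "custom", "competitive_advantage": "fine_tune",
}
print(recommend_approach(factors))  # Start with Level 2 or 3 (fine-tuning)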
14.1.3 Illustrative Case Studies
Note
The following case studies are hypothetical examples designed to illustrate decision-making patterns. While based on realistic scenarios and typical project parameters, they are not descriptions of specific real-world implementations.
Case Study 1: Medical Literature Search (Fine-Tuning Win)
Consider a medical research platform weighing whether to train custom embeddings for biomedical literature. Its profile:
500K labeled medical article pairs
Medium domain gap (medical terminology specialized but well-covered in pre-training)
With under 1M examples and only a medium domain gap, the decision framework points to fine-tuning (Level 2-3) rather than building from scratch.
Case Study 2: E-commerce Product Search (Hybrid Path)
A contrasting scenario: an e-commerce platform that fine-tunes CLIP for product search first, then builds a custom multi-modal model after proving value. Such a custom model:
Could achieve an additional ~15% improvement over fine-tuned CLIP
Could enable category-aware search and better handling of product attributes
Key Lesson: A hybrid approach can de-risk investment. Fine-tuning provides fast wins; custom models deliver competitive advantage after proving value.
14.1.4 The Fine-Tuning Recipe
When fine-tuning is the right choice, follow this battle-tested recipe:
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader


class EmbeddingFineTuner:
    """Production-ready fine-tuning for sentence embeddings"""

    def __init__(self, base_model_name="all-mpnet-base-v2"):
        self.model = SentenceTransformer(base_model_name)
        self.base_model_name = base_model_name

    def prepare_training_data(self, examples):
        """Prepare training data (query, positive, optional negative)"""
        train_examples = []
        for ex in examples:
            if "negative" in ex:
                train_examples.append(
                    InputExample(texts=[ex["query"], ex["positive"], ex["negative"]])
                )
            else:
                train_examples.append(
                    InputExample(texts=[ex["query"], ex["positive"]], label=1.0)
                )
        return DataLoader(train_examples, shuffle=True, batch_size=16)

    def fine_tune(self, train_dataloader, num_epochs=3,
                  loss_function="cosine", warmup_steps=100):
        """Fine-tune with cosine, triplet, or contrastive loss"""
        if loss_function == "cosine":
            train_loss = losses.CosineSimilarityLoss(self.model)
        elif loss_function == "triplet":
            train_loss = losses.TripletLoss(model=self.model, triplet_margin=0.5)
        elif loss_function == "contrastive":
            train_loss = losses.ContrastiveLoss(self.model)
        self.model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            epochs=num_epochs,
            warmup_steps=warmup_steps,
            optimizer_params={"lr": 2e-5},
            show_progress_bar=True,
        )

    def save_model(self, output_path):
        self.model.save(output_path)


# Usage example
training_data = [
    {
        "query": "comfortable running shoes",
        "positive": "Nike Air Zoom - cushioning for running",
        "negative": "Nike Basketball Shoes - high-top for court",
    },
]
finetuner = EmbeddingFineTuner(base_model_name="all-mpnet-base-v2")
print(f"Fine-tuner initialized with model: {finetuner.base_model_name}")
Fine-tuner initialized with model: all-mpnet-base-v2
Important: Fine-Tuning Pitfalls
Common mistakes that tank fine-tuning performance:
1. Insufficient data: need 10K+ examples minimum, 100K+ for best results
2. Poor negative sampling: random negatives too easy; model doesn't learn distinction (see the hard-negative mining sketch below)
3. Catastrophic forgetting: fine-tuning destroys general capabilities; use lower learning rates
4. Overfitting to training distribution: test on out-of-distribution examples
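Pitfall 2 deserves special attention. A standard mitigation is hard-negative mining: use the current model to surface near-miss candidates that look similar to the query but are not relevant, then train on those. A minimal sketch using sentence-transformers; the corpus and examples here are placeholders:

import numpy as np
from sentence_transformers import SentenceTransformer

def mine_hard_negatives(model, queries, positives, corpus, top_k=5):
    """For each query, return the most similar corpus items that are NOT its positive."""
    corpus_emb = model.encode(corpus, normalize_embeddings=True)
    query_emb = model.encode(queries, normalize_embeddings=True)
    sims = query_emb @ corpus_emb.T  # cosine similarity (unit-normalized)
    hard_negatives = []
    for i in range(len(queries)):
        ranked = np.argsort(-sims[i])  # most similar first
        negatives = [corpus[j] for j in ranked if corpus[j] != positives[i]]
        hard_negatives.append(negatives[:top_k])
    return hard_negatives

model = SentenceTransformer("all-mpnet-base-v2")
corpus = ["Nike Air Zoom - cushioning for running",
          "Nike Basketball Shoes - high-top for court",
          "Trail running shoes with aggressive grip"]
negs = mine_hard_negatives(model, ["comfortable running shoes"],
                           ["Nike Air Zoom - cushioning for running"], corpus)
print(negs[0])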
14.2 Domain-Specific Embedding Requirements
Generic embeddings optimize for average performance across diverse tasks. Domain-specific embeddings optimize for your specific requirements. Understanding and articulating these requirements is critical for successful custom embedding development.
14.2.1 Taxonomy of Domain-Specific Requirements
1. Semantic Granularity
How fine-grained must similarity be?
class SemanticGranularity:
    """Examples of semantic granularity requirements across domains"""

    COARSE = {
        'name': 'Coarse-grained',
        'example': 'News article categorization',
        'requirement': 'Distinguish broad topics (sports vs. politics vs. technology)',
        'embedding_dim': '128-256 sufficient',
        'training_data': '10K-100K examples',
    }

    MEDIUM = {
        'name': 'Medium-grained',
        'example': 'E-commerce product search',
        'requirement': 'Distinguish product types and attributes (running shoes vs. hiking boots)',
        'embedding_dim': '256-512 recommended',
        'training_data': '100K-1M examples',
    }

    FINE = {
        'name': 'Fine-grained',
        'example': 'Legal document retrieval',
        'requirement': 'Distinguish subtle legal distinctions (contract types, precedent applicability)',
        'embedding_dim': '512-768 recommended',
        'training_data': '1M-10M examples',
    }

    ULTRA_FINE = {
        'name': 'Ultra-fine',
        'example': 'Molecular drug discovery',
        'requirement': 'Distinguish molecules with minor structural differences that dramatically affect properties',
        'embedding_dim': '768-1024+ required',
        'training_data': '10M+ examples or sophisticated augmentation',
    }
The Granularity-Dimension Relationship: Finer semantic distinctions require higher-dimensional embeddings. You cannot reliably distinguish 10,000 fine-grained categories in 128 dimensions—the information simply doesn’t fit.
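To see this capacity limit directly, consider a toy simulation (hypothetical, for intuition only): scatter N unit-norm category centroids at random in d dimensions, add noise of roughly constant magnitude, and measure nearest-centroid accuracy. In low dimensions the centroids crowd together and classification degrades; in higher dimensions random centroids are nearly orthogonal and easy to separate:

import numpy as np

def nearest_centroid_accuracy(num_categories, dim, noise=0.5, samples=1_000, seed=0):
    """Toy check: can `dim` dimensions separate `num_categories` categories?"""
    rng = np.random.default_rng(seed)
    centroids = rng.normal(size=(num_categories, dim))
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)  # unit norm
    labels = rng.integers(0, num_categories, size=samples)
    # Noise scaled so its expected norm stays ~`noise` regardless of dim
    points = centroids[labels] + noise * rng.normal(size=(samples, dim)) / np.sqrt(dim)
    preds = np.argmax(points @ centroids.T, axis=1)  # nearest centroid by dot product
    return float(np.mean(preds == labels))

for dim in [32, 128, 512]:
    acc = nearest_centroid_accuracy(num_categories=10_000, dim=dim)
    print(f"dim={dim}: nearest-centroid accuracy ≈ {acc:.2f}")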
2. Asymmetric Similarity
Are similarities symmetric or asymmetric?
class AsymmetricSimilarity:
    """Handle asymmetric similarity (query → document differs from document → query)"""

    def __init__(self, embedding_dim=512):
        # QueryEncoder and DocumentEncoder are illustrative encoder models
        # (defined elsewhere)
        self.query_encoder = QueryEncoder(embedding_dim)
        self.document_encoder = DocumentEncoder(embedding_dim)

    def encode_query(self, query_text):
        """Encode query with query-specific model.
        Queries are typically short, focused, and incomplete."""
        return self.query_encoder.encode(query_text)

    def encode_document(self, document_text):
        """Encode document with document-specific model.
        Documents are longer, complete, and information-rich."""
        return self.document_encoder.encode(document_text)

    def similarity(self, query_embedding, document_embedding):
        """Asymmetric similarity: query → document"""
        # In an asymmetric setup, similarity is directional:
        # "running shoes" → "Nike Air Zoom Pegasus..." (HIGH similarity)
        # "Nike Air Zoom Pegasus..." → "running shoes" (LOWER similarity - too specific)
        return cosine_similarity(query_embedding, document_embedding)


# Use cases requiring asymmetric similarity:
asymmetric_use_cases = [
    {
        'domain': 'Question Answering',
        'query': 'Short question',
        'target': 'Long passage with answer',
        'asymmetry': 'Question seeks answer; answer does not seek question',
    },
    {
        'domain': 'Web Search',
        'query': '2-5 keywords',
        'target': 'Full web page content',
        'asymmetry': 'Query is intent; document is content',
    },
    {
        'domain': 'Image Search',
        'query': 'Text description',
        'target': 'Image',
        'asymmetry': 'Cross-modal: text → image different from image → text',
    },
    {
        'domain': 'Recommendation',
        'query': 'User behavior history',
        'target': 'Product catalog',
        'asymmetry': 'User history implies preferences; products have features',
    },
]
Why Asymmetric Matters: Using symmetric embeddings (same encoder for queries and documents) for asymmetric tasks leaves performance on the table. Specialized encoders can optimize for each side’s characteristics.
3. Multi-Faceted Similarity
Do items have multiple aspects of similarity?
class MultiFacetedEmbeddings:
    """Represent multiple facets of similarity in separate embedding spaces"""

    def __init__(self):
        # E-commerce example: products similar in different ways
        self.visual_encoder = VisualEncoder()          # Visual appearance
        self.functional_encoder = FunctionalEncoder()  # Use case/function
        self.attribute_encoder = AttributeEncoder()    # Specific attributes (brand, price, etc.)

    def encode_product(self, product):
        """Encode product with multiple faceted embeddings"""
        return {
            'visual': self.visual_encoder.encode(product.images),
            'functional': self.functional_encoder.encode(product.description),
            'attributes': self.attribute_encoder.encode({
                'brand': product.brand,
                'price_tier': self.discretize_price(product.price),
                'category': product.category,
            }),
        }

    def multi_faceted_search(self, query, facet_weights=None):
        """Search using multiple facets with different weights"""
        if facet_weights is None:
            facet_weights = {'visual': 0.4, 'functional': 0.4, 'attributes': 0.2}

        # Encode query (may not have all facets)
        query_embs = self.encode_query(query)

        # Search each facet independently
        results_by_facet = {}
        for facet in query_embs:
            results_by_facet[facet] = self.search_facet(
                query_embs[facet],
                facet_index=getattr(self, f'{facet}_index')
            )

        # Combine results with weighted fusion
        final_results = self.fuse_facet_results(
            results_by_facet,
            weights=facet_weights
        )
        return final_results
Multi-Faceted Use Cases:
E-commerce: Visual similarity (looks like), functional similarity (used for same purpose), price similarity
4. Temporal Dynamics
Does similarity change over time?

class TemporalEmbeddings:
    """Handle time-varying embeddings"""

    def __init__(self, embedding_dim=512, time_encoding_dim=64):
        self.static_encoder = StaticEncoder(embedding_dim - time_encoding_dim)
        self.time_encoder = TimeEncoder(time_encoding_dim)
        self.embedding_dim = embedding_dim

    def encode_with_time(self, content, timestamp):
        """Encode content with temporal context"""
        # Static content embedding
        static_emb = self.static_encoder.encode(content)

        # Time encoding (positional encoding or learned)
        time_emb = self.time_encoder.encode(timestamp)

        # Concatenate
        temporal_emb = torch.cat([static_emb, time_emb], dim=-1)
        return temporal_emb

    def time_decayed_similarity(self, query_time, document_time, document_emb):
        """Adjust similarity based on temporal distance"""
        time_diff_days = abs((query_time - document_time).days)

        # Exponential decay: more recent = more relevant (180-day time constant)
        decay_factor = np.exp(-time_diff_days / 180)
        return document_emb * decay_factor


# Domains requiring temporal awareness:
temporal_use_cases = [
    {
        'domain': 'News Search',
        'requirement': 'Recent articles more relevant for most queries',
        'approach': 'Time decay on similarity scores',
    },
    {
        'domain': 'Social Media',
        'requirement': 'Trending topics change rapidly',
        'approach': 'Short-window embeddings, frequent retraining',
    },
    {
        'domain': 'Fashion/Trends',
        'requirement': 'Style similarity depends on current trends',
        'approach': 'Time-conditioned embeddings, seasonal retraining',
    },
    {
        'domain': 'Scientific Research',
        'requirement': 'Paradigm shifts change what\'s similar',
        'approach': 'Period-specific embeddings (pre/post major discoveries)',
    },
]
5. Hierarchical Structure
Do your items have natural hierarchies?
class HierarchicalEmbeddings:
    """Preserve hierarchical structure in embedding space"""

    def __init__(self):
        self.level_encoders = {
            'category': Encoder(dim=256),     # Coarse level
            'subcategory': Encoder(dim=512),  # Medium level
            'product': Encoder(dim=768),      # Fine level
        }

    def encode_hierarchical(self, item, level='product'):
        """Encode at different hierarchy levels

        Example:
            Category: "Electronics"
            Subcategory: "Smartphones"
            Product: "iPhone 15 Pro Max 256GB"
        """
        embeddings = {}

        # Encode at each level in hierarchy
        for level_name in ['category', 'subcategory', 'product']:
            if level_name in item:
                embeddings[level_name] = self.level_encoders[level_name].encode(
                    item[level_name]
                )
            # Stop at requested level
            if level_name == level:
                break
        return embeddings

    def hierarchical_search(self, query, level='product'):
        """Search at appropriate hierarchy level

        Coarse queries ("electronics") match at category level.
        Fine queries ("iphone 15 pro max") match at product level.
        """
        # Classify query specificity
        query_level = self.infer_query_level(query)

        # Encode at appropriate level
        query_emb = self.level_encoders[query_level].encode(query)

        # Search at that level
        results = self.search_at_level(query_emb, level=query_level)
        return results
14.2.2 Domain-Specific Training Objectives
Different domains require different training objectives:
import torch
import torch.nn.functional as F


class DomainSpecificObjectives:
    """Domain-specific training objectives beyond standard contrastive learning"""

    def ranking_loss(self, query_emb, doc_embs, relevance_labels):
        """Ranking loss: learn to order documents by relevance"""
        scores = torch.matmul(query_emb, doc_embs.T)
        loss = 0
        for i in range(len(doc_embs)):
            for j in range(i + 1, len(doc_embs)):
                if relevance_labels[i] > relevance_labels[j]:
                    loss += torch.clamp(1.0 - (scores[i] - scores[j]), min=0.0)
        return loss / (len(doc_embs) * (len(doc_embs) - 1) / 2)

    def attribute_preservation_loss(self, embedding, attributes):
        """Ensure embeddings preserve important attributes (category, brand, price)"""
        losses = []
        for attr_name, attr_value in attributes.items():
            attr_classifier = self.attribute_classifiers[attr_name]
            pred = attr_classifier(embedding)
            loss = F.cross_entropy(pred, attr_value)
            losses.append(loss)
        return sum(losses)

    def diversity_loss(self, embeddings):
        """Encourage embedding diversity (avoid collapse)"""
        pairwise_sim = torch.matmul(embeddings, embeddings.T)
        mask = ~torch.eye(len(embeddings), dtype=torch.bool)
        return pairwise_sim[mask].mean()


# Usage example
objectives = DomainSpecificObjectives()
print("Domain objectives: ranking, attribute preservation, diversity, cross-domain alignment")
14.3 Multi-Objective Embedding Design
Most real-world embedding systems must optimize for multiple objectives simultaneously. Single-objective optimization leaves performance on the table.
14.3.1 The Multi-Objective Challenge
Consider an e-commerce search system. The embedding should balance:
1. Semantic relevance: match customer intent
2. Attribute accuracy: preserve product attributes (category, brand, price)
3. Personalization: adapt to user preferences
4. Business metrics: optimize for conversion and revenue, not just clicks
5. Diversity: avoid filter bubbles; show variety
Optimizing for one objective often degrades others. Multi-objective design balances these trade-offs.
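One common pattern for balancing these objectives is a shared encoder with task-specific heads, trained on a weighted sum of per-objective losses. The PyTorch sketch below is illustrative; the dimensions, loss weights, and class names are assumptions, not a prescribed architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiObjectiveEmbedder(nn.Module):
    """Shared encoder; task-specific heads hang off the shared embedding."""

    def __init__(self, input_dim=768, embed_dim=256, num_categories=1_000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )
        self.category_head = nn.Linear(embed_dim, num_categories)  # attribute objective

    def forward(self, x):
        return self.encoder(x)

def combined_loss(model, anchor, positive, category_labels, weights=None):
    """Weighted sum of a relevance loss and an attribute-preservation loss."""
    if weights is None:
        weights = {"relevance": 1.0, "attributes": 0.3}
    emb_a, emb_p = model(anchor), model(positive)
    # Relevance: pull matched (anchor, positive) pairs together
    relevance = 1.0 - F.cosine_similarity(emb_a, emb_p).mean()
    # Attribute preservation: the embedding should still predict category
    attributes = F.cross_entropy(model.category_head(emb_a), category_labels)
    return weights["relevance"] * relevance + weights["attributes"] * attributes

# Example batch (random stand-ins for pre-computed text features)
model = MultiObjectiveEmbedder()
anchor, positive = torch.randn(8, 768), torch.randn(8, 768)
labels = torch.randint(0, 1_000, (8,))
print(combined_loss(model, anchor, positive, labels))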
Multi-objective optimization involves trade-offs. Visualize and navigate the Pareto frontier:
class MultiObjectiveOptimization:
    """Navigate trade-offs between multiple objectives"""

    def compute_pareto_frontier(self, models, test_data):
        """Compute Pareto frontier across objectives"""
        evaluations = []
        for model in models:
            metrics = {
                "model": model,
                "relevance": self.evaluate_relevance(model, test_data),
                "diversity": self.evaluate_diversity(model, test_data),
                "personalization": self.evaluate_personalization(model, test_data),
                "business_metrics": self.evaluate_business(model, test_data),
            }
            evaluations.append(metrics)

        # Find Pareto-optimal models (not dominated by any other)
        pareto_optimal = []
        for eval_i in evaluations:
            dominated = False
            for eval_j in evaluations:
                if eval_i != eval_j and self.dominates(eval_j, eval_i):
                    dominated = True
                    break
            if not dominated:
                pareto_optimal.append(eval_i)
        return pareto_optimal

    def dominates(self, eval_a, eval_b):
        """Check if eval_a dominates eval_b (at least as good on every
        objective, strictly better on at least one)"""
        objectives = ["relevance", "diversity", "personalization", "business_metrics"]
        better_on_at_least_one = False
        for obj in objectives:
            if eval_a[obj] < eval_b[obj]:
                return False
            if eval_a[obj] > eval_b[obj]:
                better_on_at_least_one = True
        return better_on_at_least_one

    def select_operating_point(self, pareto_frontier, business_priorities):
        """Select model from Pareto frontier based on business priorities"""
        best_model, best_score = None, -float("inf")
        for eval_point in pareto_frontier:
            weighted_score = sum(
                business_priorities.get(obj, 0) * eval_point[obj]
                for obj in ["relevance", "diversity", "personalization", "business_metrics"]
            )
            if weighted_score > best_score:
                best_score, best_model = weighted_score, eval_point["model"]
        return best_model


# Usage example
optimizer = MultiObjectiveOptimization()
print("Multi-objective: relevance, diversity, personalization, business metrics")
Multi-objective: relevance, diversity, personalization, business metrics
14.4 Embedding Dimensionality Optimization
Embedding dimensionality has profound impacts on performance, cost, and latency. Too low: information loss. Too high: computational waste and overfitting. Finding the optimal dimensionality is critical for production systems.
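A practical way to locate the knee of the quality-dimension curve is to project full-dimension embeddings down with PCA and measure retrieval quality at each candidate size. The sketch below is a minimal illustration; it assumes you already have full embeddings and one relevant document index per query, and the synthetic demo data exists purely for demonstration:

import numpy as np

def recall_at_k_by_dimension(doc_embs, query_embs, relevant_idx,
                             dims=(64, 128, 256, 512), k=10):
    """Recall@k after PCA projection to each candidate dimension."""
    mean = doc_embs.mean(axis=0)
    # Principal components from centered document embeddings
    _, _, components = np.linalg.svd(doc_embs - mean, full_matrices=False)
    results = {}
    for d in dims:
        proj = components[:d].T                     # (full_dim, d)
        docs = (doc_embs - mean) @ proj
        queries = (query_embs - mean) @ proj
        # Normalize for cosine similarity
        docs /= np.linalg.norm(docs, axis=1, keepdims=True)
        queries /= np.linalg.norm(queries, axis=1, keepdims=True)
        topk = np.argsort(-(queries @ docs.T), axis=1)[:, :k]
        results[d] = float(np.mean([relevant_idx[i] in topk[i]
                                    for i in range(len(queries))]))
    return results

# Synthetic demo: 5K docs, 200 queries perturbed from their relevant docs
rng = np.random.default_rng(0)
docs = rng.normal(size=(5_000, 768))
rel = rng.integers(0, 5_000, size=200)
queries = docs[rel] + 0.3 * rng.normal(size=(200, 768))
print(recall_at_k_by_dimension(docs, queries, rel))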
14.5 Cost-Performance Trade-Offs at Scale
At trillion-row scale, the cost-performance trade-off becomes the dominant factor in embedding design. This section provides frameworks for optimizing this trade-off.
14.5.1 Total Cost of Ownership (TCO) Model
class EmbeddingTCO:
    """Comprehensive TCO model for embedding systems"""

    def __init__(self):
        # Cloud pricing (approximate, as of 2024)
        self.storage_cost_per_gb_month = 0.023   # S3 standard
        self.compute_cost_per_hour = 3.0         # A100 GPU
        self.inference_cost_per_million = 10.0   # Vector DB queries

    def calculate_tco(self, config, duration_years=3):
        """Calculate total cost of ownership

        Args:
            config: {
                'num_embeddings': 100_000_000_000,
                'embedding_dim': 768,
                'qps': 10_000,
                'training_frequency_per_year': 4,
                'team_size': 10
            }
        """
        # Component 1: Storage
        storage_cost = self.compute_storage_cost(
            config['num_embeddings'], config['embedding_dim'], duration_years
        )
        # Component 2: Training
        training_cost = self.compute_training_cost(
            config['num_embeddings'], config['training_frequency_per_year'], duration_years
        )
        # Component 3: Inference
        inference_cost = self.compute_inference_cost(config['qps'], duration_years)
        # Component 4: Engineering team
        team_cost = self.compute_team_cost(config['team_size'], duration_years)

        total_cost = storage_cost + training_cost + inference_cost + team_cost
        return {
            'total_cost_3_years': total_cost,
            'annual_cost': total_cost / duration_years,
            'breakdown': {
                'storage': storage_cost,
                'training': training_cost,
                'inference': inference_cost,
                'team': team_cost,
            },
            'cost_per_embedding': total_cost / config['num_embeddings'],
            'cost_per_million_queries': inference_cost / (
                config['qps'] * 60 * 60 * 24 * 365 * duration_years / 1_000_000
            ),
        }

    def compute_storage_cost(self, num_embeddings, dim, duration_years):
        """Storage cost with replication and indexing overhead"""
        bytes_per_embedding = dim * 4  # float32
        total_bytes = num_embeddings * bytes_per_embedding
        indexed_bytes = total_bytes * 1.5      # Index overhead (HNSW adds ~50%)
        replicated_bytes = indexed_bytes * 3   # Replication (3x for availability)
        total_gb = replicated_bytes / (1024 ** 3)
        monthly_cost = total_gb * self.storage_cost_per_gb_month
        return monthly_cost * 12 * duration_years

    # compute_training_cost, compute_inference_cost, compute_team_cost, and
    # estimate_quality follow the same pattern and are omitted here for brevity.

    def optimize_for_budget(self, requirements, budget_annual):
        """Given requirements and budget, find optimal configuration"""
        # Requirements: {'num_embeddings', 'qps', 'min_quality'}
        # Budget: annual spending limit

        # Explore dimension options
        dimensions = [128, 256, 384, 512, 768]
        configs = []
        for dim in dimensions:
            config = {
                'num_embeddings': requirements['num_embeddings'],
                'embedding_dim': dim,
                'qps': requirements['qps'],
                'training_frequency_per_year': 4,
                'team_size': 10,
            }
            tco = self.calculate_tco(config, duration_years=1)
            # Estimate quality (simplified)
            quality_score = self.estimate_quality(dim, requirements)
            configs.append({
                'dimension': dim,
                'annual_cost': tco['annual_cost'],
                'quality_score': quality_score,
                'within_budget': tco['annual_cost'] <= budget_annual,
            })

        # Filter to budget
        viable = [c for c in configs if c['within_budget']]
        if not viable:
            return {
                'recommendation': 'INSUFFICIENT_BUDGET',
                'message': f"Minimum cost: ${min(c['annual_cost'] for c in configs):,.0f}/year",
            }

        # Choose highest quality within budget
        best = max(viable, key=lambda c: c['quality_score'])
        return {
            'recommendation': 'OPTIMAL_CONFIG',
            'dimension': best['dimension'],
            'annual_cost': best['annual_cost'],
            'quality_score': best['quality_score'],
            'configurations_evaluated': configs,
        }
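As a sanity check on the storage component (the only cost helper shown in full above), a hypothetical run for 100B embeddings at 768 dimensions:

# Storage only: 100B × 768 dims × 4 bytes ≈ 307 TB raw,
# ×1.5 index overhead, ×3 replication ≈ 1.38 PB stored
tco = EmbeddingTCO()
storage_3yr = tco.compute_storage_cost(
    num_embeddings=100_000_000_000, dim=768, duration_years=3
)
print(f"Storage over 3 years: ${storage_3yr:,.0f}")  # ≈ $1.07M (~$355K/year)

Under this model, storage is a small fraction of total TCO; the inference, training, and team components (omitted above) dominate at high query volume.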
14.5.2 Performance-Cost Pareto Frontier
Navigate the trade-off space:
class CostPerformanceFrontier:
    """Explore cost-performance trade-offs"""

    def generate_configuration_space(self, requirements):
        """Generate configurations spanning the cost-performance space"""
        configs = []
        dimensions = [128, 256, 384, 512, 768, 1024]
        quantizations = ["float32", "float16", "int8", "binary"]
        index_types = ["flat", "ivf", "hnsw", "pq"]

        for dim in dimensions:
            for quant in quantizations:
                for index in index_types:
                    config = {
                        "dimension": dim,
                        "quantization": quant,
                        "index_type": index,
                        "num_embeddings": requirements["num_embeddings"],
                    }
                    # estimate_cost / estimate_performance are domain-specific
                    # estimators (omitted here)
                    cost = self.estimate_cost(config)
                    performance = self.estimate_performance(config)
                    configs.append({
                        **config,
                        "annual_cost": cost,
                        "p99_latency_ms": performance["latency"],
                        "recall@10": performance["recall"],
                    })
        return configs

    def find_pareto_optimal(self, configs):
        """Find Pareto-optimal configurations"""
        pareto = []
        for c in configs:
            dominated = False
            for other in configs:
                if (other["recall@10"] >= c["recall@10"]
                        and other["annual_cost"] <= c["annual_cost"]
                        and other["p99_latency_ms"] <= c["p99_latency_ms"]
                        and (other["recall@10"] > c["recall@10"]
                             or other["annual_cost"] < c["annual_cost"]
                             or other["p99_latency_ms"] < c["p99_latency_ms"])):
                    dominated = True
                    break
            if not dominated:
                pareto.append(c)
        return pareto


# Usage example
frontier = CostPerformanceFrontier()
print("Configuration space: 6 dims × 4 quantizations × 4 indices = 96 configs")
A complementary strategy is tiered embeddings: use different dimensions for different data tiers:
class TieredEmbeddings:
    """Different embedding dimensions for different data tiers"""

    def __init__(self):
        self.hot_encoder = HighDimEncoder(dim=768)     # Frequent queries
        self.warm_encoder = MediumDimEncoder(dim=384)  # Moderate queries
        self.cold_encoder = LowDimEncoder(dim=128)     # Rare queries

    def encode_with_tier(self, item, access_frequency):
        """Encode with appropriate dimension based on access frequency"""
        if access_frequency > 1000:  # >1000 queries/day
            # Hot tier: high quality, high cost justified
            return self.hot_encoder.encode(item), 'hot'
        elif access_frequency > 10:
            # Warm tier: good quality, moderate cost
            return self.warm_encoder.encode(item), 'warm'
        else:
            # Cold tier: acceptable quality, low cost
            return self.cold_encoder.encode(item), 'cold'


# Cost savings:
# - 90% of embeddings in cold tier (128-dim): 83% storage savings
# - 9% in warm tier (384-dim): 50% savings
# - 1% in hot tier (768-dim): full quality
# - Overall: ~80% storage cost reduction
14.6 Key Takeaways
The build vs. fine-tune decision follows a spectrum from using frozen pre-trained models (Level 0) to training custom architectures from scratch (Level 4)—most organizations should target Level 3 (full fine-tuning) which delivers 95% of benefits at 20% of cost
Domain-specific requirements shape embedding design across five dimensions: semantic granularity (coarse to ultra-fine), asymmetry (query vs. document), multi-faceted similarity (multiple aspects), temporal dynamics (time-varying relevance), and hierarchical structure
Multi-objective embedding design balances competing goals through multi-task learning (shared encoder with task-specific heads), multi-vector representations (separate embeddings per objective), or constrained optimization (optimize primary objective subject to constraints)
Optimal embedding dimensionality balances capacity and cost—empirical evaluation across dimensions (128-1024) reveals diminishing returns beyond intrinsic dimensionality, with most domains achieving 95%+ quality at 256-512 dimensions vs. 768+ standard models
Dimensionality reduction techniques including PCA-based compression, learned projections, and binary embeddings enable 8-10x cost savings while maintaining acceptable quality for many use cases
Total cost of ownership spans storage, training, inference, and team costs—using the TCO model above, 100B embeddings at 768 dimensions would have annual costs around $47M, but optimization through dimension reduction (768→256), quantization (float32→int8), and tiered storage can achieve 90%+ cost savings
Cost-performance trade-offs navigate the Pareto frontier where different configurations offer optimal points—no single configuration dominates all objectives, requiring explicit business priority weighting to select operating points
14.7 Looking Ahead
Chapter 15 dives deep into contrastive learning—one of the most powerful techniques for training custom embeddings that achieve state-of-the-art performance across diverse domains.
14.8 Further Reading
Devlin, J., et al. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv:1810.04805
Reimers, N., & Gurevych, I. (2019). “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” arXiv:1908.10084
Muennighoff, N., et al. (2022). “SGPT: GPT Sentence Embeddings for Semantic Search.” arXiv:2202.08904
Radford, A., et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” arXiv:2103.00020 (CLIP)
Chen, T., et al. (2020). “A Simple Framework for Contrastive Learning of Visual Representations.” arXiv:2002.05709 (SimCLR)
Levina, E., & Bickel, P. (2004). “Maximum Likelihood Estimation of Intrinsic Dimension.” NIPS 2004
Jégou, H., et al. (2011). “Product Quantization for Nearest Neighbor Search.” IEEE TPAMI
Gong, Y., et al. (2020). “Quantization based Fast Inner Product Search.” AISTATS
Ruder, S. (2017). “An Overview of Multi-Task Learning in Deep Neural Networks.” arXiv:1706.05098
Caruana, R. (1997). “Multitask Learning.” Machine Learning 28, 41–75