Training embedding models at scale is only half the battle—serving embeddings in production requires extreme optimization of vector operations. This chapter explores the computational techniques that enable sub-millisecond similarity search across billion-vector indices: optimized similarity search algorithms that go beyond naive comparison, approximate nearest neighbor (ANN) methods that trade minimal accuracy for massive speedups, GPU acceleration strategies that exploit parallelism for vector operations, memory-mapped storage approaches that handle datasets exceeding RAM, and parallel query processing architectures that serve thousands of concurrent searches. These optimizations transform embedding systems from research prototypes to production services capable of handling trillion-row scale with single-digit millisecond latency.
After investing weeks in training high-quality embedding models (Chapter 20) and deploying robust production pipelines (Chapter 19), the final challenge is serving embeddings at scale. A recommendation system might need to search 100 million product embeddings for each user query, processing 10,000 queries per second with p99 latency under 10ms. A fraud detection system might compare incoming transactions against billions of historical embeddings in real-time. These requirements demand optimization at every level: algorithmic improvements, hardware acceleration, memory management, and distributed computation.
22.1 Optimized Similarity Search Algorithms
Similarity search is the core operation in embedding systems: given a query vector, find the most similar vectors in a large corpus. The naive approach—computing similarity between the query and every vector in the index—is prohibitively expensive at scale. Optimized algorithms reduce computation through mathematical insights, data structures, and approximations that maintain high accuracy while dramatically reducing latency.
22.1.1 The Similarity Search Problem
Given:
Query vector q ∈ ℝ^d (embedding dimension d)
Corpus of N vectors {v₁, v₂, …, vₙ} where each vᵢ ∈ ℝ^d
Find: The k vectors in the corpus most similar to q
Naive algorithm complexity: O(N × d)
For each of the N vectors, compute a d-dimensional similarity
Sort the results to find the top-k
At scale: 1B vectors × 512 dims × 4 bytes = 2 TB of data to scan
Challenge: Reduce from O(N × d) to sub-linear complexity while maintaining high recall
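For a back-of-envelope sense of why sub-linear search matters, the calculation below assumes roughly 50 GB/s of usable memory bandwidth on a single host; the bandwidth figure is an assumption for illustration, not a measurement.

# One brute-force scan of a billion-vector index versus a <10 ms latency budget
n, d, bytes_per_float = 1_000_000_000, 512, 4
bandwidth_bytes_per_s = 50e9                 # Assumed usable memory bandwidth per host
scan_bytes = n * d * bytes_per_float         # ~2 TB per full scan
seconds_per_query = scan_bytes / bandwidth_bytes_per_s
print(f"{scan_bytes / 1e12:.1f} TB per scan -> {seconds_per_query:.0f} s per query")
# Roughly 2.0 TB per scan and ~41 s per query on these assumptions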
22.1.2 Exact Search Optimizations
Before resorting to approximation, several exact search optimizations provide significant speedups:
NumPy SIMD operations:
import numpy as np


def cosine_similarity_batch(vectors: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Calculate cosine similarity using SIMD-optimized operations."""
    query_norm = query / np.linalg.norm(query)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    vectors_norm = vectors / norms
    return np.dot(vectors_norm, query_norm)


def euclidean_distance_batch(vectors: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Calculate Euclidean distance using vectorized operations."""
    diff = vectors - query
    return np.sqrt(np.sum(diff ** 2, axis=1))


def top_k_indices(similarities: np.ndarray, k: int) -> np.ndarray:
    """Get indices of the top k similarities using a partial sort."""
    return np.argpartition(similarities, -k)[-k:]


# Usage example
vectors = np.random.randn(10000, 384)  # 10k vectors of dim 384
query = np.random.randn(384)

similarities = cosine_similarity_batch(vectors, query)
top_k_idx = top_k_indices(similarities, k=10)

print(f"Top 10 most similar vectors: {top_k_idx}")
print(f"Their similarity scores: {similarities[top_k_idx]}")
Top 10 most similar vectors: [2818 9452 9664 6188 8597 4046 8106 615 7782 5355]
Their similarity scores: [0.16286924 0.18852427 0.1673641 0.19109742 0.16615928 0.17029357
0.19189182 0.20186868 0.16481189 0.16290935]
Tip: When to Use Exact vs. Approximate Search
Use exact search when:
Corpus < 10M vectors (exact search fast enough with GPU)
Zero approximation error required (regulatory/compliance)
22.2 Approximate Nearest Neighbor Search
For billion-vector indices, exact search becomes infeasible. Approximate nearest neighbor (ANN) algorithms trade a small amount of recall for massive speedups—typically achieving 95-99% recall at 10-100× lower latency for million-scale indices, scaling to 100-1000× at billion-scale. Modern ANN methods combine graph-based navigation, quantization, and partitioning to enable sub-millisecond search across trillion-row datasets.
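As a concrete way to quantify that tradeoff, the sketch below measures recall@k against brute-force ground truth; the helper names and the ann_search callable are illustrative assumptions rather than part of any particular library.

import numpy as np


def exact_top_k(vectors: np.ndarray, query: np.ndarray, k: int) -> set:
    """Ground-truth nearest neighbors via a brute-force dot product."""
    sims = vectors @ query
    return set(np.argpartition(sims, -k)[-k:].tolist())


def recall_at_k(vectors: np.ndarray, queries: np.ndarray, ann_search, k: int = 10) -> float:
    """Average fraction of true top-k neighbors that the ANN index returns."""
    hits = 0
    for q in queries:
        truth = exact_top_k(vectors, q, k)
        approx = set(ann_search(q, k))  # ann_search: whatever ANN index is being evaluated
        hits += len(truth & approx)
    return hits / (k * len(queries))

A recall@10 of 0.95 means that, averaged over queries, 9.5 of the 10 true nearest neighbors come back from the approximate index.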
22.2.1 ANN Algorithm Landscape
Partitioning methods (divide space into regions):
IVF (Inverted File Index): Cluster vectors, search only nearby clusters (a minimal IVF sketch appears at the end of this subsection)
LSH (Locality-Sensitive Hashing): Hash similar vectors to same buckets
Pro: Simple, fast for low-dimensional data
Con: Curse of dimensionality, many clusters needed for high recall
Graph methods (navigate similarity graph):
HNSW (Hierarchical Navigable Small World): Multi-layer skip list graph
In typical production settings, 95-99% recall is acceptable for most applications. Always measure recall on a holdout test set: a 2× speedup at 80% recall may hurt user experience more than the latency improvement helps.
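A minimal IVF sketch, using scikit-learn's KMeans for the clustering step; the cluster count and nprobe values are illustrative rather than tuned recommendations, and production libraries such as FAISS add quantization and far more careful cluster balancing.

import numpy as np
from sklearn.cluster import KMeans


class SimpleIVFIndex:
    """Toy inverted-file (IVF) index: cluster the corpus, then scan only the closest clusters."""

    def __init__(self, n_clusters: int = 64, nprobe: int = 4):
        self.n_clusters = n_clusters
        self.nprobe = nprobe  # Number of clusters scanned per query

    def build(self, vectors: np.ndarray):
        self.vectors = vectors
        self.kmeans = KMeans(n_clusters=self.n_clusters, n_init=10).fit(vectors)
        # Inverted lists: cluster id -> row indices assigned to that cluster
        self.lists = {c: np.where(self.kmeans.labels_ == c)[0] for c in range(self.n_clusters)}

    def search(self, query: np.ndarray, k: int = 10) -> np.ndarray:
        # Rank centroids by distance to the query and scan only the nprobe closest clusters
        dists = np.linalg.norm(self.kmeans.cluster_centers_ - query, axis=1)
        probe = np.argsort(dists)[:self.nprobe]
        candidates = np.concatenate([self.lists[c] for c in probe])
        sims = self.vectors[candidates] @ query
        return candidates[np.argsort(sims)[-k:][::-1]]


# Usage sketch: probing 4 of 64 clusters touches roughly 6% of the corpus per query
vectors = np.random.randn(20000, 128).astype(np.float32)
index = SimpleIVFIndex(n_clusters=64, nprobe=4)
index.build(vectors)
neighbors = index.search(np.random.randn(128).astype(np.float32), k=10)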
22.3 GPU Acceleration for Vector Operations
Modern GPUs provide 10-100× speedup for vector operations through massive parallelism. A single NVIDIA A100 GPU delivers 312 TFLOPS of dense FP16 Tensor Core throughput, the equivalent of thousands of CPU cores. Effective GPU acceleration requires understanding memory hierarchies, kernel optimization, and batching strategies.
22.3.1 GPU Architecture for Vector Operations
GPU-accelerated vector search:
import numpy as np


class GPUVectorSearch:
    """GPU-accelerated similarity search using CuPy."""

    def __init__(self, use_fp16: bool = False):
        try:
            import cupy as cp
            self.cp = cp
            self.use_gpu = True
            self.dtype = cp.float16 if use_fp16 else cp.float32
        except ImportError:
            self.use_gpu = False
            print("CuPy not available, using CPU fallback")

    def index(self, vectors: np.ndarray):
        """Transfer vectors to GPU memory."""
        if self.use_gpu:
            self.vectors_gpu = self.cp.asarray(vectors, dtype=self.dtype)
            self.norms_gpu = self.cp.linalg.norm(self.vectors_gpu, axis=1, keepdims=True)
        else:
            self.vectors_cpu = vectors

    def search_batch(self, queries: np.ndarray, k: int = 10) -> np.ndarray:
        """Search multiple queries in parallel on the GPU."""
        if self.use_gpu:
            queries_gpu = self.cp.asarray(queries, dtype=self.dtype)
            query_norms = self.cp.linalg.norm(queries_gpu, axis=1, keepdims=True)
            # Normalized cosine similarity on GPU
            vectors_norm = self.vectors_gpu / self.norms_gpu
            queries_norm = queries_gpu / query_norms
            similarities = self.cp.dot(queries_norm, vectors_norm.T)
            # Get top-k indices (highest similarity first)
            top_k_indices = self.cp.argsort(similarities, axis=1)[:, -k:]
            return self.cp.asnumpy(top_k_indices[:, ::-1])
        else:
            # CPU fallback: unnormalized dot-product ranking
            results = []
            for query in queries:
                sims = np.dot(self.vectors_cpu, query)
                top_k = np.argsort(sims)[-k:][::-1]
                results.append(top_k)
            return np.array(results)


# Usage example
vectors = np.random.randn(1000000, 256).astype(np.float32)  # 1M vectors
gpu_search = GPUVectorSearch(use_fp16=True)
gpu_search.index(vectors)

queries = np.random.randn(100, 256).astype(np.float32)  # Batch of 100 queries
results = gpu_search.search_batch(queries, k=10)
print(f"Processed {len(queries)} queries, found {results.shape[1]} results each")
CuPy not available, using CPU fallback
Processed 100 queries, found 10 results each
Tip: GPU Optimization Best Practices
Memory management:
Use FP16 when possible (2× capacity, minimal accuracy loss)
Pin memory for faster CPU→GPU transfers
Keep frequently accessed vectors in GPU memory
Use unified memory for > GPU capacity (automatic paging)
Computation optimization:
Batch queries to amortize kernel launch overhead (10-100× speedup)
Use Tensor Cores (matrix multiplication) over element-wise ops
Minimize CPU-GPU synchronization points
Profile with nvprof or NSight to find bottlenecks
Multi-GPU scaling (a shard-and-merge sketch follows this list):
Shard corpus across GPUs for > 80GB datasets
Use NCCL for fast inter-GPU communication
Pipeline data transfer and computation
Consider model parallelism for very wide embeddings
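The shard-and-merge pattern behind multi-GPU scaling can be sketched on the CPU as follows; in a real deployment each shard lives in one GPU's memory and the per-shard searches run concurrently (for example with CuPy or FAISS), with only the small per-shard top-k lists crossing devices. The shard count and merge helper here are illustrative.

import numpy as np


def merge_shard_results(shard_indices: list, shard_scores: list, shard_offsets: list, k: int):
    """Merge per-shard top-k lists (local indices + scores) into a global top-k."""
    global_ids = np.concatenate([idx + off for idx, off in zip(shard_indices, shard_offsets)])
    global_scores = np.concatenate(shard_scores)
    order = np.argsort(global_scores)[-k:][::-1]
    return global_ids[order], global_scores[order]


# Shard the corpus, search each shard independently, then merge the partial results
corpus = np.random.randn(200000, 128).astype(np.float32)
query = np.random.randn(128).astype(np.float32)
shards = np.array_split(corpus, 4)                        # One shard per GPU in practice
offsets = np.cumsum([0] + [len(s) for s in shards])[:-1]  # Global row offset of each shard

local_idx, local_scores = [], []
for shard in shards:                                      # Each iteration would run on its own device
    sims = shard @ query
    top = np.argsort(sims)[-10:]
    local_idx.append(top)
    local_scores.append(sims[top])

ids, scores = merge_shard_results(local_idx, local_scores, list(offsets), k=10)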
22.4 Memory-Mapped Vector Storage
Billion-vector indices exceed RAM capacity (1B × 512 dims × 4 bytes = 2TB). Memory-mapped files enable working with datasets larger than memory by loading data on-demand from disk, with the OS managing paging and caching.
Memory-mapped vector storage:
import numpy as np
import tempfile
import os


class MemoryMappedVectorStore:
    """Vector storage using memory-mapped files for datasets larger than RAM."""

    def __init__(self, filepath: str, dim: int, dtype=np.float32):
        self.filepath = filepath
        self.dim = dim
        self.dtype = dtype
        self.mmap = None
        self.count = 0

    def create(self, n_vectors: int):
        """Create a new memory-mapped file with pre-allocated capacity."""
        shape = (n_vectors, self.dim)
        self.mmap = np.memmap(self.filepath, dtype=self.dtype, mode='w+', shape=shape)
        self.count = 0  # Number of vectors actually written so far

    def append(self, vectors: np.ndarray):
        """Append vectors to the store (within the pre-allocated capacity)."""
        if self.mmap is None:
            self.create(len(vectors))
        self.mmap[self.count:self.count + len(vectors)] = vectors
        self.count += len(vectors)
        self.mmap.flush()

    def load(self):
        """Load an existing memory-mapped file."""
        if os.path.exists(self.filepath):
            # Reshape the flat file into (n_vectors, dim) rows
            self.mmap = np.memmap(self.filepath, dtype=self.dtype, mode='r').reshape(-1, self.dim)
            self.count = self.mmap.shape[0]

    def get_batch(self, indices: np.ndarray) -> np.ndarray:
        """Retrieve specific vectors by index."""
        return self.mmap[indices]

    def search_top_k(self, query: np.ndarray, k: int = 10) -> tuple:
        """Search for the top-k most similar vectors."""
        # Process in chunks to avoid loading the entire index into RAM
        chunk_size = 100000
        top_indices = []
        top_similarities = []
        for i in range(0, self.count, chunk_size):
            end = min(i + chunk_size, self.count)
            chunk = self.mmap[i:end]
            sims = np.dot(chunk, query) / (np.linalg.norm(chunk, axis=1) * np.linalg.norm(query))
            chunk_top_k = np.argsort(sims)[-k:]
            top_indices.extend(i + chunk_top_k)
            top_similarities.extend(sims[chunk_top_k])
        # Merge chunk candidates into the global top-k
        final_top_k = np.argsort(top_similarities)[-k:]
        return (np.array([top_indices[i] for i in final_top_k]),
                np.array([top_similarities[i] for i in final_top_k]))


# Usage example
with tempfile.NamedTemporaryFile(delete=False, suffix='.mmap') as f:
    store = MemoryMappedVectorStore(f.name, dim=256)
    vectors = np.random.randn(1000000, 256).astype(np.float32)
    store.append(vectors)  # Updates store.count so the search sees the data

    query = np.random.randn(256).astype(np.float32)
    indices, sims = store.search_top_k(query, k=10)
    print(f"Top 10 indices: {indices}")
    print(f"Top 10 similarities: {sims}")

os.unlink(f.name)
The example prints the indices and cosine similarities of the ten nearest vectors in the one-million-vector store; the exact values depend on the random data.
Tip: Memory-Mapped Storage Best Practices
When to use:
Dataset > available RAM
Can tolerate slightly higher latency (0.1-1ms SSD access vs <0.1ms RAM)
Access patterns have locality (similar vectors accessed together)
Cost-sensitive (avoid paying for 1TB+ RAM)
Optimizations:
Use SSDs (10× faster than HDDs for random access)
Cluster similar vectors together (spatial locality → better caching)
Combine with RAM cache for hot vectors (90/10 rule applies)
Prefetch next batch during computation
Use madvise to give the OS paging hints (see the sketch following this tip)
Avoid for:
Real-time serving (< 10ms latency requirements)
Truly random access patterns (no cache benefits)
Frequently updated indices (mmap flush overhead)
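A minimal sketch of passing paging hints through Python's mmap module before wrapping the mapping as a NumPy array; it assumes Python 3.8+ on Linux, where mmap.madvise and MADV_RANDOM are available, and a hypothetical raw float32 vector file.

import mmap
import numpy as np


def open_vectors_with_hint(path: str, dim: int, dtype=np.float32) -> np.ndarray:
    """Memory-map a raw vector file read-only and tell the OS to expect random access."""
    f = open(path, 'rb')
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # MADV_RANDOM disables aggressive readahead, which suits point lookups in an ANN index;
    # MADV_SEQUENTIAL is the better hint for full scans. Linux-only, Python 3.8+.
    mm.madvise(mmap.MADV_RANDOM)
    return np.frombuffer(mm, dtype=dtype).reshape(-1, dim)


# vectors = open_vectors_with_hint("vectors.f32", dim=256)  # hypothetical file path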
22.5 Parallel Query Processing
Modern embedding systems serve thousands of concurrent queries. Parallel query processing distributes load across multiple cores, GPUs, and machines to achieve high throughput while maintaining low latency.
Parallel query processing:
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed
from queue import Queue
import time


class ParallelQueryProcessor:
    """Process multiple queries in parallel using a thread pool."""

    def __init__(self, vectors: np.ndarray, n_workers: int = 8, batch_timeout: float = 0.05):
        self.vectors = vectors
        self.n_workers = n_workers
        self.batch_timeout = batch_timeout
        self.executor = ThreadPoolExecutor(max_workers=n_workers)
        self.query_queue = Queue()

    def search_single(self, query: np.ndarray, k: int = 10) -> np.ndarray:
        """Search for a single query."""
        similarities = np.dot(self.vectors, query) / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(query)
        )
        return np.argsort(similarities)[-k:][::-1]

    def search_batch(self, queries: list, k: int = 10) -> list:
        """Process multiple queries in parallel (results arrive in completion order)."""
        futures = [self.executor.submit(self.search_single, query, k) for query in queries]
        results = []
        for future in as_completed(futures):
            results.append(future.result())
        return results

    def search_with_batching(self, queries: list, k: int = 10) -> list:
        """Micro-batch queries so each pass is a single matrix multiplication."""
        batch_size = 32
        all_results = []
        for i in range(0, len(queries), batch_size):
            batch = queries[i:i + batch_size]
            batch_queries = np.array(batch)
            similarities = np.dot(self.vectors, batch_queries.T)
            top_k_indices = np.argsort(similarities, axis=0)[-k:, :]
            for col in range(top_k_indices.shape[1]):
                all_results.append(top_k_indices[:, col][::-1])
        return all_results

    def shutdown(self):
        """Clean up resources."""
        self.executor.shutdown(wait=True)


# Usage example
vectors = np.random.randn(100000, 256).astype(np.float32)
processor = ParallelQueryProcessor(vectors, n_workers=8)

# Process 100 queries in parallel
queries = [np.random.randn(256).astype(np.float32) for _ in range(100)]
start = time.time()
results = processor.search_batch(queries, k=10)
elapsed = time.time() - start

print(f"Processed {len(queries)} queries in {elapsed:.3f}s")
print(f"Throughput: {len(queries)/elapsed:.1f} queries/sec")
processor.shutdown()
Processed 100 queries in 0.488s
Throughput: 205.0 queries/sec
Tip: Parallel Processing Best Practices
Threading vs multiprocessing:
Use threading for I/O-bound tasks (disk, network)
Use multiprocessing for CPU-bound tasks (avoids GIL)
Use GPU for massive parallelism (thousands of threads)
Batching strategies:
Micro-batching (10-50ms window) for low latency
Macro-batching (1-10s window) for high throughput
Adaptive batching based on load (see the micro-batcher sketch at the end of this tip)
Load balancing:
Round-robin for uniform replicas
Least-load for heterogeneous replicas
Latency-weighted for geographic distribution
Health checks to detect failed replicas
Scaling limits:
CPU-bound: Scales to number of cores (8-96)
GPU-bound: Scales to GPU memory (80-640GB)
I/O-bound: Scales to disk/network bandwidth
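A minimal sketch of the adaptive micro-batching idea referenced above: queries accumulate until either a batch-size cap or a time window expires, and the whole batch is then served by a single matrix multiplication. The class name, window, and cap are illustrative choices, not a standard API.

import time
from queue import Queue, Empty

import numpy as np


class MicroBatcher:
    """Collect queries until max_batch or max_wait_s is reached, then run one batched search."""

    def __init__(self, vectors: np.ndarray, max_batch: int = 32, max_wait_s: float = 0.01):
        self.vectors = vectors
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = Queue()

    def submit(self, query: np.ndarray):
        self.queue.put(query)

    def run_once(self, k: int = 10) -> list:
        """Drain one micro-batch from the queue and answer it with a single matrix multiply."""
        batch = []
        deadline = time.time() + self.max_wait_s
        while len(batch) < self.max_batch and time.time() < deadline:
            try:
                batch.append(self.queue.get(timeout=max(0.0, deadline - time.time())))
            except Empty:
                break
        if not batch:
            return []
        sims = self.vectors @ np.stack(batch).T           # (N, batch) similarity matrix
        top_k = np.argsort(sims, axis=0)[-k:, :][::-1]    # Highest-similarity rows first
        return [top_k[:, i] for i in range(len(batch))]


# Usage sketch
vectors = np.random.randn(100000, 128).astype(np.float32)
batcher = MicroBatcher(vectors, max_batch=32, max_wait_s=0.01)
for _ in range(8):
    batcher.submit(np.random.randn(128).astype(np.float32))
results = batcher.run_once(k=10)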
22.6 Key Takeaways
Exact search optimizations enable real-time search at million-vector scale: SIMD vectorization, GPU acceleration, and batch processing provide 10-100× speedups over naive algorithms while maintaining zero approximation error
ANN algorithms trade minimal accuracy for massive speedups: IVF, HNSW, and product quantization achieve 95-99% recall at 100-1000× lower latency than exact search, enabling billion-vector indices with sub-millisecond response times
GPU acceleration provides 10-100× speedup for vector operations: Tensor Cores, FP16 precision, and batched matrix multiplication enable searching 100M+ vectors per second on a single A100 GPU
Memory-mapped storage handles datasets exceeding RAM: Operating system paging combined with tiered caching (hot vectors in RAM, cold on SSD) enables serving trillion-row indices on commodity hardware
Parallel query processing achieves high throughput: Thread pooling, request batching, and load balancing across replicas scale serving capacity to millions of queries per second
Production systems combine multiple optimizations: Successful deployments use ANN + GPU + memory mapping + parallel processing together, with each optimization addressing different bottlenecks
The optimization hierarchy: Algorithm choice (1000× impact) > Hardware acceleration (100× impact) > Parallelism (10× impact) > Implementation details (2× impact). Choose the right algorithm before micro-optimizing
22.7 Looking Ahead
This chapter covered computational optimizations for vector operations in isolation. Chapter 23 expands the view to data engineering for embeddings at scale: ETL pipelines that ingest and transform raw data for embedding generation, streaming systems for real-time embedding updates, data quality validation that ensures training stability, schema evolution strategies for backwards compatibility, and multi-source data fusion that combines embeddings across diverse datasets. These data engineering practices ensure embedding systems have the high-quality, well-structured data needed to reach their full potential.
22.8 Further Reading
22.8.1 Similarity Search Algorithms
Johnson, Jeff, et al. (2019). “Billion-scale similarity search with GPUs.” IEEE Transactions on Big Data.
Malkov, Yury, and D. A. Yashunin (2018). “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” IEEE TPAMI.
Jégou, Hervé, et al. (2011). “Product Quantization for Nearest Neighbor Search.” IEEE TPAMI.
22.8.2 Approximate Nearest Neighbors
Andoni, Alexandr, and Piotr Indyk (2006). “Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions.” FOCS.
Baranchuk, Dmitry, et al. (2019). “Learning to Route in Similarity Graphs.” ICML.
Aumüller, Martin, et al. (2020). “ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms.” Information Systems.
22.8.3 GPU Acceleration
NVIDIA (2020). “CUDA C++ Programming Guide.”
Harris, Mark (2007). “Optimizing Parallel Reduction in CUDA.” NVIDIA Developer.
Sismanis, Nikos, et al. (2021). “Efficient Large-Scale Approximate Nearest Neighbor Search on OpenCL FPGA.” VLDB.
22.8.4 Memory Management
Boroumand, Amirali, et al. (2018). “Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks.” ASPLOS.
Lim, Kevin, et al. (2009). “Disaggregated Memory for Expansion and Sharing in Blade Servers.” ISCA.