Production embedding systems rarely use single, off-the-shelf embeddings. This chapter covers the advanced patterns that power real-world systems: hybrid vectors combining multiple feature types, multi-vector representations for fine-grained matching, learned sparse embeddings for interpretability, and domain-specific patterns for security, time-series, and structured data. These patterns build on the foundational types covered in Chapters 4-9.
10.1 Beyond Single Embeddings
The foundational embedding types—text, image, audio, and others—serve as building blocks. Production systems combine, extend, and specialize these foundations in sophisticated ways:
- Hybrid embeddings combine semantic, categorical, numerical, and domain-specific features
- Multi-vector representations use multiple embeddings per item for fine-grained matching
- Learned sparse embeddings balance dense semantics with interpretable sparse features
- Specialized architectures optimize for specific retrieval patterns
Understanding these patterns is essential for building embedding systems that perform well on real-world data.
10.2 Hybrid and Composite Embeddings
Real-world entities have multiple facets that single embeddings can’t capture. A security log has semantic content (message text), categorical features (event type, severity), numerical features (byte counts, durations), and domain-specific features (IP addresses). Hybrid embeddings combine all of these.
10.2.1 The Naive Approach Fails
Simple concatenation doesn’t work:
"""Why Naive Concatenation FailsWhen combining embeddings of different dimensions, larger vectorsdominate similarity calculations, drowning out smaller features."""import numpy as npfrom sklearn.metrics.pairwise import cosine_similaritynp.random.seed(42)# Simulate: 384-dim text embedding + 10-dim numerical featurestext_embedding = np.random.randn(384)numerical_features = np.array([0.5, 0.8, 0.2, 0.1, 0.9, 0.3, 0.7, 0.4, 0.6, 0.5])# Naive concatenationnaive_hybrid = np.concatenate([text_embedding, numerical_features])# The problem: text embedding dominatestext_magnitude = np.linalg.norm(text_embedding)num_magnitude = np.linalg.norm(numerical_features)print("Magnitude comparison:")print(f" Text embedding (384 dims): {text_magnitude:.2f}")print(f" Numerical features (10 dims): {num_magnitude:.2f}")print(f" Ratio: {text_magnitude/num_magnitude:.1f}x")print("\nThe text embedding will dominate similarity calculations!")
Magnitude comparison:
Text embedding (384 dims): 18.67
Numerical features (10 dims): 1.76
Ratio: 10.6x
The text embedding will dominate similarity calculations!
10.2.2 Weighted Normalized Concatenation
The solution: normalize each component, then apply importance weights:
"""Weighted Normalized ConcatenationProperly combines multiple feature types by:1. L2-normalizing each component independently2. Applying learned or tuned weights3. Concatenating the weighted, normalized components"""import numpy as npfrom sklearn.preprocessing import normalizenp.random.seed(42)def create_hybrid_embedding( text_embedding: np.ndarray, categorical_embedding: np.ndarray, numerical_features: np.ndarray, domain_features: np.ndarray, weights: dict) -> np.ndarray:""" Create a hybrid embedding from multiple feature types. Args: text_embedding: Semantic embedding from text encoder (e.g., 384 dims) categorical_embedding: Learned embeddings for categorical features numerical_features: Scaled numerical features domain_features: Domain-specific features (e.g., IP encoding) weights: Importance weights for each component (should sum to 1.0) Returns: Hybrid embedding vector """# L2-normalize each component text_norm = normalize(text_embedding.reshape(1, -1))[0] cat_norm = normalize(categorical_embedding.reshape(1, -1))[0] num_norm = normalize(numerical_features.reshape(1, -1))[0] domain_norm = normalize(domain_features.reshape(1, -1))[0]# Apply weights and concatenate hybrid = np.concatenate([ text_norm * weights['text'], cat_norm * weights['categorical'], num_norm * weights['numerical'], domain_norm * weights['domain'] ])return hybrid# Example: Security log embeddingtext_emb = np.random.randn(384) # From sentence transformercat_emb = np.random.randn(32) # Learned embeddings for event_type, severitynum_feat = np.random.randn(10) # Scaled: bytes_in, bytes_out, durationdomain_feat = np.array([0.75, 0.65, 0.003, 0.039, 1.0]) # IP octets + is_private# Weights are hyperparameters to tuneweights = {'text': 0.50, # Semantic content is most important'categorical': 0.20, # Event type matters'numerical': 0.15, # Metrics provide context'domain': 0.15# IP information for security}hybrid = create_hybrid_embedding( text_emb, cat_emb, num_feat, domain_feat, weights)print(f"Hybrid embedding dimension: {len(hybrid)}")print(f" Text component: 384 dims × {weights['text']} weight")print(f" Categorical: 32 dims × {weights['categorical']} weight")print(f" Numerical: 10 dims × {weights['numerical']} weight")print(f" Domain: 5 dims × {weights['domain']} weight")
When to use multi-vector representations:
- Fine-grained matching matters (e.g., exact phrase matching; see the late-interaction sketch below)
- Documents are long and diverse
- You can afford the 10-100x storage overhead
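For context on how multi-vector matching is scored: ColBERT-style late interaction compares every query token embedding against every document token embedding and sums, per query token, the best match (MaxSim). A minimal sketch with simulated token embeddings, since a real system would use a trained token-level encoder:

```python
import numpy as np
from sklearn.preprocessing import normalize

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take the best
    cosine match among document token embeddings, then sum over query tokens."""
    q = normalize(query_tokens)          # (num_query_tokens, dim)
    d = normalize(doc_tokens)            # (num_doc_tokens, dim)
    sim = q @ d.T                        # pairwise cosine similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token

# Simulated token embeddings (stand-ins for a trained encoder)
np.random.seed(0)
query = np.random.randn(4, 128)                               # 4 query tokens
doc_a = np.vstack([query + 0.1 * np.random.randn(4, 128),     # overlapping content
                   np.random.randn(20, 128)])
doc_b = np.random.randn(24, 128)                              # unrelated content

print(f"MaxSim(query, doc_a): {maxsim_score(query, doc_a):.2f}")
print(f"MaxSim(query, doc_b): {maxsim_score(query, doc_b):.2f}")
```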
10.4 Matryoshka Embeddings
Matryoshka (nested doll) embeddings encode information hierarchically—the first N dimensions are a valid embedding on their own:
"""Matryoshka Embeddings: Variable-Length RepresentationsThe first N dimensions form a valid embedding for any N.Trade off quality vs. efficiency at query time."""import numpy as npfrom sklearn.metrics.pairwise import cosine_similarity# Simulate Matryoshka embeddings (trained to work at multiple dimensions)np.random.seed(42)def simulate_matryoshka_embedding(text: str, full_dim: int=768) -> np.ndarray:""" Simulate a Matryoshka embedding where prefixes are valid embeddings. Real models are trained with a special loss to ensure this property. """ np.random.seed(hash(text) %2**32)return np.random.randn(full_dim)texts = ["machine learning for natural language processing","deep learning NLP models","cooking italian pasta recipes",]embeddings = [simulate_matryoshka_embedding(t) for t in texts]# Compare at different dimension prefixesprint("Similarity at different dimensions:\n")for dim in [64, 128, 256, 768]: truncated = [e[:dim] for e in embeddings] sim_01 = cosine_similarity([truncated[0]], [truncated[1]])[0][0] sim_02 = cosine_similarity([truncated[0]], [truncated[2]])[0][0]print(f" {dim} dims: ML↔DL={sim_01:.3f}, ML↔Cooking={sim_02:.3f}")
Similarity at different dimensions:
64 dims: ML↔DL=-0.113, ML↔Cooking=-0.078
128 dims: ML↔DL=-0.075, ML↔Cooking=-0.143
256 dims: ML↔DL=-0.084, ML↔Cooking=-0.151
768 dims: ML↔DL=-0.080, ML↔Cooking=-0.058
Because this demo uses random vectors rather than a model trained with the Matryoshka objective, the similarity values above carry no semantic signal; the point is that a prefix of any length can still be compared directly.
Benefits of Matryoshka embeddings:
- Use short prefixes for fast initial retrieval
- Use full dimensions for final reranking (see the coarse-to-fine sketch below)
- Adapt to latency/quality requirements at runtime
- Reduce storage by storing only needed dimensions
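The first two benefits are typically combined into a coarse-to-fine search: shortlist candidates with a cheap prefix, then rerank the shortlist with the full vectors. A minimal sketch under the simplifying assumption that the corpus fits in memory as a NumPy array:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def coarse_to_fine_search(query: np.ndarray, corpus: np.ndarray,
                          coarse_dim: int = 64, shortlist: int = 100,
                          top_k: int = 10) -> np.ndarray:
    """Shortlist with a short prefix, rerank the shortlist with full vectors."""
    # Stage 1: similarity on the first `coarse_dim` dimensions only
    coarse_sims = cosine_similarity(query[:coarse_dim].reshape(1, -1),
                                    corpus[:, :coarse_dim])[0]
    candidates = np.argsort(coarse_sims)[::-1][:shortlist]

    # Stage 2: rerank the candidates with the full-dimensional embeddings
    full_sims = cosine_similarity(query.reshape(1, -1), corpus[candidates])[0]
    return candidates[np.argsort(full_sims)[::-1][:top_k]]

# Toy corpus of 10,000 simulated 768-dim embeddings
np.random.seed(0)
corpus = np.random.randn(10_000, 768)
query = corpus[42] + 0.1 * np.random.randn(768)   # near-duplicate of item 42

print(coarse_to_fine_search(query, corpus)[:3])   # item 42 should appear first
```

In production the coarse stage would typically run against an approximate nearest-neighbor index built on the truncated vectors rather than a brute-force scan.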
10.5 Learned Sparse Embeddings
SPLADE and similar models learn sparse representations that combine the best of dense and sparse retrieval:
"""Learned Sparse Embeddings (SPLADE-style)Learn to predict which vocabulary terms are important for a document.Results in sparse vectors with interpretable dimensions (actual words)."""import numpy as npdef simulate_splade_embedding(text: str, vocab_size: int=30000) ->dict:""" Simulate SPLADE-style sparse embedding. Returns dict mapping vocabulary indices to importance weights. Real SPLADE uses a transformer to predict term importance. """ words = text.lower().split() sparse = {} np.random.seed(hash(text) %2**32)for word in words:# Simulate vocabulary index idx =hash(word) % vocab_size# Simulate learned importance weight weight = np.random.exponential(1.0) sparse[idx] =max(sparse.get(idx, 0), weight)# SPLADE also expands to related termsfor _ inrange(len(words)): expanded_idx = np.random.randint(vocab_size) sparse[expanded_idx] = np.random.exponential(0.5)return sparsedef sparse_dot_product(sparse1: dict, sparse2: dict) ->float:"""Compute dot product of two sparse vectors.""" score =0.0for idx, weight1 in sparse1.items():if idx in sparse2: score += weight1 * sparse2[idx]return score# Examplequery ="machine learning models"doc1 ="neural network deep learning"doc2 ="kitchen cooking recipes"q_sparse = simulate_splade_embedding(query)d1_sparse = simulate_splade_embedding(doc1)d2_sparse = simulate_splade_embedding(doc2)print(f"Query sparse embedding: {len(q_sparse)} non-zero terms")print(f"Doc 1 sparse embedding: {len(d1_sparse)} non-zero terms")print(f"Doc 2 sparse embedding: {len(d2_sparse)} non-zero terms")print(f"\nQuery ↔ Doc 1 (related): {sparse_dot_product(q_sparse, d1_sparse):.2f}")print(f"Query ↔ Doc 2 (unrelated): {sparse_dot_product(q_sparse, d2_sparse):.2f}")
Benefits of learned sparse embeddings:
- Interpretable (dimensions correspond to vocabulary terms)
- Works with inverted indices (fast exact matching)
- Captures term expansion (related terms)
- Combines well with dense embeddings in hybrid search (see the rank-fusion sketch below)
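Because dense cosine scores and sparse dot-product scores live on different scales, hybrid search usually fuses rankings rather than raw scores. One common, scale-free option is reciprocal rank fusion; a minimal sketch (the constant k=60 is a conventional default, and the document IDs are made up):

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-5 results from a dense index and a learned-sparse index
dense_ranking = ["doc3", "doc1", "doc7", "doc2", "doc9"]
sparse_ranking = ["doc1", "doc3", "doc4", "doc7", "doc8"]

print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# Documents ranked highly by both retrievers (doc1, doc3) rise to the top
```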
10.6 Time-Series Pattern Embeddings
Beyond basic statistical features, production systems use learned representations for time-series patterns.
10.6.1 ROCKET: Random Convolutional Kernels
ROCKET transforms a time-series into features using random convolutional kernels (the original method uses 10,000 kernels; the demo below uses 50 to keep the output small):
"""ROCKET-Style Time-Series EmbeddingsUses random convolutional kernels to extract features from time-series.Fast to compute, works well for classification and similarity."""import numpy as npdef generate_random_kernels(n_kernels: int=100, max_length: int=9) ->list:"""Generate random convolutional kernels.""" np.random.seed(42) kernels = []for _ inrange(n_kernels): length = np.random.choice([3, 5, 7, 9]) weights = np.random.randn(length) bias = np.random.randn() dilation = np.random.choice([1, 2, 4]) kernels.append((weights, bias, dilation))return kernelsdef apply_kernel(series: np.ndarray, kernel: tuple) ->tuple:"""Apply a single kernel and extract features (max, ppv).""" weights, bias, dilation = kernel length =len(weights)# Dilated convolution output = []for i inrange(len(series) - (length -1) * dilation): indices = [i + j * dilation for j inrange(length)] value = np.dot(series[indices], weights) + bias output.append(value) output = np.array(output)# ROCKET features: max value and proportion of positive values (PPV) max_val = np.max(output) iflen(output) >0else0 ppv = np.mean(output >0) iflen(output) >0else0return max_val, ppvdef rocket_embedding(series: np.ndarray, kernels: list) -> np.ndarray:"""Create ROCKET embedding from time-series.""" features = []for kernel in kernels: max_val, ppv = apply_kernel(series, kernel) features.extend([max_val, ppv])return np.array(features)# Generate kernels (done once)kernels = generate_random_kernels(n_kernels=50)# Example time-series patternst = np.linspace(0, 4*np.pi, 100)patterns = {'sine': np.sin(t) + np.random.randn(100) *0.1,'cosine': np.cos(t) + np.random.randn(100) *0.1,'trend_up': t/10+ np.random.randn(100) *0.2,'random': np.random.randn(100),}# Create embeddingsembeddings = {name: rocket_embedding(series, kernels)for name, series in patterns.items()}print(f"ROCKET embedding dimension: {len(embeddings['sine'])}")print(f" ({len(kernels)} kernels × 2 features each)")# Compare patternsfrom sklearn.metrics.pairwise import cosine_similarityprint("\nPattern similarities:")print(f" sine ↔ cosine: {cosine_similarity([embeddings['sine']], [embeddings['cosine']])[0][0]:.3f}")print(f" sine ↔ trend: {cosine_similarity([embeddings['sine']], [embeddings['trend_up']])[0][0]:.3f}")print(f" sine ↔ random: {cosine_similarity([embeddings['sine']], [embeddings['random']])[0][0]:.3f}")
ROCKET embedding dimension: 100
(50 kernels × 2 features each)
Pattern similarities:
sine ↔ cosine: 0.998
sine ↔ trend: 0.828
sine ↔ random: 0.894
10.6.2 Learned Temporal Embeddings
For more complex patterns, use neural networks:
"""Learned Temporal EmbeddingsUse LSTMs, Transformers, or Temporal CNNs to learn time-series representations.This example shows a simplified LSTM-style encoding."""import numpy as npclass SimpleTemporalEncoder:""" Simplified temporal encoder for illustration. Production systems use PyTorch/TensorFlow LSTM or Transformer. """def__init__(self, hidden_dim: int=64):self.hidden_dim = hidden_dim np.random.seed(42)# Simplified: project statistics to hidden spaceself.projection = np.random.randn(10, hidden_dim) *0.1def encode(self, series: np.ndarray) -> np.ndarray:"""Encode time-series to fixed-length embedding."""# Extract temporal features features = np.array([ np.mean(series), np.std(series), np.min(series), np.max(series), np.mean(np.diff(series)), # Trend np.std(np.diff(series)), # Volatility np.corrcoef(series[:-1], series[1:])[0, 1], # Autocorrelationlen(np.where(np.diff(np.sign(series)))[0]), # Zero crossings np.percentile(series, 25), np.percentile(series, 75), ]) features = np.nan_to_num(features)# Project to embedding space embedding = np.tanh(features @self.projection)return embeddingencoder = SimpleTemporalEncoder(hidden_dim=64)# Encode different patternst = np.linspace(0, 4*np.pi, 100)embeddings = {'periodic': encoder.encode(np.sin(t)),'trending': encoder.encode(t /10),'volatile': encoder.encode(np.random.randn(100)),}print(f"Temporal embedding dimension: {len(embeddings['periodic'])}")
Temporal embedding dimension: 64
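For an actual learned encoder, a small recurrent network is a natural starting point. Below is a minimal PyTorch sketch, assuming torch is installed; the module is untrained here, and in practice it would be optimized with a downstream task or a contrastive objective:

```python
import torch
import torch.nn as nn

class LSTMTemporalEncoder(nn.Module):
    """Encode a univariate time-series into a fixed-length embedding."""
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim, batch_first=True)

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, seq_len) -> (batch, seq_len, 1)
        _, (h_n, _) = self.lstm(series.unsqueeze(-1))
        return h_n[-1]          # final hidden state: (batch, hidden_dim)

encoder = LSTMTemporalEncoder(hidden_dim=64)
t = torch.linspace(0, 4 * torch.pi, 100)
batch = torch.stack([torch.sin(t), t / 10, torch.randn(100)])  # three series
with torch.no_grad():
    embeddings = encoder(batch)
print(embeddings.shape)  # torch.Size([3, 64])
```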
10.7 Binary and Quantized Embeddings
For massive scale, compress embeddings to reduce storage and accelerate search:
"""Binary and Quantized EmbeddingsCompress embeddings for efficiency:- Binary: Each dimension → 1 bit (32x compression)- Product Quantization: Learn codebooks for compression"""import numpy as npdef binarize_embedding(embedding: np.ndarray) -> np.ndarray:"""Convert to binary embedding (sign of each dimension)."""return (embedding >0).astype(np.int8)def hamming_distance(bin1: np.ndarray, bin2: np.ndarray) ->int:"""Hamming distance between binary vectors."""return np.sum(bin1 != bin2)def hamming_similarity(bin1: np.ndarray, bin2: np.ndarray) ->float:"""Normalized Hamming similarity (0 to 1)."""return1- hamming_distance(bin1, bin2) /len(bin1)# Example: Compare binary vs float embeddingsnp.random.seed(42)emb1 = np.random.randn(768)emb2 = emb1 + np.random.randn(768) *0.5# Similaremb3 = np.random.randn(768) # Different# Float similarityfrom sklearn.metrics.pairwise import cosine_similarityfloat_sim_12 = cosine_similarity([emb1], [emb2])[0][0]float_sim_13 = cosine_similarity([emb1], [emb3])[0][0]# Binary similaritybin1, bin2, bin3 = [binarize_embedding(e) for e in [emb1, emb2, emb3]]bin_sim_12 = hamming_similarity(bin1, bin2)bin_sim_13 = hamming_similarity(bin1, bin3)print("Float vs Binary similarity comparison:")print(f"\n Similar pair:")print(f" Float cosine: {float_sim_12:.3f}")print(f" Binary Hamming: {bin_sim_12:.3f}")print(f"\n Different pair:")print(f" Float cosine: {float_sim_13:.3f}")print(f" Binary Hamming: {bin_sim_13:.3f}")print(f"\nStorage comparison for 768-dim embedding:")print(f" Float32: {768*4} bytes")print(f" Binary: {768//8} bytes ({768*4/ (768//8):.0f}x compression)")
Float vs Binary similarity comparison:
Similar pair:
Float cosine: 0.894
Binary Hamming: 0.859
Different pair:
Float cosine: -0.016
Binary Hamming: 0.504
Storage comparison for 768-dim embedding:
Float32: 3072 bytes
Binary: 96 bytes (32x compression)
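The listing's docstring also mentions product quantization (PQ), which the demo doesn't implement. The idea is to split each vector into subvectors, cluster each subspace with k-means, and store one centroid ID per subvector. A rough sketch using scikit-learn's KMeans, with illustrative sizes (8 subvectors, 256 centroids):

```python
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(42)
dim, n_subvectors, n_centroids = 768, 8, 256
sub_dim = dim // n_subvectors
vectors = np.random.randn(5000, dim).astype(np.float32)

# Train one codebook per subspace, then encode each vector as 8 uint8 codes
codebooks, codes = [], []
for s in range(n_subvectors):
    sub = vectors[:, s * sub_dim:(s + 1) * sub_dim]
    km = KMeans(n_clusters=n_centroids, n_init=1, random_state=42).fit(sub)
    codebooks.append(km.cluster_centers_)
    codes.append(km.labels_.astype(np.uint8))
codes = np.stack(codes, axis=1)          # (5000, 8): 8 bytes per vector

# Reconstruct an approximation of each vector from its codes
approx = np.hstack([codebooks[s][codes[:, s]] for s in range(n_subvectors)])
err = np.linalg.norm(vectors - approx, axis=1) / np.linalg.norm(vectors, axis=1)
print(f"Compression: {dim * 4} bytes -> {n_subvectors} bytes per vector")
print(f"Mean relative reconstruction error: {err.mean():.2f}")
```

At search time, PQ systems approximate distances from per-subspace lookup tables rather than reconstructing full vectors.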
When to use quantized embeddings:
- Billions of vectors (storage constraints)
- Latency-critical applications
- First-stage retrieval (rerank with full precision)
- Edge deployment
10.8 Session and Behavioral Embeddings
Embed user sessions and behaviors as sequences:
"""Session and Behavioral EmbeddingsEmbed sequences of user actions to capture behavioral patterns.Similar sessions (browsing patterns) get similar embeddings."""import numpy as npfrom sklearn.metrics.pairwise import cosine_similarityclass SessionEncoder:"""Encode user sessions as embeddings."""def__init__(self, action_vocab: list, embedding_dim: int=64):self.action_vocab = {a: i for i, a inenumerate(action_vocab)}self.embedding_dim = embedding_dim np.random.seed(42)# Action embeddings (would be learned)self.action_embeddings = np.random.randn(len(action_vocab), embedding_dim) *0.1def encode_session(self, actions: list) -> np.ndarray:"""Encode a session (sequence of actions) to single embedding."""ifnot actions:return np.zeros(self.embedding_dim)# Get embeddings for each action action_embs = []for action in actions:if action inself.action_vocab: idx =self.action_vocab[action] action_embs.append(self.action_embeddings[idx])ifnot action_embs:return np.zeros(self.embedding_dim)# Combine with weighted average (recent actions weighted more) weights = np.exp(np.linspace(-1, 0, len(action_embs))) weights /= weights.sum() session_emb = np.average(action_embs, axis=0, weights=weights)return session_emb# Define action vocabularyactions = ['view_product', 'add_to_cart', 'remove_from_cart','view_category', 'search', 'checkout', 'view_reviews']encoder = SessionEncoder(actions)# Example sessionsshopping_session = ['view_category', 'view_product', 'view_reviews','add_to_cart', 'view_product', 'add_to_cart', 'checkout']browsing_session = ['view_category', 'view_product', 'view_category','search', 'view_product', 'view_category']cart_abandon = ['view_product', 'add_to_cart', 'view_product','add_to_cart', 'remove_from_cart']emb_shopping = encoder.encode_session(shopping_session)emb_browsing = encoder.encode_session(browsing_session)emb_abandon = encoder.encode_session(cart_abandon)print("Session similarities:")print(f" Shopping ↔ Browsing: {cosine_similarity([emb_shopping], [emb_browsing])[0][0]:.3f}")print(f" Shopping ↔ Cart abandon: {cosine_similarity([emb_shopping], [emb_abandon])[0][0]:.3f}")print(f" Browsing ↔ Cart abandon: {cosine_similarity([emb_browsing], [emb_abandon])[0][0]:.3f}")
Key takeaways:
- Naive concatenation fails when combining embeddings of different sizes—use weighted, normalized concatenation
- Entity embeddings for categorical features outperform one-hot encoding by learning relationships between categories
- Multi-vector representations (ColBERT) provide fine-grained matching at the cost of storage
- Matryoshka embeddings enable quality/latency trade-offs at query time
- Learned sparse embeddings (SPLADE) combine interpretability with semantic matching
- Time-series patterns can be captured with ROCKET (fast and simple) or learned encoders (more expressive)
- Domain-specific data such as security logs requires a thoughtful combination of semantic, categorical, numerical, and specialized features
10.12 Looking Ahead
This completes Part II on embedding types. Chapter 11 begins Part III: Core Applications, showing how to build retrieval-augmented generation systems that put these embeddings to work. For training custom embeddings with these patterns, Chapter 14 in Part IV provides guidance on when to build versus fine-tune.
10.13 Further Reading
Khattab, O. & Zaharia, M. (2020). “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR
Kusupati, A., et al. (2022). “Matryoshka Representation Learning.” NeurIPS
Formal, T., et al. (2021). “SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.” SIGIR
Dempster, A., et al. (2020). “ROCKET: Exceptionally Fast and Accurate Time Series Classification Using Random Convolutional Kernels.” Data Mining and Knowledge Discovery
Guo, C., et al. (2016). “Entity Embeddings of Categorical Variables.” arXiv:1604.06737