Production embedding systems rarely use single, off-the-shelf embeddings. This chapter covers the advanced patterns that power real-world systems: hybrid vectors combining multiple feature types, multi-vector representations for fine-grained matching, learned sparse embeddings for interpretability, and domain-specific patterns for security, time-series, and structured data. These patterns build on the foundational types covered in Chapters 4-9.
10.1 Beyond Single Embeddings
The foundational embedding types—text, image, audio, and others—serve as building blocks. Production systems combine, extend, and specialize these foundations in sophisticated ways:
Hybrid embeddings combine semantic, categorical, numerical, and domain-specific features
Multi-vector representations use multiple embeddings per item for fine-grained matching
Learned sparse embeddings balance dense semantics with interpretable sparse features
Specialized architectures optimize for specific retrieval patterns
Understanding these patterns is essential for building embedding systems that perform well on real-world data.
10.2 Hybrid and Composite Embeddings
Real-world entities have multiple facets that single embeddings can’t capture. A security log has semantic content (message text), categorical features (event type, severity), numerical features (byte counts, durations), and domain-specific features (IP addresses). Hybrid embeddings combine all of these.
10.2.1 The Naive Approach Fails
Simple concatenation doesn’t work:
"""Why Naive Concatenation FailsWhen combining embeddings of different dimensions, larger vectorsdominate similarity calculations, drowning out smaller features."""import numpy as npfrom sklearn.metrics.pairwise import cosine_similaritynp.random.seed(42)# Simulate: 384-dim text embedding + 10-dim numerical featurestext_embedding = np.random.randn(384)numerical_features = np.array([0.5, 0.8, 0.2, 0.1, 0.9, 0.3, 0.7, 0.4, 0.6, 0.5])# Naive concatenationnaive_hybrid = np.concatenate([text_embedding, numerical_features])# The problem: text embedding dominatestext_magnitude = np.linalg.norm(text_embedding)num_magnitude = np.linalg.norm(numerical_features)print("Magnitude comparison:")print(f" Text embedding (384 dims): {text_magnitude:.2f}")print(f" Numerical features (10 dims): {num_magnitude:.2f}")print(f" Ratio: {text_magnitude/num_magnitude:.1f}x")print("\nThe text embedding will dominate similarity calculations!")
Magnitude comparison:
Text embedding (384 dims): 18.67
Numerical features (10 dims): 1.76
Ratio: 10.6x
The text embedding will dominate similarity calculations!
10.2.2 Weighted Normalized Concatenation
The solution: normalize each component, then apply importance weights:
"""Weighted Normalized ConcatenationProperly combines multiple feature types by:1. L2-normalizing each component independently2. Applying learned or tuned weights3. Concatenating the weighted, normalized components"""import numpy as npfrom sklearn.preprocessing import normalizenp.random.seed(42)def create_hybrid_embedding( text_embedding: np.ndarray, categorical_embedding: np.ndarray, numerical_features: np.ndarray, domain_features: np.ndarray, weights: dict) -> np.ndarray:""" Create a hybrid embedding from multiple feature types. Args: text_embedding: Semantic embedding from text encoder (e.g., 384 dims) categorical_embedding: Learned embeddings for categorical features numerical_features: Scaled numerical features domain_features: Domain-specific features (e.g., IP encoding) weights: Importance weights for each component (should sum to 1.0) Returns: Hybrid embedding vector """# L2-normalize each component text_norm = normalize(text_embedding.reshape(1, -1))[0] cat_norm = normalize(categorical_embedding.reshape(1, -1))[0] num_norm = normalize(numerical_features.reshape(1, -1))[0] domain_norm = normalize(domain_features.reshape(1, -1))[0]# Apply weights and concatenate hybrid = np.concatenate([ text_norm * weights['text'], cat_norm * weights['categorical'], num_norm * weights['numerical'], domain_norm * weights['domain'] ])return hybrid# Example: Security log embeddingtext_emb = np.random.randn(384) # From sentence transformercat_emb = np.random.randn(32) # Learned embeddings for event_type, severitynum_feat = np.random.randn(10) # Scaled: bytes_in, bytes_out, durationdomain_feat = np.array([0.75, 0.65, 0.003, 0.039, 1.0]) # IP octets + is_private# Weights are hyperparameters to tuneweights = {'text': 0.50, # Semantic content is most important'categorical': 0.20, # Event type matters'numerical': 0.15, # Metrics provide context'domain': 0.15# IP information for security}hybrid = create_hybrid_embedding( text_emb, cat_emb, num_feat, domain_feat, 
weights)print(f"Hybrid embedding dimension: {len(hybrid)}")print(f" Text component: 384 dims × {weights['text']} weight")print(f" Categorical: 32 dims × {weights['categorical']} weight")print(f" Numerical: 10 dims × {weights['numerical']} weight")print(f" Domain: 5 dims × {weights['domain']} weight")
Tuning hybrid embedding weights: The weights (0.50 text, 0.20 categorical, 0.15 numerical, 0.15 domain) are critical hyperparameters that determine the final embedding’s behavior. Don’t use equal weights—they ignore the relative importance of each feature type.
Three approaches to finding optimal weights:
Grid search: Try different weight combinations on a validation set measuring your downstream task (classification accuracy, retrieval recall, etc.). Start coarse (0.1 increments) then refine around the best region.
Learned weights: Make weights trainable parameters in your model. Initialize near your intuition (0.5, 0.2, 0.15, 0.15), then let backpropagation optimize them. Add a softmax constraint to ensure they sum to 1.0.
Task-specific tuning: For anomaly detection, boost numerical features (bytes, durations often reveal attacks). For semantic search, boost text. For compliance filtering, boost categorical (severity, event type).
Validation is essential: Measure downstream task performance, not just embedding similarity. A hybrid embedding optimized for retrieval recall may differ from one optimized for classification accuracy. See Chapter 14 for multi-objective optimization strategies.
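To make the grid-search option concrete, here is a minimal sketch in a toy two-component setting. The synthetic data and leave-one-out 1-NN accuracy metric are illustrative stand-ins for your real features and downstream validation metric (recall@k, classification accuracy, etc.):

```python
"""Coarse grid search over hybrid-embedding weights (toy example)."""
import numpy as np
from sklearn.preprocessing import normalize

np.random.seed(0)

# Synthetic items: a "semantic" part and a "numerical" part, with labels
# that depend on both, so neither extreme weight should win.
n = 200
labels = np.random.randint(0, 2, n)
text = np.random.randn(n, 16) + labels[:, None] * 0.8
nums = np.random.randn(n, 4) + labels[:, None] * 0.8


def knn_accuracy(X, y):
    """Leave-one-out 1-NN label agreement under cosine similarity."""
    sims = normalize(X) @ normalize(X).T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    return np.mean(y[sims.argmax(axis=1)] == y)


# Coarse sweep (0.1 increments) over the text weight; numerical gets 1 - w
best = max(
    (knn_accuracy(
        np.hstack([normalize(text) * w, normalize(nums) * (1 - w)]),
        labels,
    ), round(float(w), 1))
    for w in np.arange(0.1, 1.0, 0.1)
)
print(f"Best text weight: {best[1]} (1-NN accuracy {best[0]:.3f})")
```

After the coarse sweep, refine with smaller increments around the best region, as described above.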
10.2.3 Entity Embeddings for Categorical Features
Categorical features like “event type” or “product category” are often encoded as one-hot vectors—sparse, high-dimensional, and unable to capture relationships. Entity embeddings offer a better approach: learn dense, low-dimensional representations where similar categories have similar embeddings.
Why entity embeddings work better than one-hot:
Dimensionality: With 7 categories, a 7-dim one-hot and an 8-dim learned embedding are similar in size (though the embedding is dense); with 10,000 categories, a 50-dim learned embedding is 200x smaller than the one-hot encoding
Relationships: Captures that “login” and “logout” are related, while one-hot treats all categories as equally distant
Generalization: Rare categories benefit from learned structure
Integration: Learned embeddings combine smoothly with other features in hybrid embeddings
How it works: Each categorical value gets an embedding vector (trainable parameters). During training, the model learns to position similar categories near each other in embedding space. In production, use PyTorch’s nn.Embedding or TensorFlow’s tf.keras.layers.Embedding.
Training entity embeddings: These embeddings must be learned from your data—they’re not pre-trained. You have two approaches:
End-to-end training: Include nn.Embedding layers in your neural network and train on your downstream task (classification, ranking, etc.). The model learns to position similar categories nearby based on task performance.
Pre-training with co-occurrence: If you have large unlabeled datasets, train embeddings by predicting co-occurrence patterns (e.g., which event types tend to appear together in sequences). This is analogous to Word2Vec but for categorical features.
Practical considerations:
Dimensionality: Start with min(50, num_categories // 2) as a heuristic, then tune as a hyperparameter
Regularization: Use dropout on embeddings to prevent overfitting rare categories
Cold start: For new categories not seen during training, use the mean embedding or a learned “unknown” embedding
Sharing: Categories that appear in similar contexts should share embeddings (e.g., event_type embeddings shared across all security products)
See Chapter 14 for detailed guidance on training categorical embeddings and Chapter 20 for handling large categorical vocabularies efficiently.
"""Entity Embeddings for Categorical FeaturesLearn dense representations for categorical values instead of sparse one-hot.This captures relationships between categories (e.g., similar event types)."""import numpy as np# Simulated learned embeddings for categorical features# In practice, use nn.Embedding in PyTorch/TensorFlowclass CategoryEmbedder:"""Simple category embedder (production would use nn.Embedding)."""def__init__(self, categories: list, embedding_dim: int=8):self.categories = {cat: i for i, cat inenumerate(categories)}self.embedding_dim = embedding_dim# Initialize random embeddings (would be learned in practice) np.random.seed(42)self.embeddings = np.random.randn(len(categories), embedding_dim) *0.1def embed(self, category: str) -> np.ndarray: idx =self.categories.get(category, 0)returnself.embeddings[idx]# Example: Event type embeddings for security logsevent_types = ['login', 'logout', 'file_access', 'network_connection','process_start', 'process_end', 'privilege_escalation']severity_levels = ['info', 'warning', 'error', 'critical']event_embedder = CategoryEmbedder(event_types, embedding_dim=8)severity_embedder = CategoryEmbedder(severity_levels, embedding_dim=4)# Embed categorical featuresevent_emb = event_embedder.embed('login')severity_emb = severity_embedder.embed('warning')# Combine into categorical embeddingcategorical_embedding = np.concatenate([event_emb, severity_emb])print(f"Event embedding shape: {event_emb.shape}")print(f"Severity embedding shape: {severity_emb.shape}")print(f"Combined categorical embedding: {categorical_embedding.shape}")
Numerical features need careful preprocessing before embedding. Raw numerical values often span wildly different scales (bytes in millions, durations in milliseconds) and follow long-tail distributions. Without preprocessing, large-scale features dominate similarity calculations—the same problem as naive concatenation. Proper preprocessing ensures each feature contributes proportionally:
"""Numerical Feature Preprocessing PipelineProper preprocessing for numerical features:1. Handle missing values2. Apply log transform for long-tail distributions3. Standardize to zero mean, unit variance4. L2-normalize the result"""import numpy as npfrom sklearn.preprocessing import StandardScalerclass NumericalPreprocessor:"""Preprocess numerical features for embedding."""def__init__(self, feature_names: list):self.feature_names = feature_namesself.scaler = StandardScaler()self.fitted =Falsedef fit(self, data: np.ndarray):"""Fit the scaler on training data."""# Apply log1p for long-tail features (bytes, counts) log_data = np.log1p(np.clip(data, 0, None))self.scaler.fit(log_data)self.fitted =Truereturnselfdef transform(self, data: np.ndarray) -> np.ndarray:"""Transform and normalize numerical features."""# Handle missing values data = np.nan_to_num(data, nan=0.0)# Log transform for long-tail distributions log_data = np.log1p(np.clip(data, 0, None))# Standardizeifself.fitted: scaled =self.scaler.transform(log_data.reshape(1, -1))[0]else: scaled = log_datareturn scaled# Example: Network metricsfeature_names = ['bytes_in', 'bytes_out', 'duration_ms', 'packet_count']preprocessor = NumericalPreprocessor(feature_names)# Simulate training data for fittingtrain_data = np.array([ [1024, 2048, 150, 10], [1000000, 500000, 5000, 1000], # Long-tail values [512, 1024, 50, 5],])preprocessor.fit(train_data)# Transform new data pointnew_data = np.array([50000, 25000, 200, 50])processed = preprocessor.transform(new_data)print("Original features:", new_data)print("Processed features:", np.round(processed, 3))
10.3 Multi-Vector Representations
Single-vector embeddings face a fundamental trade-off: compress an entire document into one fixed-length vector, inevitably losing detail. For long documents or when precise phrase matching matters, this compression is too lossy. Multi-vector representations solve this by using multiple embeddings per item.
The key insight: Instead of one 768-dim vector per document, use N vectors of 128 dims (one per token or sentence). Matching happens at the token level—find which query tokens match which document tokens. This preserves fine-grained information at the cost of 10-100x storage.
10.3.1 ColBERT-Style Late Interaction
ColBERT (Contextualized Late Interaction over BERT) pioneered this approach for document retrieval. Instead of encoding a document into a single vector, ColBERT produces a matrix where each row is the contextualized embedding of a token.
How late interaction works:
Encode: Pass query and document through BERT, producing one vector per token
Index: Store all document token vectors (this is the storage cost)
Retrieve: For each query token, find its max similarity to any document token
Score: Sum these max similarities—this is the document’s relevance score
Why it works: Allows exact phrase matching (if query token “machine” matches document token “machine” strongly, that’s captured) while still using semantic embeddings. Much more accurate than single-vector for long documents or when specific terms matter.
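The encode-retrieve-score loop above reduces to a "MaxSim" computation. A sketch with random matrices standing in for contextualized BERT token embeddings:

```python
"""ColBERT-style MaxSim scoring sketch.

Random vectors stand in for BERT token embeddings; dimensions and token
counts are illustrative.
"""
import numpy as np

np.random.seed(0)


def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Sum over query tokens of the max similarity to any document token."""
    # Normalize rows so dot products are cosine similarities
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                        # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # best doc token per query token


query = np.random.randn(4, 128)    # 4 query tokens, 128 dims each
doc_a = np.random.randn(50, 128)   # 50-token document, unrelated
# doc_b contains near-copies of two query tokens -> should score higher
doc_b = np.vstack([query[:2] + np.random.randn(2, 128) * 0.1,
                   np.random.randn(48, 128)])

print(f"doc_a score: {maxsim_score(query, doc_a):.3f}")
print(f"doc_b score: {maxsim_score(query, doc_b):.3f}  (contains query-like tokens)")
```

The score rewards documents containing strong per-token matches, which is exactly the phrase-matching behavior described above.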
Use libraries like colbert-ir or RAGatouille for production implementations
Fine-tuning ColBERT for your domain: While you can use pre-trained ColBERT models, domain-specific fine-tuning significantly improves accuracy. Legal documents, medical records, and code repositories have specialized vocabularies and matching patterns that generic models miss.
Training approach:
Start with pre-trained ColBERT: Initialize from a model trained on MS MARCO or similar large corpus
Gather domain data: Collect query-document pairs with relevance labels (clicks, ratings, or judgments)
Contrastive training: For each query, train to score relevant documents higher than irrelevant ones
Storage-quality trade-off: During fine-tuning, you can reduce embedding dimensions (128→64) to save storage, though this trades some accuracy
Fine-tuning typically requires 10K-100K labeled query-document pairs and 1-2 days on a single GPU. See Chapter 14 for guidance on when to fine-tune vs. use off-the-shelf models.
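The contrastive step amounts to softmax cross-entropy over candidate-document scores. A minimal numpy sketch (the scores here are invented; in practice they come from the MaxSim scorer):

```python
"""Contrastive (InfoNCE-style) loss over query-document scores.

Sketch only: the score vectors are made-up numbers with the relevant
document at index 0.
"""
import numpy as np


def contrastive_loss(scores: np.ndarray, positive_idx: int = 0) -> float:
    """Negative log softmax probability assigned to the relevant document."""
    scores = scores - scores.max()  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return float(-log_probs[positive_idx])


# One relevant document (index 0) and three negatives
good_model = np.array([8.0, 2.0, 1.5, 1.0])  # relevant doc scored highest
bad_model = np.array([2.0, 8.0, 1.5, 1.0])   # a negative scored highest

print(f"loss (good ranking): {contrastive_loss(good_model):.4f}")
print(f"loss (bad ranking):  {contrastive_loss(bad_model):.4f}")
```

Minimizing this loss pushes relevant documents above the negatives, which is the training signal the fine-tuning step relies on.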
10.4 Matryoshka Embeddings
Traditional embeddings force a hard choice: use 768 dimensions (expensive storage, slow search) or 128 dimensions (cheaper but less accurate). Matryoshka embeddings eliminate this trade-off by encoding information hierarchically—the first N dimensions form a valid embedding for any N.
How they work: Models are trained with a special multi-scale loss function. During training, the loss is computed not just on the full 768-dim embedding, but also on prefixes (first 64, 128, 256, 384 dims). This forces the model to pack the most important information into early dimensions, with refinements in later dimensions.
The key property: You can truncate a 768-dim Matryoshka embedding to 128 dims and still get semantically meaningful results. Unlike simply training a 128-dim model (which might be more accurate at 128 dims), Matryoshka gives you flexibility—one model, multiple dimension options.
Benefits of Matryoshka embeddings:
Use short prefixes for fast initial retrieval
Use full dimensions for final reranking
Adapt to latency/quality requirements at runtime
Reduce storage by storing only needed dimensions
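The mechanics are simple: truncate, then re-normalize. A sketch with a random placeholder vector (it is the Matryoshka training that makes the prefix semantically meaningful, not the truncation itself):

```python
"""Matryoshka-style truncation: keep the first k dims, then re-normalize.

The vector here is random, so this only demonstrates the mechanics; a
Matryoshka-trained model is what keeps truncated prefixes meaningful.
"""
import numpy as np


def truncate_embedding(emb: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and L2-normalize the result."""
    prefix = emb[:dims]
    return prefix / np.linalg.norm(prefix)


np.random.seed(0)
full = np.random.randn(768)

for dims in (64, 128, 256, 768):
    v = truncate_embedding(full, dims)
    print(f"{dims:>3} dims, norm {np.linalg.norm(v):.3f}, "
          f"storage {dims * 4} bytes (float32)")
```

First-stage retrieval can then run over the 64- or 128-dim prefixes, with the full vector reserved for reranking the top-k.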
Available models:
Nomic AI’s nomic-embed-text-v1.5 (768→64 dims)
Voyage AI’s models support variable dimensions
Sentence Transformers with Matryoshka training
10.5 Learned Sparse Embeddings
Dense embeddings (768 floats) excel at semantic matching but lack interpretability and require specialized vector databases. Sparse retrieval (BM25) is interpretable and uses standard inverted indices, but misses semantic relationships. Learned sparse embeddings like SPLADE combine both: use transformers to create sparse vectors with interpretable dimensions.
The innovation: Instead of encoding text into a dense 768-dim vector, predict an importance weight for each vocabulary term (30,000 terms). Most weights are zero—you get a sparse vector with 100-200 non-zero entries. Dimensions correspond to actual words, so you can see which terms matter.
Why this is powerful:
Semantic expansion: The model learns to activate related terms not in the original text. Query “ML models” activates dimensions for “machine learning,” “neural networks,” “deep learning”
Interpretability: You can inspect which vocabulary terms fired and why
Standard indexing: Sparse vectors work with inverted indices—no need for specialized HNSW or IVF indexes
Hybrid search: Combine with dense embeddings for best of both worlds
How it differs from BM25: BM25 activates exact term matches. SPLADE uses a transformer to learn which related terms should activate and with what weights. This captures semantics while maintaining sparsity.
How it works:
Pass text through a transformer encoder
For each vocabulary term, predict an importance weight
Result is a sparse vector (typically 100-200 non-zero terms out of 30K vocabulary)
Can be indexed with inverted indices for efficient retrieval
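A toy illustration of the resulting data structure and scoring. The term weights are invented for illustration; a real SPLADE model produces them from the transformer's vocabulary projection:

```python
"""Learned-sparse scoring sketch: sparse term-weight maps scored via an
inverted index. Weights are hand-written for illustration only."""
from collections import defaultdict

# Sparse vectors: {term: weight}, including expanded terms not in the raw text
query = {"ml": 1.2, "machine": 0.9, "learning": 0.8, "models": 0.7}
docs = {
    "doc1": {"machine": 1.1, "learning": 1.0, "neural": 0.6, "networks": 0.5},
    "doc2": {"cooking": 1.3, "recipes": 1.1, "kitchen": 0.4},
}

# Build an inverted index: term -> [(doc_id, weight), ...]
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, w in vec.items():
        index[term].append((doc_id, w))

# Score = dot product over shared terms, accumulated via the index
scores = defaultdict(float)
for term, q_weight in query.items():
    for doc_id, d_weight in index[term]:
        scores[doc_id] += q_weight * d_weight

# Only documents sharing at least one term are ever touched
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Because scoring only touches postings for the query's active terms, retrieval cost scales with sparsity rather than vocabulary size.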
Benefits of learned sparse:
Interpretable (dimensions correspond to vocabulary terms)
Works with inverted indices (fast exact matching)
Captures term expansion (related terms automatically included)
Combines well with dense embeddings (hybrid search)
Available implementations:
Primal library for SPLADE models
Pinecone and Qdrant support hybrid sparse+dense search
Training SPLADE models: Unlike off-the-shelf dense embeddings, SPLADE models require training on your domain to learn effective term expansion patterns. The model must learn which vocabulary terms are semantically related and should co-activate.
Training approach:
Architecture: Start with a pre-trained BERT/RoBERTa encoder, add a vocabulary projection layer (768 hidden dims → 30K vocab terms) with ReLU activation and log saturation to enforce sparsity
Loss function: Use contrastive loss over query-document pairs—maximize scores for relevant pairs, minimize for irrelevant pairs. Add FLOPS regularization to penalize excessive term activation (controls sparsity)
Data requirements: Need query-document pairs with relevance judgments. 100K-1M pairs for general domain, 10K-100K for specialized domains where pre-training helps
Sparsity trade-off: FLOPS regularization hyperparameter controls sparsity vs. quality. High regularization → 50-100 active terms (fast, interpretable). Low regularization → 200-300 active terms (more accurate, less sparse)
Domain adaptation: Can fine-tune existing SPLADE models (trained on MS MARCO) for your domain with 10K-50K domain-specific pairs. This adapts term expansion patterns—medical SPLADE learns “MI” expands to “myocardial infarction”, “heart attack”, etc.
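The FLOPS regularizer referenced above penalizes the squared mean activation of each vocabulary term across a batch, which discourages terms that fire on everything. A numpy sketch with random activations standing in for SPLADE term weights:

```python
"""FLOPS regularization sketch: penalize terms that activate often across
a batch, pushing the model toward sparser vectors."""
import numpy as np

np.random.seed(0)


def flops_regularizer(activations: np.ndarray) -> float:
    """Sum over vocabulary of the squared mean activation per term."""
    return float((activations.mean(axis=0) ** 2).sum())


batch, vocab = 8, 1000
dense_acts = np.abs(np.random.randn(batch, vocab))                # almost no zeros
sparse_acts = dense_acts * (np.random.rand(batch, vocab) < 0.05)  # ~5% active

print(f"FLOPS penalty, dense activations:  {flops_regularizer(dense_acts):.2f}")
print(f"FLOPS penalty, sparse activations: {flops_regularizer(sparse_acts):.2f}")
```

The total training loss would combine the contrastive objective with this penalty, weighted by the sparsity hyperparameter discussed above.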
See Chapter 14 for guidance on training sparse models and Chapter 20 for handling large vocabularies efficiently.
10.6 Time-Series Pattern Embeddings
Time-series data presents unique challenges for embeddings. Unlike text or images where pre-trained models excel, time-series patterns are highly domain-specific—a “normal” pattern in network traffic looks nothing like a “normal” pattern in heart rate data. Effective time-series embeddings must capture temporal structure: trends, seasonality, sudden changes, and oscillations.
What makes time-series different:
Variable length: Sensor readings might have 100 or 10,000 time steps
Temporal dependencies: The order matters—shuffling time steps destroys meaning
Scale sensitivity: Amplitude, frequency, and phase all carry information
Domain-specific patterns: What constitutes “similar” varies by application
Two main approaches: Random feature extraction (ROCKET) provides fast, training-free embeddings suitable for classification and similarity. Learned temporal encoders (LSTMs, Temporal CNNs, Transformers) require training data but can capture more complex patterns.
10.6.1 ROCKET: Random Convolutional Kernels
Time-series data—sensor readings, stock prices, network traffic—require specialized embeddings that capture temporal patterns. ROCKET (RandOm Convolutional KErnel Transform) is a surprisingly effective approach: apply thousands of random convolutional kernels and extract simple statistics.
How ROCKET works:
Generate random kernels: Create kernels with random weights, random lengths (3-9), and random dilations (1, 2, 4)
Apply convolution: Convolve each kernel with the time-series
Extract features: For each convolution output, compute max value and proportion of positive values (PPV)
Result: Fixed-length embedding (e.g., 10,000 features from 5,000 kernels × 2 statistics)
Why random kernels work: Different kernels capture different patterns—oscillations, trends, sudden changes. With enough random kernels, you’ll capture the patterns that matter. No training required—just apply and extract features.
When to use ROCKET: Fast time-series classification, anomaly detection, or similarity search when you don’t have labeled data to train a neural network. For more complex patterns or when you have labels, consider learned temporal models (LSTMs, Temporal CNNs).
"""ROCKET-Style Time-Series EmbeddingsUses random convolutional kernels to extract features from time-series.Fast to compute, works well for classification and similarity."""import numpy as npdef generate_random_kernels(n_kernels: int=100, max_length: int=9) ->list:"""Generate random convolutional kernels.""" np.random.seed(42) kernels = []for _ inrange(n_kernels): length = np.random.choice([3, 5, 7, 9]) weights = np.random.randn(length) bias = np.random.randn() dilation = np.random.choice([1, 2, 4]) kernels.append((weights, bias, dilation))return kernelsdef apply_kernel(series: np.ndarray, kernel: tuple) ->tuple:"""Apply a single kernel and extract features (max, ppv).""" weights, bias, dilation = kernel length =len(weights)# Dilated convolution output = []for i inrange(len(series) - (length -1) * dilation): indices = [i + j * dilation for j inrange(length)] value = np.dot(series[indices], weights) + bias output.append(value) output = np.array(output)# ROCKET features: max value and proportion of positive values (PPV) max_val = np.max(output) iflen(output) >0else0 ppv = np.mean(output >0) iflen(output) >0else0return max_val, ppvdef rocket_embedding(series: np.ndarray, kernels: list) -> np.ndarray:"""Create ROCKET embedding from time-series.""" features = []for kernel in kernels: max_val, ppv = apply_kernel(series, kernel) features.extend([max_val, ppv])return np.array(features)# Generate kernels (done once)kernels = generate_random_kernels(n_kernels=50)# Example time-series patternst = np.linspace(0, 4*np.pi, 100)patterns = {'sine': np.sin(t) + np.random.randn(100) *0.1,'cosine': np.cos(t) + np.random.randn(100) *0.1,'trend_up': t/10+ np.random.randn(100) *0.2,'random': np.random.randn(100),}# Create embeddingsembeddings = {name: rocket_embedding(series, kernels)for name, series in patterns.items()}print(f"ROCKET embedding dimension: {len(embeddings['sine'])}")print(f" ({len(kernels)} kernels × 2 features each)")# Compare patternsfrom sklearn.metrics.pairwise 
import cosine_similarityprint("\nPattern similarities:")print(f" sine ↔ cosine: {cosine_similarity([embeddings['sine']], [embeddings['cosine']])[0][0]:.3f}")print(f" sine ↔ trend: {cosine_similarity([embeddings['sine']], [embeddings['trend_up']])[0][0]:.3f}")print(f" sine ↔ random: {cosine_similarity([embeddings['sine']], [embeddings['random']])[0][0]:.3f}")
ROCKET embedding dimension: 100
(50 kernels × 2 features each)
Pattern similarities:
sine ↔ cosine: 0.998
sine ↔ trend: 0.828
sine ↔ random: 0.894
10.6.2 Learned Temporal Embeddings
For more complex patterns beyond ROCKET’s random features, production systems use neural architectures like LSTMs, Transformers, or Temporal CNNs to learn time-series representations. Libraries like tsai, sktime, and darts provide pre-built architectures for time-series embedding.
10.7 Binary and Quantized Embeddings
At billion-vector scale, storage and memory become critical bottlenecks. A float32 embedding of 768 dimensions requires 3KB per vector—1 billion vectors need 3TB of storage. Quantization compresses embeddings while preserving much of their semantic structure.
What are quantized embeddings?
Binary embeddings: Reduce each dimension to 1 bit (sign only), achieving 32x compression
Product quantization (PQ): Learn codebooks to compress subvectors, typically 8-16x compression
Scalar quantization: Convert float32 → int8, achieving 4x compression with minimal quality loss
Why quantization works: Embedding similarity is robust to small perturbations. You don’t need full float32 precision to determine that two vectors are similar—the coarse structure (sign patterns, quantized values) captures most semantic information.
Trade-offs: Binary embeddings lose ~10-20% recall compared to float32. Product quantization is more accurate but slower. The sweet spot: use quantized embeddings for first-stage retrieval (fast, approximate), then rerank top-k results with full precision.
"""Binary and Quantized EmbeddingsCompress embeddings for efficiency:- Binary: Each dimension → 1 bit (32x compression)- Product Quantization: Learn codebooks for compression"""import numpy as npdef binarize_embedding(embedding: np.ndarray) -> np.ndarray:"""Convert to binary embedding (sign of each dimension)."""return (embedding >0).astype(np.int8)def hamming_distance(bin1: np.ndarray, bin2: np.ndarray) ->int:"""Hamming distance between binary vectors."""return np.sum(bin1 != bin2)def hamming_similarity(bin1: np.ndarray, bin2: np.ndarray) ->float:"""Normalized Hamming similarity (0 to 1)."""return1- hamming_distance(bin1, bin2) /len(bin1)# Example: Compare binary vs float embeddingsnp.random.seed(42)emb1 = np.random.randn(768)emb2 = emb1 + np.random.randn(768) *0.5# Similaremb3 = np.random.randn(768) # Different# Float similarityfrom sklearn.metrics.pairwise import cosine_similarityfloat_sim_12 = cosine_similarity([emb1], [emb2])[0][0]float_sim_13 = cosine_similarity([emb1], [emb3])[0][0]# Binary similaritybin1, bin2, bin3 = [binarize_embedding(e) for e in [emb1, emb2, emb3]]bin_sim_12 = hamming_similarity(bin1, bin2)bin_sim_13 = hamming_similarity(bin1, bin3)print("Float vs Binary similarity comparison:")print(f"\n Similar pair:")print(f" Float cosine: {float_sim_12:.3f}")print(f" Binary Hamming: {bin_sim_12:.3f}")print(f"\n Different pair:")print(f" Float cosine: {float_sim_13:.3f}")print(f" Binary Hamming: {bin_sim_13:.3f}")print(f"\nStorage comparison for 768-dim embedding:")print(f" Float32: {768*4} bytes")print(f" Binary: {768//8} bytes ({768*4/ (768//8):.0f}x compression)")
Float vs Binary similarity comparison:
Similar pair:
Float cosine: 0.894
Binary Hamming: 0.859
Different pair:
Float cosine: -0.016
Binary Hamming: 0.504
Storage comparison for 768-dim embedding:
Float32: 3072 bytes
Binary: 96 bytes (32x compression)
When to use quantized embeddings:
Billions of vectors (storage constraints)
Latency-critical applications
First-stage retrieval (rerank with full precision)
Edge deployment
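Of the schemes above, scalar quantization is the simplest to sketch: map float32 values onto int8 levels with a stored scale. This is a minimal symmetric, per-vector variant; production quantizers typically calibrate scales per dimension or per subvector:

```python
"""Symmetric int8 scalar quantization sketch (4x compression).

Minimal per-vector scheme for illustration only."""
import numpy as np


def quantize_int8(emb: np.ndarray) -> tuple:
    """Map floats to int8 using a single symmetric scale."""
    scale = np.abs(emb).max() / 127.0
    q = np.clip(np.round(emb / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


np.random.seed(0)
emb = np.random.randn(768).astype(np.float32)
q, scale = quantize_int8(emb)
recovered = dequantize(q, scale)

cos = np.dot(emb, recovered) / (np.linalg.norm(emb) * np.linalg.norm(recovered))
print(f"Cosine(original, dequantized): {cos:.4f}")
print(f"Storage: {emb.nbytes} bytes -> {q.nbytes} bytes (4x, plus one scale)")
```

The near-unit cosine between original and dequantized vectors illustrates why int8 quantization typically costs little retrieval quality.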
10.8 Session and Behavioral Embeddings
User behavior unfolds as sequences of actions over time: browsing products, adding to cart, searching, checking out. These sequences contain valuable signals—“browse many, buy one” looks different from “search, add to cart, checkout.” Session embeddings capture these behavioral patterns as fixed-length vectors.
Why embed sessions:
Recommendation: Find users with similar browsing patterns to suggest relevant products
Intent prediction: Predict whether a session will end in purchase, cart abandonment, or bounce
User segmentation: Cluster users by behavioral patterns for targeted campaigns
The challenge: Sessions have variable length (3 actions vs 30 actions), temporal dependencies matter (order of actions carries meaning), and rare action combinations need to generalize. Simple bag-of-actions loses temporal structure; pure sequence models (LSTMs) can overfit.
Approach:
Learn embeddings for atomic actions (clicks, views, purchases)
Combine action sequences using:
Weighted averaging (recent actions weighted more heavily)
RNN/LSTM encoding for temporal dependencies
Transformer self-attention for long sequences
Use session embeddings for recommendations, anomaly detection, or user segmentation
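The weighted-averaging option can be sketched as follows; the action vocabulary, embedding dimension, and decay factor are illustrative stand-ins for learned values:

```python
"""Recency-weighted session embedding sketch.

Action embeddings are random stand-ins for learned nn.Embedding rows;
`decay` controls how strongly recent actions dominate."""
import numpy as np

np.random.seed(0)

actions = ['view_product', 'add_to_cart', 'search', 'checkout', 'remove_from_cart']
action_embs = {a: np.random.randn(32) for a in actions}  # would be learned


def session_embedding(session: list, decay: float = 0.8) -> np.ndarray:
    """Weighted average of action embeddings; the most recent action gets weight 1."""
    n = len(session)
    weights = np.array([decay ** (n - 1 - i) for i in range(n)])  # oldest smallest
    vecs = np.stack([action_embs[a] for a in session])
    emb = (weights[:, None] * vecs).sum(axis=0) / weights.sum()
    return emb / np.linalg.norm(emb)


buyer = session_embedding(['search', 'view_product', 'add_to_cart', 'checkout'])
browser = session_embedding(['view_product', 'view_product', 'search', 'view_product'])
print(f"buyer ↔ browser cosine: {np.dot(buyer, browser):.3f}")
```

Swapping the weighted average for an LSTM or Transformer encoder changes only the combination step; the action-embedding lookup stays the same.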
Libraries:
Merlin (NVIDIA), RecBole, and session-based recommendation frameworks provide implementations.
Training session embeddings: Action embeddings are the foundation of session embeddings and must be learned from your behavioral data. Unlike text or image embeddings where pre-trained models transfer well, behavioral patterns are highly platform-specific.
Training approach:
Action embeddings: Start with nn.Embedding for each action type (view_product, add_to_cart, etc.). Initialize randomly with dimension 32-128 based on vocabulary size.
Training objectives: Multiple objectives work better than one:
Next-action prediction: Given first N actions, predict action N+1 (like language modeling)
Session outcome prediction: Predict whether session ends in purchase, bounce, or cart abandonment
Session-session similarity: Contrastive loss—sessions from same user should be similar, sessions from different user behaviors (browsers vs. buyers) should differ
Sequence encoding: Choose based on your data:
Weighted average: Fast, works for short sessions (3-10 actions), recent actions weighted more
LSTM/GRU: Captures temporal dependencies, good for medium sessions (10-50 actions)
Transformer: Best for long sessions (50+ actions) with complex dependencies, but requires more data
Cold start handling: New actions not seen during training need embeddings—use mean of existing embeddings or train a separate encoder that maps action features (category, price tier) to embeddings
Data requirements: 100K-1M sessions for general behavioral modeling, 10K-100K if you have strong labels (purchases, conversions). Include negative examples (bounces, abandoned carts) to learn what not to recommend.
See Chapter 14 for multi-objective training strategies and Chapter 20 for handling large action vocabularies and user bases.
10.9 Domain-Specific Embeddings
Some domains require specialized embedding approaches.
10.9.1 Security Log Embeddings
Security logs exemplify the challenge of multi-modal, structured data. A single log event contains semantic text (“Failed login attempt”), categorical metadata (event type, severity), numerical metrics (bytes transferred, duration), and domain-specific features (IP addresses, ports). Effective security log embeddings must combine all of these meaningfully.
Why security logs need hybrid embeddings:
Semantic similarity alone fails: Two “login” events with different IPs may be unrelated (one internal, one external attack)
Structured features alone fail: Metadata without message text loses critical context
Scale mismatch: Text embeddings (384 dims), categorical (12 dims), numerical (3 dims), network (5 dims) have wildly different scales
The hybrid approach: This is the weighted normalized concatenation pattern from Section 10.2.2 applied to a real-world use case. Each feature type gets its own embedding pipeline, then all are normalized and weighted before combination. Weights (50% text, 20% categorical, 15% numerical, 15% network) are hyperparameters tuned for your specific detection task.
Real-world extensions: Use actual text encoders (sentence-transformers), train categorical embeddings on your log corpus, add temporal features (time-of-day embeddings), and tune weights on labeled anomaly data.
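As a concrete sketch of the pattern (random vectors simulate the text encoder and learned categorical embeddings; the IP encoder is a deliberately simplified stand-in):

```python
"""Security log hybrid embedding sketch (simulated encoders)."""
import numpy as np
from sklearn.preprocessing import normalize

np.random.seed(0)


def encode_ip(ip: str) -> np.ndarray:
    """Encode an IPv4 address as scaled octets plus a private-range flag.

    Simplified for illustration: omits the 172.16.0.0/12 private range.
    """
    octets = [int(o) / 255.0 for o in ip.split('.')]
    is_private = 1.0 if ip.startswith(('10.', '192.168.')) else 0.0
    return np.array(octets + [is_private])


def embed_log(message_emb, cat_emb, num_feat, src_ip, weights):
    """Normalize each component, weight it, and concatenate."""
    parts = [message_emb, cat_emb, num_feat, encode_ip(src_ip)]
    keys = ['text', 'categorical', 'numerical', 'network']
    return np.concatenate([
        normalize(p.reshape(1, -1))[0] * weights[k]
        for p, k in zip(parts, keys)
    ])


weights = {'text': 0.50, 'categorical': 0.20, 'numerical': 0.15, 'network': 0.15}
hybrid = embed_log(
    message_emb=np.random.randn(384),  # would come from a sentence transformer
    cat_emb=np.random.randn(12),       # learned event_type/severity embeddings
    num_feat=np.random.randn(3),       # scaled bytes_in, bytes_out, duration
    src_ip='192.168.1.10',
    weights=weights,
)
print(f"Hybrid security log embedding: {hybrid.shape}")  # 384 + 12 + 3 + 5 dims
```

The helper names (`encode_ip`, `embed_log`) and feature dimensions are illustrative; substitute your own encoders and the dimensions from your log schema.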
Training security log embeddings: While the preceding examples use simulated components, production security log embeddings require training both the categorical embeddings and the combination weights on your security data.
End-to-end training approach:
Categorical embeddings: Train nn.Embedding layers for event_type, severity, and other categorical fields on your log corpus. Use either:
Supervised: Train on labeled anomaly detection task—embeddings learn to separate normal from malicious events
Self-supervised: Predict co-occurrence patterns—events that appear together in attack sequences get similar embeddings
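The self-supervised co-occurrence idea can be illustrated with a toy update rule. This NumPy sketch performs one gradient step that pulls two co-occurring event types together; a real system would use `nn.Embedding` with an optimizer rather than hand-written updates:

```python
"""Toy self-supervised step for categorical event-type embeddings:
event types that co-occur in attack sequences get pulled together."""
import numpy as np

np.random.seed(1)
n_types, dim = 8, 16
emb = np.random.randn(n_types, dim)  # one row per event type

def cooccurrence_step(emb: np.ndarray, i: int, j: int, lr: float = 0.1) -> np.ndarray:
    """One SGD step minimizing 0.5 * ||e_i - e_j||^2 for co-occurring types i, j."""
    diff = emb[i] - emb[j]
    emb[i] -= lr * diff   # gradient w.r.t. e_i is +diff
    emb[j] += lr * diff   # gradient w.r.t. e_j is -diff
    return emb

before = np.linalg.norm(emb[0] - emb[1])
emb = cooccurrence_step(emb, 0, 1)
after = np.linalg.norm(emb[0] - emb[1])
print(f"{before:.3f} -> {after:.3f}")  # distance shrinks by factor (1 - 2*lr)
```

Repeated over many observed co-occurrences (with negative sampling to push unrelated types apart), this yields the embedding table described above.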
Weight optimization: Make weights trainable parameters instead of fixed hyperparameters:
weights = nn.Parameter(torch.tensor([0.5, 0.2, 0.15, 0.15]))  # text, cat, num, domain
weights = F.softmax(weights, dim=0)  # ensure they sum to 1.0
The model learns optimal weights during training—for threat detection, numerical features may dominate; for compliance search, text may dominate.
Training objective: Choose based on your use case:
Anomaly detection: Contrastive loss—normal logs cluster together, anomalies far from normal cluster
Threat hunting: Triplet loss—logs from same attack campaign close together, different campaigns far apart
Alert triage: Classification loss—predict severity, alert priority, or true positive vs. false positive
Multi-task learning: Train simultaneously on multiple objectives (anomaly detection + severity prediction + campaign clustering) with weighted loss combination. This prevents overfitting to one task.
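The triplet objective for threat hunting reduces to a short loss computation. This NumPy sketch shows the forward pass only, on hypothetical 64-dim log embeddings; a training loop would backpropagate through the encoder:

```python
"""Triplet margin loss on log embeddings: same-campaign logs pull together,
different-campaign logs push apart."""
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin) with Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

np.random.seed(0)
anchor = np.random.randn(64)                    # log from campaign A
positive = anchor + 0.05 * np.random.randn(64)  # same campaign, near anchor
negative = np.random.randn(64)                  # unrelated campaign, far away

loss = triplet_loss(anchor, positive, negative)
print(loss)  # 0.0: this triplet is already separated by more than the margin
```

Mining "hard" triplets, where the negative lies closer to the anchor than the margin allows, is what makes the loss nonzero and drives learning.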
Data requirements: 10K-100K labeled logs for supervised training (rare events, known attacks), or 100K-1M unlabeled logs for self-supervised approaches. Include temporal features if training for time-series anomaly detection.
Domain specialization: Pre-train on general security logs (OCSF corpus, public datasets) then fine-tune on your environment. This adapts embeddings to your specific threat landscape—healthcare sees different attacks than financial services.
See Chapter 14 for multi-objective training strategies and guidance on when domain-specific training justifies the cost vs. using off-the-shelf text embeddings with simple feature concatenation.
10.10 Choosing the Right Pattern
Advanced embedding pattern selection guide
| Pattern | Best For | Trade-offs |
|---|---|---|
| Hybrid vectors | Multi-faceted entities (logs, products) | Requires weight tuning |
| Multi-vector (ColBERT) | Fine-grained matching | 10-100x storage |
| Matryoshka | Variable quality/latency needs | Requires special training |
| Learned sparse (SPLADE) | Interpretability + performance | More complex indexing |
| ROCKET time-series | Pattern similarity | Fixed representation |
| Binary/quantized | Massive scale | Quality loss |
| Session embeddings | Behavioral patterns | Requires sequence modeling |
10.11 Key Takeaways
Naive concatenation fails when combining embeddings of different sizes—use weighted, normalized concatenation
Entity embeddings for categorical features outperform one-hot encoding by learning relationships between categories
Multi-vector representations (ColBERT) provide fine-grained matching at the cost of storage
Matryoshka embeddings enable quality/latency trade-offs at query time
Learned sparse embeddings (SPLADE) combine interpretability with semantic matching
Time-series patterns can be captured with ROCKET (fast, simple) or learned encoders (more expressive)
Domain-specific embeddings like security logs require thoughtful combination of semantic, categorical, numerical, and specialized features
10.12 Looking Ahead
This completes Part II on embedding types. Chapter 11 begins Part III: Core Applications, showing how to build retrieval-augmented generation systems that put these embeddings to work. For training custom embeddings with these patterns, Chapter 14 in Part IV provides guidance on when to build versus fine-tune.
10.13 Further Reading
Khattab, O. & Zaharia, M. (2020). “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR
Kusupati, A., et al. (2022). “Matryoshka Representation Learning.” NeurIPS
Formal, T., et al. (2021). “SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.” SIGIR
Dempster, A., et al. (2020). “ROCKET: Exceptionally Fast and Accurate Time Series Classification Using Random Convolutional Kernels.” Data Mining and Knowledge Discovery
Guo, C., et al. (2016). “Entity Embeddings of Categorical Variables.” arXiv:1604.06737