10  Advanced Embedding Types

Chapter Overview

Production embedding systems rarely use single, off-the-shelf embeddings. This chapter covers the advanced patterns that power real-world systems: hybrid vectors combining multiple feature types, multi-vector representations for fine-grained matching, learned sparse embeddings for interpretability, and domain-specific patterns for security, time-series, and structured data. These patterns build on the foundational types covered in Chapters 4-9.

10.1 Beyond Single Embeddings

The foundational embedding types—text, image, audio, and others—serve as building blocks. Production systems combine, extend, and specialize these foundations in sophisticated ways:

  • Hybrid embeddings combine semantic, categorical, numerical, and domain-specific features
  • Multi-vector representations use multiple embeddings per item for fine-grained matching
  • Learned sparse embeddings balance dense semantics with interpretable sparse features
  • Specialized architectures optimize for specific retrieval patterns

Understanding these patterns is essential for building embedding systems that perform well on real-world data.

10.2 Hybrid and Composite Embeddings

Real-world entities have multiple facets that single embeddings can’t capture. A security log has semantic content (message text), categorical features (event type, severity), numerical features (byte counts, durations), and domain-specific features (IP addresses). Hybrid embeddings combine all of these.

10.2.1 The Naive Approach Fails

Simple concatenation doesn’t work:

"""
Why Naive Concatenation Fails

When combining embeddings of different dimensions, larger vectors
dominate similarity calculations, drowning out smaller features.
"""

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

np.random.seed(42)

# Simulate: 384-dim text embedding + 10-dim numerical features
text_embedding = np.random.randn(384)
numerical_features = np.array([0.5, 0.8, 0.2, 0.1, 0.9, 0.3, 0.7, 0.4, 0.6, 0.5])

# Naive concatenation
naive_hybrid = np.concatenate([text_embedding, numerical_features])

# The problem: text embedding dominates
text_magnitude = np.linalg.norm(text_embedding)
num_magnitude = np.linalg.norm(numerical_features)

print("Magnitude comparison:")
print(f"  Text embedding (384 dims):     {text_magnitude:.2f}")
print(f"  Numerical features (10 dims):  {num_magnitude:.2f}")
print(f"  Ratio: {text_magnitude/num_magnitude:.1f}x")
print("\nThe text embedding will dominate similarity calculations!")
Magnitude comparison:
  Text embedding (384 dims):     18.67
  Numerical features (10 dims):  1.76
  Ratio: 10.6x

The text embedding will dominate similarity calculations!

10.2.2 Weighted Normalized Concatenation

The solution: normalize each component, then apply importance weights:

"""
Weighted Normalized Concatenation

Properly combines multiple feature types by:
1. L2-normalizing each component independently
2. Applying learned or tuned weights
3. Concatenating the weighted, normalized components
"""

import numpy as np
from sklearn.preprocessing import normalize

np.random.seed(42)

def create_hybrid_embedding(
    text_embedding: np.ndarray,
    categorical_embedding: np.ndarray,
    numerical_features: np.ndarray,
    domain_features: np.ndarray,
    weights: dict
) -> np.ndarray:
    """
    Create a hybrid embedding from multiple feature types.

    Args:
        text_embedding: Semantic embedding from text encoder (e.g., 384 dims)
        categorical_embedding: Learned embeddings for categorical features
        numerical_features: Scaled numerical features
        domain_features: Domain-specific features (e.g., IP encoding)
        weights: Importance weights for each component (should sum to 1.0)

    Returns:
        Hybrid embedding vector
    """
    # L2-normalize each component
    text_norm = normalize(text_embedding.reshape(1, -1))[0]
    cat_norm = normalize(categorical_embedding.reshape(1, -1))[0]
    num_norm = normalize(numerical_features.reshape(1, -1))[0]
    domain_norm = normalize(domain_features.reshape(1, -1))[0]

    # Apply weights and concatenate
    hybrid = np.concatenate([
        text_norm * weights['text'],
        cat_norm * weights['categorical'],
        num_norm * weights['numerical'],
        domain_norm * weights['domain']
    ])

    return hybrid

# Example: Security log embedding
text_emb = np.random.randn(384)  # From sentence transformer
cat_emb = np.random.randn(32)    # Learned embeddings for event_type, severity
num_feat = np.random.randn(10)   # Scaled: bytes_in, bytes_out, duration
domain_feat = np.array([0.75, 0.65, 0.003, 0.039, 1.0])  # IP octets + is_private

# Weights are hyperparameters to tune
weights = {
    'text': 0.50,        # Semantic content is most important
    'categorical': 0.20, # Event type matters
    'numerical': 0.15,   # Metrics provide context
    'domain': 0.15       # IP information for security
}

hybrid = create_hybrid_embedding(
    text_emb, cat_emb, num_feat, domain_feat, weights
)

print(f"Hybrid embedding dimension: {len(hybrid)}")
print(f"  Text component: 384 dims × {weights['text']} weight")
print(f"  Categorical: 32 dims × {weights['categorical']} weight")
print(f"  Numerical: 10 dims × {weights['numerical']} weight")
print(f"  Domain: 5 dims × {weights['domain']} weight")
Hybrid embedding dimension: 431
  Text component: 384 dims × 0.5 weight
  Categorical: 32 dims × 0.2 weight
  Numerical: 10 dims × 0.15 weight
  Domain: 5 dims × 0.15 weight

Tuning hybrid embedding weights: The weights (0.50 text, 0.20 categorical, 0.15 numerical, 0.15 domain) are critical hyperparameters that determine the final embedding’s behavior. Don’t use equal weights—they ignore the relative importance of each feature type.

Three approaches to finding optimal weights:

  1. Grid search: Try different weight combinations on a validation set measuring your downstream task (classification accuracy, retrieval recall, etc.). Start coarse (0.1 increments) then refine around the best region.

  2. Learned weights: Make weights trainable parameters in your model. Initialize near your intuition (0.5, 0.2, 0.15, 0.15), then let backpropagation optimize them. Add a softmax constraint to ensure they sum to 1.0.

  3. Task-specific tuning: For anomaly detection, boost numerical features (bytes, durations often reveal attacks). For semantic search, boost text. For compliance filtering, boost categorical (severity, event type).

Validation is essential: Measure downstream task performance, not just embedding similarity. A hybrid embedding optimized for retrieval recall may differ from one optimized for classification accuracy. See Chapter 14 for multi-objective optimization strategies.
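The coarse grid search from approach 1 can be sketched as follows. The `validation_score` function here is a hypothetical stand-in—substitute whatever downstream metric you measure on your validation set (retrieval recall, classification accuracy, etc.):

```python
"""Coarse grid search over hybrid-embedding weights (sketch)."""
import itertools
import numpy as np

def candidate_weights(step=0.1):
    """Yield weight dicts over a coarse grid whose entries sum to 1.0."""
    vals = np.arange(step, 1.0, step)
    for t, c, n in itertools.product(vals, repeat=3):
        d = 1.0 - t - c - n
        if d >= step - 1e-9:
            yield {'text': t, 'categorical': c,
                   'numerical': n, 'domain': round(d, 2)}

# Placeholder metric -- replace with your real downstream evaluation.
def validation_score(weights: dict) -> float:
    return -abs(weights['text'] - 0.5) - abs(weights['categorical'] - 0.2)

best = max(candidate_weights(), key=validation_score)
print("Best coarse weights:", best)
```

Once the coarse search identifies a promising region, repeat with a smaller step (e.g., 0.05) around the best combination.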

10.2.3 Entity Embeddings for Categorical Features

Categorical features like “event type” or “product category” are often encoded as one-hot vectors—sparse, high-dimensional, and unable to capture relationships. Entity embeddings offer a better approach: learn dense, low-dimensional representations where similar categories have similar embeddings.

Why entity embeddings work better than one-hot:

  • Dimensionality: for small vocabularies the sizes are comparable (7 categories → 7-dim one-hot vs 8-dim learned embedding), but the embedding is dense; at scale the savings are large (10,000 categories can map to a 50-dim embedding)
  • Relationships: Captures that “login” and “logout” are related, while one-hot treats all categories as equally distant
  • Generalization: Rare categories benefit from learned structure
  • Integration: Learned embeddings combine smoothly with other features in hybrid embeddings

How it works: Each categorical value gets an embedding vector (trainable parameters). During training, the model learns to position similar categories near each other in embedding space. In production, use PyTorch’s nn.Embedding or TensorFlow’s tf.keras.layers.Embedding.

Training entity embeddings: These embeddings must be learned from your data—they’re not pre-trained. You have two approaches:

  1. End-to-end training: Include nn.Embedding layers in your neural network and train on your downstream task (classification, ranking, etc.). The model learns to position similar categories nearby based on task performance.

  2. Pre-training with co-occurrence: If you have large unlabeled datasets, train embeddings by predicting co-occurrence patterns (e.g., which event types tend to appear together in sequences). This is analogous to Word2Vec but for categorical features.

Practical considerations:

  • Dimensionality: Start with min(50, num_categories // 2) as a heuristic, then tune as a hyperparameter
  • Regularization: Use dropout on embeddings to prevent overfitting rare categories
  • Cold start: For new categories not seen during training, use the mean embedding or a learned “unknown” embedding
  • Sharing: Categories that appear in similar contexts should share embeddings (e.g., event_type embeddings shared across all security products)

See Chapter 14 for detailed guidance on training categorical embeddings and Chapter 20 for handling large categorical vocabularies efficiently.

"""
Entity Embeddings for Categorical Features

Learn dense representations for categorical values instead of sparse one-hot.
This captures relationships between categories (e.g., similar event types).
"""

import numpy as np

# Simulated learned embeddings for categorical features
# In practice, use nn.Embedding in PyTorch/TensorFlow

class CategoryEmbedder:
    """Simple category embedder (production would use nn.Embedding)."""

    def __init__(self, categories: list, embedding_dim: int = 8):
        self.categories = {cat: i for i, cat in enumerate(categories)}
        self.embedding_dim = embedding_dim
        # Initialize random embeddings (would be learned in practice)
        np.random.seed(42)
        self.embeddings = np.random.randn(len(categories), embedding_dim) * 0.1

    def embed(self, category: str) -> np.ndarray:
        # Unknown categories fall back to index 0; a production system would
        # use a dedicated "unknown" embedding (see the cold-start note above)
        idx = self.categories.get(category, 0)
        return self.embeddings[idx]

# Example: Event type embeddings for security logs
event_types = ['login', 'logout', 'file_access', 'network_connection',
               'process_start', 'process_end', 'privilege_escalation']
severity_levels = ['info', 'warning', 'error', 'critical']

event_embedder = CategoryEmbedder(event_types, embedding_dim=8)
severity_embedder = CategoryEmbedder(severity_levels, embedding_dim=4)

# Embed categorical features
event_emb = event_embedder.embed('login')
severity_emb = severity_embedder.embed('warning')

# Combine into categorical embedding
categorical_embedding = np.concatenate([event_emb, severity_emb])

print(f"Event embedding shape: {event_emb.shape}")
print(f"Severity embedding shape: {severity_emb.shape}")
print(f"Combined categorical embedding: {categorical_embedding.shape}")
Event embedding shape: (8,)
Severity embedding shape: (4,)
Combined categorical embedding: (12,)

10.2.4 Numerical Feature Preprocessing

Numerical features need careful preprocessing before embedding. Raw numerical values often span wildly different scales (bytes in millions, durations in milliseconds) and follow long-tail distributions. Without preprocessing, large-scale features dominate similarity calculations—the same problem as naive concatenation. Proper preprocessing ensures each feature contributes proportionally:

"""
Numerical Feature Preprocessing Pipeline

Proper preprocessing for numerical features:
1. Handle missing values
2. Apply log transform for long-tail distributions
3. Standardize to zero mean, unit variance
4. L2-normalize the result
"""

import numpy as np
from sklearn.preprocessing import StandardScaler

class NumericalPreprocessor:
    """Preprocess numerical features for embedding."""

    def __init__(self, feature_names: list):
        self.feature_names = feature_names
        self.scaler = StandardScaler()
        self.fitted = False

    def fit(self, data: np.ndarray):
        """Fit the scaler on training data."""
        # Apply log1p for long-tail features (bytes, counts)
        log_data = np.log1p(np.clip(data, 0, None))
        self.scaler.fit(log_data)
        self.fitted = True
        return self

    def transform(self, data: np.ndarray) -> np.ndarray:
        """Transform and normalize numerical features."""
        # Handle missing values
        data = np.nan_to_num(data, nan=0.0)

        # Log transform for long-tail distributions
        log_data = np.log1p(np.clip(data, 0, None))

        # Standardize (falls back to unscaled log features if fit() was never called)
        if self.fitted:
            scaled = self.scaler.transform(log_data.reshape(1, -1))[0]
        else:
            scaled = log_data

        return scaled

# Example: Network metrics
feature_names = ['bytes_in', 'bytes_out', 'duration_ms', 'packet_count']
preprocessor = NumericalPreprocessor(feature_names)

# Simulate training data for fitting
train_data = np.array([
    [1024, 2048, 150, 10],
    [1000000, 500000, 5000, 1000],  # Long-tail values
    [512, 1024, 50, 5],
])
preprocessor.fit(train_data)

# Transform new data point
new_data = np.array([50000, 25000, 200, 50])
processed = preprocessor.transform(new_data)

print("Original features:", new_data)
print("Processed features:", np.round(processed, 3))
Original features: [50000 25000   200    50]
Processed features: [ 0.533  0.325 -0.265  0.102]

10.3 Multi-Vector Representations

Single-vector embeddings face a fundamental trade-off: compress an entire document into one fixed-length vector, inevitably losing detail. For long documents or when precise phrase matching matters, this compression is too lossy. Multi-vector representations solve this by using multiple embeddings per item.

The key insight: Instead of one 768-dim vector per document, use N vectors of 128 dims (one per token or sentence). Matching happens at the token level—find which query tokens match which document tokens. This preserves fine-grained information at the cost of 10-100x storage.

10.3.1 ColBERT-Style Late Interaction

ColBERT (Contextualized Late Interaction over BERT) pioneered this approach for document retrieval. Instead of encoding a document into a single vector, ColBERT produces a matrix where each row is the contextualized embedding of a token.

How late interaction works:

  1. Encode: Pass query and document through BERT, producing one vector per token
  2. Index: Store all document token vectors (this is the storage cost)
  3. Retrieve: For each query token, find its max similarity to any document token
  4. Score: Sum these max similarities—this is the document’s relevance score

Why it works: Allows exact phrase matching (if query token “machine” matches document token “machine” strongly, that’s captured) while still using semantic embeddings. Much more accurate than single-vector for long documents or when specific terms matter.
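The four steps above reduce to a small amount of linear algebra. A minimal numpy sketch of MaxSim scoring, with random vectors standing in for real BERT token embeddings:

```python
"""Late-interaction (MaxSim) scoring -- minimal numpy sketch."""
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Sum over query tokens of the max cosine similarity to any doc token."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                       # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best doc token per query token

np.random.seed(0)
query = np.random.randn(4, 128)   # 4 query tokens, 128 dims each
# doc_a shares one token embedding with the query; doc_b is unrelated
doc_a = np.vstack([query[0], np.random.randn(9, 128)])
doc_b = np.random.randn(10, 128)

print(f"doc_a (overlapping token): {maxsim_score(query, doc_a):.3f}")
print(f"doc_b (unrelated):         {maxsim_score(query, doc_b):.3f}")
```

The shared token contributes a similarity of 1.0 for its query token, so the overlapping document scores higher—exactly the phrase-matching behavior described above.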

When to use multi-vector:

  • Fine-grained matching matters (exact phrase matching)
  • Documents are long and diverse
  • You can afford 10-100x storage overhead
  • Use libraries like colbert-ir or RAGatouille for production implementations

Fine-tuning ColBERT for your domain: While you can use pre-trained ColBERT models, domain-specific fine-tuning significantly improves accuracy. Legal documents, medical records, and code repositories have specialized vocabularies and matching patterns that generic models miss.

Training approach:

  1. Start with pre-trained ColBERT: Initialize from a model trained on MS MARCO or similar large corpus
  2. Gather domain data: Collect query-document pairs with relevance labels (clicks, ratings, or judgments)
  3. Contrastive training: For each query, train to score relevant documents higher than irrelevant ones
  4. Storage-quality trade-off: During fine-tuning, you can reduce embedding dimensions (128→64) to save storage, though this trades some accuracy

Fine-tuning typically requires 10K-100K labeled query-document pairs and 1-2 days on a single GPU. See Chapter 14 for guidance on when to fine-tune vs. use off-the-shelf models.

10.4 Matryoshka Embeddings

Traditional embeddings force a hard choice: use 768 dimensions (expensive storage, slow search) or 128 dimensions (cheaper but less accurate). Matryoshka embeddings eliminate this trade-off by encoding information hierarchically—the first N dimensions form a valid embedding for any N.

How they work: Models are trained with a special multi-scale loss function. During training, the loss is computed not just on the full 768-dim embedding, but also on prefixes (first 64, 128, 256, 384 dims). This forces the model to pack the most important information into early dimensions, with refinements in later dimensions.

The key property: You can truncate a 768-dim Matryoshka embedding to 128 dims and still get semantically meaningful results. Unlike simply training a 128-dim model (which might be more accurate at 128 dims), Matryoshka gives you flexibility—one model, multiple dimension options.

Benefits of Matryoshka embeddings:

  • Use short prefixes for fast initial retrieval
  • Use full dimensions for final reranking
  • Adapt to latency/quality requirements at runtime
  • Reduce storage by storing only needed dimensions
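The truncate-and-rerank pattern above is simple to implement: slice the prefix and re-normalize. A sketch, with random values standing in for the output of a Matryoshka-trained model (the property only holds for models trained with the multi-scale loss):

```python
"""Truncate-and-renormalize for Matryoshka embeddings (sketch)."""
import numpy as np

def truncate(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    prefix = embedding[:dims]
    return prefix / np.linalg.norm(prefix)

np.random.seed(0)
full = np.random.randn(768)       # stand-in for a Matryoshka model's output

fast = truncate(full, 128)        # coarse vector for first-stage retrieval
fine = truncate(full, 768)        # full vector for final reranking

print(fast.shape, fine.shape)     # (128,) (768,)
```

A typical pipeline retrieves candidates with the 128-dim prefixes, then reranks the top-k using the full 768 dimensions.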

Available models:

  • Nomic AI’s nomic-embed-text-v1.5 (768→64 dims)
  • Voyage AI’s models support variable dimensions
  • Sentence Transformers with Matryoshka training

10.5 Learned Sparse Embeddings

Dense embeddings (768 floats) excel at semantic matching but lack interpretability and require specialized vector databases. Sparse retrieval (BM25) is interpretable and uses standard inverted indices, but misses semantic relationships. Learned sparse embeddings like SPLADE combine both: use transformers to create sparse vectors with interpretable dimensions.

The innovation: Instead of encoding text into a dense 768-dim vector, predict an importance weight for each vocabulary term (30,000 terms). Most weights are zero—you get a sparse vector with 100-200 non-zero entries. Dimensions correspond to actual words, so you can see which terms matter.

Why this is powerful:

  • Semantic expansion: The model learns to activate related terms not in the original text. Query “ML models” activates dimensions for “machine learning,” “neural networks,” “deep learning”
  • Interpretability: You can inspect which vocabulary terms fired and why
  • Standard indexing: Sparse vectors work with inverted indices—no need for specialized HNSW or IVF indexes
  • Hybrid search: Combine with dense embeddings for best of both worlds

How it differs from BM25: BM25 activates exact term matches. SPLADE uses a transformer to learn which related terms should activate and with what weights. This captures semantics while maintaining sparsity.

How it works:

  1. Pass text through a transformer encoder
  2. For each vocabulary term, predict an importance weight
  3. Result is a sparse vector (typically 100-200 non-zero terms out of 30K vocabulary)
  4. Can be indexed with inverted indices for efficient retrieval
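A toy sketch of how such sparse vectors are scored. The term weights below are invented for illustration—a real SPLADE model predicts them from a transformer over a ~30K-term vocabulary—but the scoring is the same dot product an inverted index computes:

```python
"""Scoring learned sparse vectors with a toy inverted-index view (sketch)."""

# Sparse vectors stored as {term: weight} -- only non-zero entries kept.
# Note the semantic expansion: the query text said "ML", but the (invented)
# model also activated "machine" and "learning".
query = {"ml": 1.2, "machine": 0.9, "learning": 0.8, "models": 0.7}
docs = {
    "doc1": {"machine": 1.1, "learning": 1.0, "neural": 0.6},
    "doc2": {"cooking": 1.3, "recipes": 1.1},
}

def sparse_dot(q: dict, d: dict) -> float:
    """Dot product over shared terms -- what an inverted index computes."""
    return sum(w * d[t] for t, w in q.items() if t in d)

scores = {name: sparse_dot(query, d) for name, d in docs.items()}
print(scores)
```

Because only shared terms contribute, documents with no overlapping activated terms score zero and never need to be visited—this is what makes inverted-index retrieval fast.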

Benefits of learned sparse:

  • Interpretable (dimensions correspond to vocabulary terms)
  • Works with inverted indices (fast exact matching)
  • Captures term expansion (related terms automatically included)
  • Combines well with dense embeddings (hybrid search)

Available implementations:

  • Naver's official SPLADE repository (naver/splade) for models and training code
  • Pinecone and Qdrant support hybrid sparse+dense search

Training SPLADE models: Unlike off-the-shelf dense embeddings, SPLADE models require training on your domain to learn effective term expansion patterns. The model must learn which vocabulary terms are semantically related and should co-activate.

Training approach:

  1. Architecture: Start with a pre-trained BERT/RoBERTa encoder, add a vocabulary projection layer (768 hidden dims → 30K vocab terms) with ReLU activation and log saturation to enforce sparsity

  2. Loss function: Use contrastive loss over query-document pairs—maximize scores for relevant pairs, minimize for irrelevant pairs. Add FLOPS regularization to penalize excessive term activation (controls sparsity)

  3. Data requirements: Need query-document pairs with relevance judgments. 100K-1M pairs for general domain, 10K-100K for specialized domains where pre-training helps

  4. Sparsity trade-off: FLOPS regularization hyperparameter controls sparsity vs. quality. High regularization → 50-100 active terms (fast, interpretable). Low regularization → 200-300 active terms (more accurate, less sparse)

Domain adaptation: Can fine-tune existing SPLADE models (trained on MS MARCO) for your domain with 10K-50K domain-specific pairs. This adapts term expansion patterns—medical SPLADE learns “MI” expands to “myocardial infarction”, “heart attack”, etc.

See Chapter 14 for guidance on training sparse models and Chapter 20 for handling large vocabularies efficiently.

10.6 Time-Series Pattern Embeddings

Time-series data presents unique challenges for embeddings. Unlike text or images where pre-trained models excel, time-series patterns are highly domain-specific—a “normal” pattern in network traffic looks nothing like a “normal” pattern in heart rate data. Effective time-series embeddings must capture temporal structure: trends, seasonality, sudden changes, and oscillations.

What makes time-series different:

  • Variable length: Sensor readings might have 100 or 10,000 time steps
  • Temporal dependencies: The order matters—shuffling time steps destroys meaning
  • Scale sensitivity: Amplitude, frequency, and phase all carry information
  • Domain-specific patterns: What constitutes “similar” varies by application

Two main approaches: Random feature extraction (ROCKET) provides fast, training-free embeddings suitable for classification and similarity. Learned temporal encoders (LSTMs, Temporal CNNs, Transformers) require training data but can capture more complex patterns.

10.6.1 ROCKET: Random Convolutional Kernels

Time-series data—sensor readings, stock prices, network traffic—require specialized embeddings that capture temporal patterns. ROCKET (RandOm Convolutional KErnel Transform) is a surprisingly effective approach: apply thousands of random convolutional kernels and extract simple statistics.

How ROCKET works:

  1. Generate random kernels: Create kernels with random weights, random lengths (3-9), and random dilations (1, 2, 4)
  2. Apply convolution: Convolve each kernel with the time-series
  3. Extract features: For each convolution output, compute max value and proportion of positive values (PPV)
  4. Result: Fixed-length embedding (e.g., 10,000 features from 5,000 kernels × 2 statistics)

Why random kernels work: Different kernels capture different patterns—oscillations, trends, sudden changes. With enough random kernels, you’ll capture the patterns that matter. No training required—just apply and extract features.

When to use ROCKET: Fast time-series classification, anomaly detection, or similarity search when you don’t have labeled data to train a neural network. For more complex patterns or when you have labels, consider learned temporal models (LSTMs, Temporal CNNs).

"""
ROCKET-Style Time-Series Embeddings

Uses random convolutional kernels to extract features from time-series.
Fast to compute, works well for classification and similarity.
"""

import numpy as np

def generate_random_kernels(n_kernels: int = 100, max_length: int = 9) -> list:
    """Generate random convolutional kernels."""
    np.random.seed(42)
    kernels = []
    for _ in range(n_kernels):
        length = np.random.choice([3, 5, 7, 9])
        weights = np.random.randn(length)
        bias = np.random.randn()
        dilation = np.random.choice([1, 2, 4])
        kernels.append((weights, bias, dilation))
    return kernels

def apply_kernel(series: np.ndarray, kernel: tuple) -> tuple:
    """Apply a single kernel and extract features (max, ppv)."""
    weights, bias, dilation = kernel
    length = len(weights)

    # Dilated convolution
    output = []
    for i in range(len(series) - (length - 1) * dilation):
        indices = [i + j * dilation for j in range(length)]
        value = np.dot(series[indices], weights) + bias
        output.append(value)

    output = np.array(output)

    # ROCKET features: max value and proportion of positive values (PPV)
    max_val = np.max(output) if len(output) > 0 else 0
    ppv = np.mean(output > 0) if len(output) > 0 else 0

    return max_val, ppv

def rocket_embedding(series: np.ndarray, kernels: list) -> np.ndarray:
    """Create ROCKET embedding from time-series."""
    features = []
    for kernel in kernels:
        max_val, ppv = apply_kernel(series, kernel)
        features.extend([max_val, ppv])
    return np.array(features)

# Generate kernels (done once)
kernels = generate_random_kernels(n_kernels=50)

# Example time-series patterns
t = np.linspace(0, 4*np.pi, 100)
patterns = {
    'sine': np.sin(t) + np.random.randn(100) * 0.1,
    'cosine': np.cos(t) + np.random.randn(100) * 0.1,
    'trend_up': t/10 + np.random.randn(100) * 0.2,
    'random': np.random.randn(100),
}

# Create embeddings
embeddings = {name: rocket_embedding(series, kernels)
              for name, series in patterns.items()}

print(f"ROCKET embedding dimension: {len(embeddings['sine'])}")
print(f"  ({len(kernels)} kernels × 2 features each)")

# Compare patterns
from sklearn.metrics.pairwise import cosine_similarity
print("\nPattern similarities:")
print(f"  sine ↔ cosine: {cosine_similarity([embeddings['sine']], [embeddings['cosine']])[0][0]:.3f}")
print(f"  sine ↔ trend:  {cosine_similarity([embeddings['sine']], [embeddings['trend_up']])[0][0]:.3f}")
print(f"  sine ↔ random: {cosine_similarity([embeddings['sine']], [embeddings['random']])[0][0]:.3f}")
ROCKET embedding dimension: 100
  (50 kernels × 2 features each)

Pattern similarities:
  sine ↔ cosine: 0.998
  sine ↔ trend:  0.828
  sine ↔ random: 0.894

10.6.2 Learned Temporal Embeddings

For more complex patterns beyond ROCKET’s random features, production systems use neural architectures like LSTMs, Transformers, or Temporal CNNs to learn time-series representations. Libraries like tsai, sktime, and darts provide pre-built architectures for time-series embedding.

10.7 Binary and Quantized Embeddings

At billion-vector scale, storage and memory become critical bottlenecks. A float32 embedding of 768 dimensions requires 3KB per vector—1 billion vectors need 3TB of storage. Quantization compresses embeddings while preserving much of their semantic structure.

What are quantized embeddings?

  • Binary embeddings: Reduce each dimension to 1 bit (sign only), achieving 32x compression
  • Product quantization (PQ): Learn codebooks to compress subvectors, typically 8-16x compression
  • Scalar quantization: Convert float32 → int8, achieving 4x compression with minimal quality loss

Why quantization works: Embedding similarity is robust to small perturbations. You don’t need full float32 precision to determine that two vectors are similar—the coarse structure (sign patterns, quantized values) captures most semantic information.

Trade-offs: Binary embeddings lose ~10-20% recall compared to float32. Product quantization is more accurate but slower. The sweet spot: use quantized embeddings for first-stage retrieval (fast, approximate), then rerank top-k results with full precision.

"""
Binary and Quantized Embeddings

Compress embeddings for efficiency:
- Binary: Each dimension → 1 bit (32x compression)
- Product Quantization: Learn codebooks for compression
"""

import numpy as np

def binarize_embedding(embedding: np.ndarray) -> np.ndarray:
    """Convert to binary embedding (sign of each dimension)."""
    return (embedding > 0).astype(np.int8)

def hamming_distance(bin1: np.ndarray, bin2: np.ndarray) -> int:
    """Hamming distance between binary vectors."""
    return np.sum(bin1 != bin2)

def hamming_similarity(bin1: np.ndarray, bin2: np.ndarray) -> float:
    """Normalized Hamming similarity (0 to 1)."""
    return 1 - hamming_distance(bin1, bin2) / len(bin1)

# Example: Compare binary vs float embeddings
np.random.seed(42)
emb1 = np.random.randn(768)
emb2 = emb1 + np.random.randn(768) * 0.5  # Similar
emb3 = np.random.randn(768)  # Different

# Float similarity
from sklearn.metrics.pairwise import cosine_similarity
float_sim_12 = cosine_similarity([emb1], [emb2])[0][0]
float_sim_13 = cosine_similarity([emb1], [emb3])[0][0]

# Binary similarity
bin1, bin2, bin3 = [binarize_embedding(e) for e in [emb1, emb2, emb3]]
bin_sim_12 = hamming_similarity(bin1, bin2)
bin_sim_13 = hamming_similarity(bin1, bin3)

print("Float vs Binary similarity comparison:")
print(f"\n  Similar pair:")
print(f"    Float cosine: {float_sim_12:.3f}")
print(f"    Binary Hamming: {bin_sim_12:.3f}")
print(f"\n  Different pair:")
print(f"    Float cosine: {float_sim_13:.3f}")
print(f"    Binary Hamming: {bin_sim_13:.3f}")

print(f"\nStorage comparison for 768-dim embedding:")
print(f"  Float32: {768 * 4} bytes")
print(f"  Binary:  {768 // 8} bytes ({768 * 4 / (768 // 8):.0f}x compression)")
Float vs Binary similarity comparison:

  Similar pair:
    Float cosine: 0.894
    Binary Hamming: 0.859

  Different pair:
    Float cosine: -0.016
    Binary Hamming: 0.504

Storage comparison for 768-dim embedding:
  Float32: 3072 bytes
  Binary:  96 bytes (32x compression)

When to use quantized embeddings:

  • Billions of vectors (storage constraints)
  • Latency-critical applications
  • First-stage retrieval (rerank with full precision)
  • Edge deployment
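Scalar quantization, mentioned above as the 4x option, is simple enough to sketch directly—map each float to an int8 using a per-vector scale factor:

```python
"""Scalar (int8) quantization: 4x compression with minimal quality loss."""
import numpy as np

def quantize_int8(emb: np.ndarray):
    """Map float values to int8 in [-127, 127] with a per-vector scale."""
    scale = float(np.abs(emb).max()) / 127.0
    return np.round(emb / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

np.random.seed(42)
emb = np.random.randn(768).astype(np.float32)

q, scale = quantize_int8(emb)
recovered = dequantize(q, scale)
err = float(np.abs(emb - recovered).max())

print(f"Storage: {emb.nbytes} -> {q.nbytes} bytes (4x compression)")
print(f"Max reconstruction error: {err:.4f}")
```

The maximum reconstruction error is bounded by half the scale step, which is why int8 quantization barely affects similarity rankings in practice.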

10.8 Session and Behavioral Embeddings

User behavior unfolds as sequences of actions over time: browsing products, adding to cart, searching, checking out. These sequences contain valuable signals—“browse many, buy one” looks different from “search, add to cart, checkout.” Session embeddings capture these behavioral patterns as fixed-length vectors.

Why embed sessions:

  • Recommendation: Find users with similar browsing patterns to suggest relevant products
  • Anomaly detection: Detect unusual behavioral sequences (potential fraud, bots)
  • Intent prediction: Predict whether a session will end in purchase, cart abandonment, or bounce
  • User segmentation: Cluster users by behavioral patterns for targeted campaigns

The challenge: Sessions have variable length (3 actions vs 30 actions), temporal dependencies matter (order of actions carries meaning), and rare action combinations need to generalize. Simple bag-of-actions loses temporal structure; pure sequence models (LSTMs) can overfit.

Approach:

  1. Learn embeddings for atomic actions (clicks, views, purchases)
  2. Combine action sequences using:
    • Weighted averaging (recent actions weighted more heavily)
    • RNN/LSTM encoding for temporal dependencies
    • Transformer self-attention for long sequences
  3. Use session embeddings for recommendations, anomaly detection, or user segmentation
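The weighted-averaging option from step 2 can be sketched as follows. The action vectors here are random stand-ins for a trained embedding table (in production, an nn.Embedding layer learned as described below):

```python
"""Recency-weighted session embedding from action embeddings (sketch)."""
import numpy as np

np.random.seed(42)
ACTIONS = ['view_product', 'add_to_cart', 'search', 'checkout']
# Stand-in for a learned embedding table (nn.Embedding in practice)
action_emb = {a: np.random.randn(32) for a in ACTIONS}

def session_embedding(actions: list, decay: float = 0.8) -> np.ndarray:
    """Exponentially down-weight older actions, then L2-normalize."""
    n = len(actions)
    weights = np.array([decay ** (n - 1 - i) for i in range(n)])  # last action -> 1.0
    vecs = np.stack([action_emb[a] for a in actions])
    emb = (weights[:, None] * vecs).sum(axis=0) / weights.sum()
    return emb / np.linalg.norm(emb)

session = ['search', 'view_product', 'view_product', 'add_to_cart']
emb = session_embedding(session)
print(emb.shape)  # (32,)
```

The decay factor is a tunable hyperparameter: values near 1.0 approach a plain average, while smaller values make the embedding track the most recent actions.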

Libraries:

Merlin (NVIDIA), RecBole, and session-based recommendation frameworks provide implementations.

Training session embeddings: Action embeddings are the foundation of session embeddings and must be learned from your behavioral data. Unlike text or image embeddings where pre-trained models transfer well, behavioral patterns are highly platform-specific.

Training approach:

  1. Action embeddings: Start with nn.Embedding for each action type (view_product, add_to_cart, etc.). Initialize randomly with dimension 32-128 based on vocabulary size.

  2. Training objectives: Multiple objectives work better than one:

    • Next-action prediction: Given first N actions, predict action N+1 (like language modeling)
    • Session outcome prediction: Predict whether session ends in purchase, bounce, or cart abandonment
    • Session-session similarity: Contrastive loss—sessions from the same user should be similar; sessions reflecting different behavior types (browsers vs. buyers) should differ
  3. Sequence encoding: Choose based on your data:

    • Weighted average: Fast, works for short sessions (3-10 actions), recent actions weighted more
    • LSTM/GRU: Captures temporal dependencies, good for medium sessions (10-50 actions)
    • Transformer: Best for long sessions (50+ actions) with complex dependencies, but requires more data
  4. Cold start handling: New actions not seen during training need embeddings—use mean of existing embeddings or train a separate encoder that maps action features (category, price tier) to embeddings
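Steps 1-3 above fit together in a small PyTorch model. The sketch below shows next-action prediction (language-model style) with a GRU encoder; `VOCAB`, `DIM`, `HIDDEN`, and the `SessionEncoder` class are illustrative choices, not a specific library's API:

```python
import torch
import torch.nn as nn

VOCAB, DIM, HIDDEN = 50, 64, 128  # assumed action vocabulary and layer sizes

class SessionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.action_emb = nn.Embedding(VOCAB, DIM)     # step 1: learned action embeddings
        self.gru = nn.GRU(DIM, HIDDEN, batch_first=True)  # step 3: temporal encoding
        self.next_action = nn.Linear(HIDDEN, VOCAB)    # step 2: next-action head

    def forward(self, actions):                # actions: (batch, seq_len) int ids
        h, _ = self.gru(self.action_emb(actions))
        return self.next_action(h), h          # per-step logits, hidden states

model = SessionEncoder()
loss_fn = nn.CrossEntropyLoss()
batch = torch.randint(0, VOCAB, (8, 12))       # 8 sessions of 12 actions (random stand-in data)
logits, hidden = model(batch)
# Predict action t+1 from the hidden state after action t
loss = loss_fn(logits[:, :-1].reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
session_embedding = hidden[:, -1]              # (8, HIDDEN) final-state session vectors
```

The final hidden state doubles as the session embedding; additional heads (session outcome, contrastive similarity) can share the same encoder for the multi-objective training described above.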

Data requirements: 100K-1M sessions for general behavioral modeling, 10K-100K if you have strong labels (purchases, conversions). Include negative examples (bounces, abandoned carts) to learn what not to recommend.

See Chapter 14 for multi-objective training strategies and Chapter 20 for handling large action vocabularies and user bases.

10.9 Domain-Specific Embeddings

Some domains require specialized embedding approaches.

10.9.1 Security Log Embeddings

Security logs exemplify the challenge of multi-modal, structured data. A single log event contains semantic text (“Failed login attempt”), categorical metadata (event type, severity), numerical metrics (bytes transferred, duration), and domain-specific features (IP addresses, ports). Effective security log embeddings must combine all of these meaningfully.

Why security logs need hybrid embeddings:

  • Semantic similarity alone fails: Two “login” events with different IPs may be unrelated (one internal, one external attack)
  • Structured features alone fail: Metadata without message text loses critical context
  • Scale mismatch: Text embeddings (384 dims), categorical (12 dims), numerical (3 dims), network (5 dims) have wildly different scales

The hybrid approach: This example demonstrates the weighted normalized concatenation pattern applied to a real-world use case. Each feature type gets its own embedding pipeline, then all are normalized and weighted before combination. Weights (50% text, 20% categorical, 15% numerical, 15% network) are hyperparameters tuned for your specific detection task.

Real-world extensions: Use actual text encoders (sentence-transformers), train categorical embeddings on your log corpus, add temporal features (time-of-day embeddings), and tune weights on labeled anomaly data.

"""
Security Log Embedding (OCSF-style)

Hybrid embedding for security events combining:
- Semantic: Log message content
- Categorical: Event type, severity, status
- Numerical: Byte counts, durations
- Network: IP address encoding
"""

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

def encode_ip_address(ip: str) -> np.ndarray:
    """
    Encode IP address as 5-dim vector:
    - 4 normalized octets
    - 1 is_private indicator
    """
    try:
        octets = [int(x) for x in ip.split('.')]
        normalized = [o / 255.0 for o in octets]

        # Check if private IP
        is_private = (
            octets[0] == 10 or
            (octets[0] == 172 and 16 <= octets[1] <= 31) or
            (octets[0] == 192 and octets[1] == 168)
        )

        return np.array(normalized + [float(is_private)])
    except (ValueError, IndexError):
        # Malformed or non-IPv4 input falls back to a zero vector
        return np.zeros(5)

class SecurityLogEmbedder:
    """Create hybrid embeddings for security logs."""

    def __init__(self):
        np.random.seed(42)
        # Simulated text encoder (would use sentence-transformers)
        self.text_dim = 384
        # Category embeddings
        self.event_types = ['login', 'logout', 'file_access', 'network', 'process']
        self.event_embeddings = np.random.randn(len(self.event_types), 8) * 0.1
        self.severities = ['info', 'warning', 'error', 'critical']
        self.severity_embeddings = np.random.randn(len(self.severities), 4) * 0.1

        # Weights for combining
        self.weights = {
            'text': 0.50,
            'categorical': 0.20,
            'numerical': 0.15,
            'network': 0.15
        }

    def embed(self, log: dict) -> np.ndarray:
        """Create hybrid embedding for a security log."""
        # Text embedding (simulated)
        np.random.seed(hash(log.get('message', '')) % 2**32)
        text_emb = np.random.randn(self.text_dim)

        # Categorical embeddings
        event_idx = self.event_types.index(log.get('event_type', 'network'))
        severity_idx = self.severities.index(log.get('severity', 'info'))
        cat_emb = np.concatenate([
            self.event_embeddings[event_idx],
            self.severity_embeddings[severity_idx]
        ])

        # Numerical features
        num_features = np.array([
            np.log1p(log.get('bytes_in', 0)),
            np.log1p(log.get('bytes_out', 0)),
            np.log1p(log.get('duration_ms', 0)),
        ])

        # Network features
        ip_emb = encode_ip_address(log.get('src_ip', '0.0.0.0'))

        # Normalize and weight
        text_norm = normalize(text_emb.reshape(1, -1))[0] * self.weights['text']
        cat_norm = normalize(cat_emb.reshape(1, -1))[0] * self.weights['categorical']
        num_norm = normalize(num_features.reshape(1, -1))[0] * self.weights['numerical']
        ip_norm = normalize(ip_emb.reshape(1, -1))[0] * self.weights['network']

        return np.concatenate([text_norm, cat_norm, num_norm, ip_norm])

# Example
embedder = SecurityLogEmbedder()

log1 = {
    'message': 'Failed login attempt from external IP',
    'event_type': 'login',
    'severity': 'warning',
    'bytes_in': 1024,
    'bytes_out': 512,
    'duration_ms': 150,
    'src_ip': '203.0.113.50'
}

log2 = {
    'message': 'Successful login from internal network',
    'event_type': 'login',
    'severity': 'info',
    'bytes_in': 2048,
    'bytes_out': 1024,
    'duration_ms': 100,
    'src_ip': '192.168.1.50'
}

emb1 = embedder.embed(log1)
emb2 = embedder.embed(log2)

print(f"Security log embedding dimension: {len(emb1)}")
print(f"  Text: 384, Categorical: 12, Numerical: 3, Network: 5")
print(f"\nLog similarity: {cosine_similarity([emb1], [emb2])[0][0]:.3f}")
Security log embedding dimension: 404
  Text: 384, Categorical: 12, Numerical: 3, Network: 5

Log similarity: 0.125

Training security log embeddings: While the code above uses simulated components, production security log embeddings require training both the categorical embeddings and the combination weights on your security data.

End-to-end training approach:

  1. Categorical embeddings: Train nn.Embedding layers for event_type, severity, and other categorical fields on your log corpus. Use either:

    • Supervised: Train on labeled anomaly detection task—embeddings learn to separate normal from malicious events
    • Self-supervised: Predict co-occurrence patterns—events that appear together in attack sequences get similar embeddings
  2. Weight optimization: Make weights trainable parameters instead of fixed hyperparameters:

    raw_weights = nn.Parameter(torch.tensor([0.5, 0.2, 0.15, 0.15]))  # text, cat, num, domain
    weights = F.softmax(raw_weights, dim=0)  # apply in forward() so the normalized weights stay trainable

    The model learns optimal weights during training—for threat detection, numerical features may dominate; for compliance search, text may dominate.

  3. Training objective: Choose based on your use case:

    • Anomaly detection: Contrastive loss—normal logs cluster together, anomalies far from normal cluster
    • Threat hunting: Triplet loss—logs from same attack campaign close together, different campaigns far apart
    • Alert triage: Classification loss—predict severity, alert priority, or true positive vs. false positive
  4. Multi-task learning: Train simultaneously on multiple objectives (anomaly detection + severity prediction + campaign clustering) with weighted loss combination. This prevents overfitting to one task.
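The trainable-weight idea from step 2 can be packaged as a small module. The sketch below is hypothetical: `HybridCombiner` is not a library class, and the per-modality dimensions mirror the 384/12/3/5 split from the example above. Keeping the raw weights as a parameter and applying softmax in `forward()` lets gradients from any of the objectives in step 3 adjust the modality balance:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridCombiner(nn.Module):
    """Combine pre-computed per-modality embeddings with learned softmax weights."""
    def __init__(self, n_modalities: int = 4):
        super().__init__()
        # Unnormalized weights; softmax in forward() keeps them summing to 1
        self.raw_weights = nn.Parameter(torch.zeros(n_modalities))

    def forward(self, parts: list) -> torch.Tensor:
        w = F.softmax(self.raw_weights, dim=0)
        # L2-normalize each modality, then scale by its learned weight
        scaled = [F.normalize(p, dim=-1) * w[i] for i, p in enumerate(parts)]
        return torch.cat(scaled, dim=-1)

combiner = HybridCombiner()
parts = [torch.randn(2, 384), torch.randn(2, 12), torch.randn(2, 3), torch.randn(2, 5)]
emb = combiner(parts)  # (2, 404); raw_weights train jointly with the downstream loss
```

With zero-initialized raw weights the modalities start equally weighted (0.25 each); training then shifts the balance toward whichever features the objective rewards.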

Data requirements: 10K-100K labeled logs for supervised training (rare events, known attacks), or 100K-1M unlabeled logs for self-supervised approaches. Include temporal features if training for time-series anomaly detection.

Domain specialization: Pre-train on general security logs (OCSF corpus, public datasets) then fine-tune on your environment. This adapts embeddings to your specific threat landscape—healthcare sees different attacks than financial services.

See Chapter 14 for multi-objective training strategies and guidance on when domain-specific training justifies the cost vs. using off-the-shelf text embeddings with simple feature concatenation.

10.10 Choosing the Right Pattern

Advanced embedding pattern selection guide

| Pattern                  | Best For                                | Trade-offs                 |
|--------------------------|-----------------------------------------|----------------------------|
| Hybrid vectors           | Multi-faceted entities (logs, products) | Requires weight tuning     |
| Multi-vector (ColBERT)   | Fine-grained matching                   | 10-100x storage            |
| Matryoshka               | Variable quality/latency needs          | Requires special training  |
| Learned sparse (SPLADE)  | Interpretability + performance          | More complex indexing      |
| ROCKET time-series       | Pattern similarity                      | Fixed representation       |
| Binary/quantized         | Massive scale                           | Quality loss               |
| Session embeddings       | Behavioral patterns                     | Requires sequence modeling |

10.11 Key Takeaways

  • Naive concatenation fails when combining embeddings of different sizes—use weighted, normalized concatenation
  • Entity embeddings for categorical features outperform one-hot encoding by learning relationships between categories
  • Multi-vector representations (ColBERT) provide fine-grained matching at the cost of storage
  • Matryoshka embeddings enable quality/latency trade-offs at query time
  • Learned sparse embeddings (SPLADE) combine interpretability with semantic matching
  • Time-series patterns can be captured with ROCKET (fast, simple) or learned encoders (more expressive)
  • Domain-specific embeddings like security logs require thoughtful combination of semantic, categorical, numerical, and specialized features

10.12 Looking Ahead

This completes Part II on embedding types. Chapter 11 begins Part III: Core Applications, showing how to build retrieval-augmented generation systems that put these embeddings to work. For training custom embeddings with these patterns, Chapter 14 in Part IV provides guidance on when to build versus fine-tune.

10.13 Further Reading

  • Khattab, O. & Zaharia, M. (2020). “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR
  • Kusupati, A., et al. (2022). “Matryoshka Representation Learning.” NeurIPS
  • Formal, T., et al. (2021). “SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.” SIGIR
  • Dempster, A., et al. (2020). “ROCKET: Exceptionally Fast and Accurate Time Series Classification Using Random Convolutional Kernels.” Data Mining and Knowledge Discovery
  • Guo, C., et al. (2016). “Entity Embeddings of Categorical Variables.” arXiv:1604.06737