Production embedding systems rarely use single, off-the-shelf embeddings. This chapter covers the advanced patterns that power real-world systems: hybrid vectors combining multiple feature types, multi-vector representations for fine-grained matching, learned sparse embeddings for interpretability, and domain-specific patterns for security, time-series, and structured data. These patterns build on the foundational types covered in Chapters 4-9.
10.1 Beyond Single Embeddings
The foundational embedding types—text, image, audio, and others—serve as building blocks. Production systems combine, extend, and specialize these foundations in sophisticated ways:
Hybrid embeddings combine semantic, categorical, numerical, and domain-specific features
Multi-vector representations use multiple embeddings per item for fine-grained matching
Learned sparse embeddings balance dense semantics with interpretable sparse features
Specialized architectures optimize for specific retrieval patterns
Understanding these patterns is essential for building embedding systems that perform well on real-world data.
10.2 Hybrid and Composite Embeddings
Real-world entities have multiple facets that single embeddings can’t capture. A security log has semantic content (message text), categorical features (event type, severity), numerical features (byte counts, durations), and domain-specific features (IP addresses). Hybrid embeddings combine all of these.
10.2.1 The Naive Approach Fails
Simple concatenation doesn’t work:
"""Why Naive Concatenation FailsWhen combining embeddings of different dimensions, larger vectorsdominate similarity calculations, drowning out smaller features."""import numpy as npfrom sklearn.metrics.pairwise import cosine_similaritynp.random.seed(42)# Simulate: 384-dim text embedding + 10-dim numerical featurestext_embedding = np.random.randn(384)numerical_features = np.array([0.5, 0.8, 0.2, 0.1, 0.9, 0.3, 0.7, 0.4, 0.6, 0.5])# Naive concatenationnaive_hybrid = np.concatenate([text_embedding, numerical_features])# The problem: text embedding dominatestext_magnitude = np.linalg.norm(text_embedding)num_magnitude = np.linalg.norm(numerical_features)print("Magnitude comparison:")print(f" Text embedding (384 dims): {text_magnitude:.2f}")print(f" Numerical features (10 dims): {num_magnitude:.2f}")print(f" Ratio: {text_magnitude/num_magnitude:.1f}x")print("\nThe text embedding will dominate similarity calculations!")
Magnitude comparison:
Text embedding (384 dims): 18.67
Numerical features (10 dims): 1.76
Ratio: 10.6x
The text embedding will dominate similarity calculations!
10.2.2 Weighted Normalized Concatenation
The solution: normalize each component, then apply importance weights:
"""Weighted Normalized ConcatenationProperly combines multiple feature types by:1. L2-normalizing each component independently2. Applying learned or tuned weights3. Concatenating the weighted, normalized components"""import numpy as npfrom sklearn.preprocessing import normalizenp.random.seed(42)def create_hybrid_embedding( text_embedding: np.ndarray, categorical_embedding: np.ndarray, numerical_features: np.ndarray, domain_features: np.ndarray, weights: dict) -> np.ndarray:""" Create a hybrid embedding from multiple feature types. Args: text_embedding: Semantic embedding from text encoder (e.g., 384 dims) categorical_embedding: Learned embeddings for categorical features numerical_features: Scaled numerical features domain_features: Domain-specific features (e.g., IP encoding) weights: Importance weights for each component (should sum to 1.0) Returns: Hybrid embedding vector """# L2-normalize each component text_norm = normalize(text_embedding.reshape(1, -1))[0] cat_norm = normalize(categorical_embedding.reshape(1, -1))[0] num_norm = normalize(numerical_features.reshape(1, -1))[0] domain_norm = normalize(domain_features.reshape(1, -1))[0]# Apply weights and concatenate hybrid = np.concatenate([ text_norm * weights['text'], cat_norm * weights['categorical'], num_norm * weights['numerical'], domain_norm * weights['domain'] ])return hybrid# Example: Security log embeddingtext_emb = np.random.randn(384) # From sentence transformercat_emb = np.random.randn(32) # Learned embeddings for event_type, severitynum_feat = np.random.randn(10) # Scaled: bytes_in, bytes_out, durationdomain_feat = np.array([0.75, 0.65, 0.003, 0.039, 1.0]) # IP octets + is_private# Weights are hyperparameters to tuneweights = {'text': 0.50, # Semantic content is most important'categorical': 0.20, # Event type matters'numerical': 0.15, # Metrics provide context'domain': 0.15# IP information for security}hybrid = create_hybrid_embedding( text_emb, cat_emb, num_feat, domain_feat, 
weights)print(f"Hybrid embedding dimension: {len(hybrid)}")print(f" Text component: 384 dims × {weights['text']} weight")print(f" Categorical: 32 dims × {weights['categorical']} weight")print(f" Numerical: 10 dims × {weights['numerical']} weight")print(f" Domain: 5 dims × {weights['domain']} weight")
Tuning hybrid embedding weights: The weights (0.50 text, 0.20 categorical, 0.15 numerical, 0.15 domain) are critical hyperparameters that determine the final embedding’s behavior. Don’t use equal weights—they ignore the relative importance of each feature type.
Three approaches to finding optimal weights:
Grid search: Try different weight combinations on a validation set measuring your downstream task (classification accuracy, retrieval recall, etc.). Start coarse (0.1 increments) then refine around the best region.
Learned weights: Make weights trainable parameters in your model. Initialize near your intuition (0.5, 0.2, 0.15, 0.15), then let backpropagation optimize them. Add a softmax constraint to ensure they sum to 1.0.
Task-specific tuning: For anomaly detection, boost numerical features (bytes, durations often reveal attacks). For semantic search, boost text. For compliance filtering, boost categorical (severity, event type).
Validation is essential: Measure downstream task performance, not just embedding similarity. A hybrid embedding optimized for retrieval recall may differ from one optimized for classification accuracy. See Chapter 14 for multi-objective optimization strategies.
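To make the grid-search option concrete, here is a minimal sketch in a toy two-component setting. The synthetic data and leave-one-out 1-NN accuracy metric are illustrative stand-ins for your real features and downstream validation metric (recall@k, classification accuracy, etc.):

```python
"""Coarse grid search over hybrid-embedding weights (toy example)."""
import numpy as np
from sklearn.preprocessing import normalize

np.random.seed(0)

# Synthetic items: a "semantic" part and a "numerical" part, with labels
# that depend on both, so neither extreme weight should win.
n = 200
labels = np.random.randint(0, 2, n)
text = np.random.randn(n, 16) + labels[:, None] * 0.8
nums = np.random.randn(n, 4) + labels[:, None] * 0.8


def knn_accuracy(X, y):
    """Leave-one-out 1-NN label agreement under cosine similarity."""
    sims = normalize(X) @ normalize(X).T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    return np.mean(y[sims.argmax(axis=1)] == y)


# Coarse sweep (0.1 increments) over the text weight; numerical gets 1 - w
best = max(
    (knn_accuracy(
        np.hstack([normalize(text) * w, normalize(nums) * (1 - w)]),
        labels,
    ), round(float(w), 1))
    for w in np.arange(0.1, 1.0, 0.1)
)
print(f"Best text weight: {best[1]} (1-NN accuracy {best[0]:.3f})")
```

After the coarse sweep, refine with smaller increments around the best region, as described above.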
10.2.3 Entity Embeddings for Categorical Features
Categorical features like “event type” or “product category” are often encoded as one-hot vectors—sparse, high-dimensional, and unable to capture relationships. Entity embeddings offer a better approach: learn dense, low-dimensional representations where similar categories have similar embeddings.
Why entity embeddings work better than one-hot:
Dimensionality: With 7 categories, a 7-dim one-hot and an 8-dim learned embedding are similar in size (though the embedding is dense); with 10,000 categories, a 50-dim learned embedding is 200x smaller than the one-hot encoding
Relationships: Captures that “login” and “logout” are related, while one-hot treats all categories as equally distant
Generalization: Rare categories benefit from learned structure
Integration: Learned embeddings combine smoothly with other features in hybrid embeddings
How it works: Each categorical value gets an embedding vector (trainable parameters). During training, the model learns to position similar categories near each other in embedding space. In production, use PyTorch’s nn.Embedding or TensorFlow’s tf.keras.layers.Embedding.
Training entity embeddings: These embeddings must be learned from your data—they’re not pre-trained. You have two approaches:
End-to-end training: Include nn.Embedding layers in your neural network and train on your downstream task (classification, ranking, etc.). The model learns to position similar categories nearby based on task performance.
Pre-training with co-occurrence: If you have large unlabeled datasets, train embeddings by predicting co-occurrence patterns (e.g., which event types tend to appear together in sequences). This is analogous to Word2Vec but for categorical features.
Practical considerations:
Dimensionality: Start with min(50, num_categories // 2) as a heuristic, then tune as a hyperparameter
Regularization: Use dropout on embeddings to prevent overfitting rare categories
Cold start: For new categories not seen during training, use the mean embedding or a learned “unknown” embedding
Sharing: Categories that appear in similar contexts should share embeddings (e.g., event_type embeddings shared across all security products)
See Chapter 14 for detailed guidance on training categorical embeddings and Chapter 20 for handling large categorical vocabularies efficiently.
"""Entity Embeddings for Categorical FeaturesLearn dense representations for categorical values instead of sparse one-hot.This captures relationships between categories (e.g., similar event types)."""import numpy as np# Simulated learned embeddings for categorical features# In practice, use nn.Embedding in PyTorch/TensorFlowclass CategoryEmbedder:"""Simple category embedder (production would use nn.Embedding)."""def__init__(self, categories: list, embedding_dim: int=8):self.categories = {cat: i for i, cat inenumerate(categories)}self.embedding_dim = embedding_dim# Initialize random embeddings (would be learned in practice) np.random.seed(42)self.embeddings = np.random.randn(len(categories), embedding_dim) *0.1def embed(self, category: str) -> np.ndarray: idx =self.categories.get(category, 0)returnself.embeddings[idx]# Example: Event type embeddings for security logsevent_types = ['login', 'logout', 'file_access', 'network_connection','process_start', 'process_end', 'privilege_escalation']severity_levels = ['info', 'warning', 'error', 'critical']event_embedder = CategoryEmbedder(event_types, embedding_dim=8)severity_embedder = CategoryEmbedder(severity_levels, embedding_dim=4)# Embed categorical featuresevent_emb = event_embedder.embed('login')severity_emb = severity_embedder.embed('warning')# Combine into categorical embeddingcategorical_embedding = np.concatenate([event_emb, severity_emb])print(f"Event embedding shape: {event_emb.shape}")print(f"Severity embedding shape: {severity_emb.shape}")print(f"Combined categorical embedding: {categorical_embedding.shape}")
Numerical features need careful preprocessing before embedding. Raw numerical values often span wildly different scales (bytes in millions, durations in milliseconds) and follow long-tail distributions. Without preprocessing, large-scale features dominate similarity calculations—the same problem as naive concatenation. Proper preprocessing ensures each feature contributes proportionally:
"""Numerical Feature Preprocessing PipelineProper preprocessing for numerical features:1. Handle missing values2. Apply log transform for long-tail distributions3. Standardize to zero mean, unit variance4. L2-normalize the result"""import numpy as npfrom sklearn.preprocessing import StandardScalerclass NumericalPreprocessor:"""Preprocess numerical features for embedding."""def__init__(self, feature_names: list):self.feature_names = feature_namesself.scaler = StandardScaler()self.fitted =Falsedef fit(self, data: np.ndarray):"""Fit the scaler on training data."""# Apply log1p for long-tail features (bytes, counts) log_data = np.log1p(np.clip(data, 0, None))self.scaler.fit(log_data)self.fitted =Truereturnselfdef transform(self, data: np.ndarray) -> np.ndarray:"""Transform and normalize numerical features."""# Handle missing values data = np.nan_to_num(data, nan=0.0)# Log transform for long-tail distributions log_data = np.log1p(np.clip(data, 0, None))# Standardizeifself.fitted: scaled =self.scaler.transform(log_data.reshape(1, -1))[0]else: scaled = log_datareturn scaled# Example: Network metricsfeature_names = ['bytes_in', 'bytes_out', 'duration_ms', 'packet_count']preprocessor = NumericalPreprocessor(feature_names)# Simulate training data for fittingtrain_data = np.array([ [1024, 2048, 150, 10], [1000000, 500000, 5000, 1000], # Long-tail values [512, 1024, 50, 5],])preprocessor.fit(train_data)# Transform new data pointnew_data = np.array([50000, 25000, 200, 50])processed = preprocessor.transform(new_data)print("Original features:", new_data)print("Processed features:", np.round(processed, 3))
10.3 Multi-Vector Representations
Single-vector embeddings face a fundamental trade-off: compress an entire document into one fixed-length vector, inevitably losing detail. For long documents or when precise phrase matching matters, this compression is too lossy. Multi-vector representations solve this by using multiple embeddings per item.
The key insight: Instead of one 768-dim vector per document, use N vectors of 128 dims (one per token or sentence). Matching happens at the token level—find which query tokens match which document tokens. This preserves fine-grained information at the cost of 10-100x storage.
10.3.1 ColBERT-Style Late Interaction
ColBERT (Contextualized Late Interaction over BERT) pioneered this approach for document retrieval. Instead of encoding a document into a single vector, ColBERT produces a matrix where each row is the contextualized embedding of a token.
How late interaction works:
Encode: Pass query and document through BERT, producing one vector per token
Index: Store all document token vectors (this is the storage cost)
Retrieve: For each query token, find its max similarity to any document token
Score: Sum these max similarities—this is the document’s relevance score
Why it works: Allows exact phrase matching (if query token “machine” matches document token “machine” strongly, that’s captured) while still using semantic embeddings. Much more accurate than single-vector for long documents or when specific terms matter.
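The encode-retrieve-score loop above reduces to a "MaxSim" computation. A sketch with random matrices standing in for contextualized BERT token embeddings:

```python
"""ColBERT-style MaxSim scoring sketch.

Random vectors stand in for BERT token embeddings; dimensions and token
counts are illustrative.
"""
import numpy as np

np.random.seed(0)


def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Sum over query tokens of the max similarity to any document token."""
    # Normalize rows so dot products are cosine similarities
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                        # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # best doc token per query token


query = np.random.randn(4, 128)    # 4 query tokens, 128 dims each
doc_a = np.random.randn(50, 128)   # 50-token document, unrelated
# doc_b contains near-copies of two query tokens -> should score higher
doc_b = np.vstack([query[:2] + np.random.randn(2, 128) * 0.1,
                   np.random.randn(48, 128)])

print(f"doc_a score: {maxsim_score(query, doc_a):.3f}")
print(f"doc_b score: {maxsim_score(query, doc_b):.3f}  (contains query-like tokens)")
```

The score rewards documents containing strong per-token matches, which is exactly the phrase-matching behavior described above.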
Use libraries like colbert-ir or RAGatouille for production implementations
Fine-tuning ColBERT for your domain: While you can use pre-trained ColBERT models, domain-specific fine-tuning significantly improves accuracy. Legal documents, medical records, and code repositories have specialized vocabularies and matching patterns that generic models miss.
Training approach:
Start with pre-trained ColBERT: Initialize from a model trained on MS MARCO or similar large corpus
Gather domain data: Collect query-document pairs with relevance labels (clicks, ratings, or judgments)
Contrastive training: For each query, train to score relevant documents higher than irrelevant ones
Storage-quality trade-off: During fine-tuning, you can reduce embedding dimensions (128→64) to save storage, though this trades some accuracy
Fine-tuning typically requires 10K-100K labeled query-document pairs and 1-2 days on a single GPU. See Chapter 14 for guidance on when to fine-tune vs. use off-the-shelf models.
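The contrastive step amounts to softmax cross-entropy over candidate-document scores. A minimal numpy sketch (the scores here are invented; in practice they come from the MaxSim scorer):

```python
"""Contrastive (InfoNCE-style) loss over query-document scores.

Sketch only: the score vectors are made-up numbers with the relevant
document at index 0.
"""
import numpy as np


def contrastive_loss(scores: np.ndarray, positive_idx: int = 0) -> float:
    """Negative log softmax probability assigned to the relevant document."""
    scores = scores - scores.max()  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return float(-log_probs[positive_idx])


# One relevant document (index 0) and three negatives
good_model = np.array([8.0, 2.0, 1.5, 1.0])  # relevant doc scored highest
bad_model = np.array([2.0, 8.0, 1.5, 1.0])   # a negative scored highest

print(f"loss (good ranking): {contrastive_loss(good_model):.4f}")
print(f"loss (bad ranking):  {contrastive_loss(bad_model):.4f}")
```

Minimizing this loss pushes relevant documents above the negatives, which is the training signal the fine-tuning step relies on.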
10.4 Matryoshka Embeddings
Traditional embeddings force a hard choice: use 768 dimensions (expensive storage, slow search) or 128 dimensions (cheaper but less accurate). Matryoshka embeddings eliminate this trade-off by encoding information hierarchically—the first N dimensions form a valid embedding for any N.
How they work: Models are trained with a special multi-scale loss function. During training, the loss is computed not just on the full 768-dim embedding, but also on prefixes (first 64, 128, 256, 384 dims). This forces the model to pack the most important information into early dimensions, with refinements in later dimensions.
The key property: You can truncate a 768-dim Matryoshka embedding to 128 dims and still get semantically meaningful results. Unlike simply training a 128-dim model (which might be more accurate at 128 dims), Matryoshka gives you flexibility—one model, multiple dimension options.
Benefits of Matryoshka embeddings:
Use short prefixes for fast initial retrieval
Use full dimensions for final reranking
Adapt to latency/quality requirements at runtime
Reduce storage by storing only needed dimensions
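The mechanics are simple: truncate, then re-normalize. A sketch with a random placeholder vector (it is the Matryoshka training that makes the prefix semantically meaningful, not the truncation itself):

```python
"""Matryoshka-style truncation: keep the first k dims, then re-normalize.

The vector here is random, so this only demonstrates the mechanics; a
Matryoshka-trained model is what keeps truncated prefixes meaningful.
"""
import numpy as np


def truncate_embedding(emb: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and L2-normalize the result."""
    prefix = emb[:dims]
    return prefix / np.linalg.norm(prefix)


np.random.seed(0)
full = np.random.randn(768)

for dims in (64, 128, 256, 768):
    v = truncate_embedding(full, dims)
    print(f"{dims:>3} dims, norm {np.linalg.norm(v):.3f}, "
          f"storage {dims * 4} bytes (float32)")
```

First-stage retrieval can then run over the 64- or 128-dim prefixes, with the full vector reserved for reranking the top-k.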
Available models:
Nomic AI’s nomic-embed-text-v1.5 (768→64 dims)
Voyage AI’s models support variable dimensions
Sentence Transformers with Matryoshka training
10.5 Learned Sparse Embeddings
Dense embeddings (768 floats) excel at semantic matching but lack interpretability and require specialized vector databases. Sparse retrieval (BM25) is interpretable and uses standard inverted indices, but misses semantic relationships. Learned sparse embeddings like SPLADE combine both: use transformers to create sparse vectors with interpretable dimensions.
The innovation: Instead of encoding text into a dense 768-dim vector, predict an importance weight for each vocabulary term (30,000 terms). Most weights are zero—you get a sparse vector with 100-200 non-zero entries. Dimensions correspond to actual words, so you can see which terms matter.
Why this is powerful:
Semantic expansion: The model learns to activate related terms not in the original text. Query “ML models” activates dimensions for “machine learning,” “neural networks,” “deep learning”
Interpretability: You can inspect which vocabulary terms fired and why
Standard indexing: Sparse vectors work with inverted indices—no need for specialized HNSW or IVF indexes
Hybrid search: Combine with dense embeddings for best of both worlds
How it differs from BM25: BM25 activates exact term matches. SPLADE uses a transformer to learn which related terms should activate and with what weights. This captures semantics while maintaining sparsity.
How it works:
Pass text through a transformer encoder
For each vocabulary term, predict an importance weight
Result is a sparse vector (typically 100-200 non-zero terms out of 30K vocabulary)
Can be indexed with inverted indices for efficient retrieval
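A toy illustration of the resulting data structure and scoring. The term weights are invented for illustration; a real SPLADE model produces them from the transformer's vocabulary projection:

```python
"""Learned-sparse scoring sketch: sparse term-weight maps scored via an
inverted index. Weights are hand-written for illustration only."""
from collections import defaultdict

# Sparse vectors: {term: weight}, including expanded terms not in the raw text
query = {"ml": 1.2, "machine": 0.9, "learning": 0.8, "models": 0.7}
docs = {
    "doc1": {"machine": 1.1, "learning": 1.0, "neural": 0.6, "networks": 0.5},
    "doc2": {"cooking": 1.3, "recipes": 1.1, "kitchen": 0.4},
}

# Build an inverted index: term -> [(doc_id, weight), ...]
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, w in vec.items():
        index[term].append((doc_id, w))

# Score = dot product over shared terms, accumulated via the index
scores = defaultdict(float)
for term, q_weight in query.items():
    for doc_id, d_weight in index[term]:
        scores[doc_id] += q_weight * d_weight

# Only documents sharing at least one term are ever touched
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Because scoring only touches postings for the query's active terms, retrieval cost scales with sparsity rather than vocabulary size.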
Benefits of learned sparse:
Interpretable (dimensions correspond to vocabulary terms)
Works with inverted indices (fast exact matching)
Captures term expansion (related terms automatically included)
Combines well with dense embeddings (hybrid search)
Available implementations:
Primal library for SPLADE models
Pinecone and Qdrant support hybrid sparse+dense search
Training SPLADE models: Unlike off-the-shelf dense embeddings, SPLADE models require training on your domain to learn effective term expansion patterns. The model must learn which vocabulary terms are semantically related and should co-activate.
Training approach:
Architecture: Start with a pre-trained BERT/RoBERTa encoder, add a vocabulary projection layer (768 hidden dims → 30K vocab terms) with ReLU activation and log saturation to enforce sparsity
Loss function: Use contrastive loss over query-document pairs—maximize scores for relevant pairs, minimize for irrelevant pairs. Add FLOPS regularization to penalize excessive term activation (controls sparsity)
Data requirements: Need query-document pairs with relevance judgments. 100K-1M pairs for general domain, 10K-100K for specialized domains where pre-training helps
Sparsity trade-off: FLOPS regularization hyperparameter controls sparsity vs. quality. High regularization → 50-100 active terms (fast, interpretable). Low regularization → 200-300 active terms (more accurate, less sparse)
Domain adaptation: Can fine-tune existing SPLADE models (trained on MS MARCO) for your domain with 10K-50K domain-specific pairs. This adapts term expansion patterns—medical SPLADE learns “MI” expands to “myocardial infarction”, “heart attack”, etc.
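The FLOPS regularizer referenced above penalizes the squared mean activation of each vocabulary term across a batch, which discourages terms that fire on everything. A numpy sketch with random activations standing in for SPLADE term weights:

```python
"""FLOPS regularization sketch: penalize terms that activate often across
a batch, pushing the model toward sparser vectors."""
import numpy as np

np.random.seed(0)


def flops_regularizer(activations: np.ndarray) -> float:
    """Sum over vocabulary of the squared mean activation per term."""
    return float((activations.mean(axis=0) ** 2).sum())


batch, vocab = 8, 1000
dense_acts = np.abs(np.random.randn(batch, vocab))                # almost no zeros
sparse_acts = dense_acts * (np.random.rand(batch, vocab) < 0.05)  # ~5% active

print(f"FLOPS penalty, dense activations:  {flops_regularizer(dense_acts):.2f}")
print(f"FLOPS penalty, sparse activations: {flops_regularizer(sparse_acts):.2f}")
```

The total training loss would combine the contrastive objective with this penalty, weighted by the sparsity hyperparameter discussed above.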
See Chapter 14 for guidance on training sparse models and Chapter 20 for handling large vocabularies efficiently.
10.6 Time-Series Pattern Embeddings
Time-series data presents unique challenges for embeddings. Unlike text or images where pre-trained models excel, time-series patterns are highly domain-specific—a “normal” pattern in network traffic looks nothing like a “normal” pattern in heart rate data. Effective time-series embeddings must capture temporal structure: trends, seasonality, sudden changes, and oscillations.
What makes time-series different:
Variable length: Sensor readings might have 100 or 10,000 time steps
Temporal dependencies: The order matters—shuffling time steps destroys meaning
Scale sensitivity: Amplitude, frequency, and phase all carry information
Domain-specific patterns: What constitutes “similar” varies by application
Two main approaches: Random feature extraction (ROCKET) provides fast, training-free embeddings suitable for classification and similarity. Learned temporal encoders (LSTMs, Temporal CNNs, Transformers) require training data but can capture more complex patterns.
10.6.1 ROCKET: Random Convolutional Kernels
Time-series data—sensor readings, stock prices, network traffic—require specialized embeddings that capture temporal patterns. ROCKET (RandOm Convolutional KErnel Transform) is a surprisingly effective approach: apply thousands of random convolutional kernels and extract simple statistics.
How ROCKET works:
Generate random kernels: Create kernels with random weights, random lengths (3-9), and random dilations (1, 2, 4)
Apply convolution: Convolve each kernel with the time-series
Extract features: For each convolution output, compute max value and proportion of positive values (PPV)
Result: Fixed-length embedding (e.g., 10,000 features from 5,000 kernels × 2 statistics)
Why random kernels work: Different kernels capture different patterns—oscillations, trends, sudden changes. With enough random kernels, you’ll capture the patterns that matter. No training required—just apply and extract features.
When to use ROCKET: Fast time-series classification, anomaly detection, or similarity search when you don’t have labeled data to train a neural network. For more complex patterns or when you have labels, consider learned temporal models (LSTMs, Temporal CNNs).
"""ROCKET-Style Time-Series EmbeddingsUses random convolutional kernels to extract features from time-series.Fast to compute, works well for classification and similarity."""import numpy as npdef generate_random_kernels(n_kernels: int=100, max_length: int=9) ->list:"""Generate random convolutional kernels.""" np.random.seed(42) kernels = []for _ inrange(n_kernels): length = np.random.choice([3, 5, 7, 9]) weights = np.random.randn(length) bias = np.random.randn() dilation = np.random.choice([1, 2, 4]) kernels.append((weights, bias, dilation))return kernelsdef apply_kernel(series: np.ndarray, kernel: tuple) ->tuple:"""Apply a single kernel and extract features (max, ppv).""" weights, bias, dilation = kernel length =len(weights)# Dilated convolution output = []for i inrange(len(series) - (length -1) * dilation): indices = [i + j * dilation for j inrange(length)] value = np.dot(series[indices], weights) + bias output.append(value) output = np.array(output)# ROCKET features: max value and proportion of positive values (PPV) max_val = np.max(output) iflen(output) >0else0 ppv = np.mean(output >0) iflen(output) >0else0return max_val, ppvdef rocket_embedding(series: np.ndarray, kernels: list) -> np.ndarray:"""Create ROCKET embedding from time-series.""" features = []for kernel in kernels: max_val, ppv = apply_kernel(series, kernel) features.extend([max_val, ppv])return np.array(features)# Generate kernels (done once)kernels = generate_random_kernels(n_kernels=50)# Example time-series patternst = np.linspace(0, 4*np.pi, 100)patterns = {'sine': np.sin(t) + np.random.randn(100) *0.1,'cosine': np.cos(t) + np.random.randn(100) *0.1,'trend_up': t/10+ np.random.randn(100) *0.2,'random': np.random.randn(100),}# Create embeddingsembeddings = {name: rocket_embedding(series, kernels)for name, series in patterns.items()}print(f"ROCKET embedding dimension: {len(embeddings['sine'])}")print(f" ({len(kernels)} kernels × 2 features each)")# Compare patternsfrom sklearn.metrics.pairwise 
import cosine_similarityprint("\nPattern similarities:")print(f" sine ↔ cosine: {cosine_similarity([embeddings['sine']], [embeddings['cosine']])[0][0]:.3f}")print(f" sine ↔ trend: {cosine_similarity([embeddings['sine']], [embeddings['trend_up']])[0][0]:.3f}")print(f" sine ↔ random: {cosine_similarity([embeddings['sine']], [embeddings['random']])[0][0]:.3f}")
ROCKET embedding dimension: 100
(50 kernels × 2 features each)
Pattern similarities:
sine ↔ cosine: 0.998
sine ↔ trend: 0.828
sine ↔ random: 0.894
10.6.2 Learned Temporal Embeddings
For more complex patterns beyond ROCKET’s random features, production systems use neural architectures like LSTMs, Transformers, or Temporal CNNs to learn time-series representations. Libraries like tsai, sktime, and darts provide pre-built architectures for time-series embedding.
10.7 Binary and Quantized Embeddings
At billion-vector scale, storage and memory become critical bottlenecks. A float32 embedding of 768 dimensions requires 3KB per vector—1 billion vectors need 3TB of storage. Quantization compresses embeddings while preserving much of their semantic structure.
What are quantized embeddings?
Binary embeddings: Reduce each dimension to 1 bit (sign only), achieving 32x compression
Product quantization (PQ): Learn codebooks to compress subvectors, typically 8-16x compression
Scalar quantization: Convert float32 → int8, achieving 4x compression with minimal quality loss
Why quantization works: Embedding similarity is robust to small perturbations. You don’t need full float32 precision to determine that two vectors are similar—the coarse structure (sign patterns, quantized values) captures most semantic information.
Trade-offs: Binary embeddings lose ~10-20% recall compared to float32. Product quantization is more accurate but slower. The sweet spot: use quantized embeddings for first-stage retrieval (fast, approximate), then rerank top-k results with full precision.
"""Binary and Quantized EmbeddingsCompress embeddings for efficiency:- Binary: Each dimension → 1 bit (32x compression)- Product Quantization: Learn codebooks for compression"""import numpy as npdef binarize_embedding(embedding: np.ndarray) -> np.ndarray:"""Convert to binary embedding (sign of each dimension)."""return (embedding >0).astype(np.int8)def hamming_distance(bin1: np.ndarray, bin2: np.ndarray) ->int:"""Hamming distance between binary vectors."""return np.sum(bin1 != bin2)def hamming_similarity(bin1: np.ndarray, bin2: np.ndarray) ->float:"""Normalized Hamming similarity (0 to 1)."""return1- hamming_distance(bin1, bin2) /len(bin1)# Example: Compare binary vs float embeddingsnp.random.seed(42)emb1 = np.random.randn(768)emb2 = emb1 + np.random.randn(768) *0.5# Similaremb3 = np.random.randn(768) # Different# Float similarityfrom sklearn.metrics.pairwise import cosine_similarityfloat_sim_12 = cosine_similarity([emb1], [emb2])[0][0]float_sim_13 = cosine_similarity([emb1], [emb3])[0][0]# Binary similaritybin1, bin2, bin3 = [binarize_embedding(e) for e in [emb1, emb2, emb3]]bin_sim_12 = hamming_similarity(bin1, bin2)bin_sim_13 = hamming_similarity(bin1, bin3)print("Float vs Binary similarity comparison:")print(f"\n Similar pair:")print(f" Float cosine: {float_sim_12:.3f}")print(f" Binary Hamming: {bin_sim_12:.3f}")print(f"\n Different pair:")print(f" Float cosine: {float_sim_13:.3f}")print(f" Binary Hamming: {bin_sim_13:.3f}")print(f"\nStorage comparison for 768-dim embedding:")print(f" Float32: {768*4} bytes")print(f" Binary: {768//8} bytes ({768*4/ (768//8):.0f}x compression)")
Float vs Binary similarity comparison:
Similar pair:
Float cosine: 0.894
Binary Hamming: 0.859
Different pair:
Float cosine: -0.016
Binary Hamming: 0.504
Storage comparison for 768-dim embedding:
Float32: 3072 bytes
Binary: 96 bytes (32x compression)
When to use quantized embeddings:
Billions of vectors (storage constraints)
Latency-critical applications
First-stage retrieval (rerank with full precision)
Edge deployment
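Of the schemes above, scalar quantization is the simplest to sketch: map float32 values onto int8 levels with a stored scale. This is a minimal symmetric, per-vector variant; production quantizers typically calibrate scales per dimension or per subvector:

```python
"""Symmetric int8 scalar quantization sketch (4x compression).

Minimal per-vector scheme for illustration only."""
import numpy as np


def quantize_int8(emb: np.ndarray) -> tuple:
    """Map floats to int8 using a single symmetric scale."""
    scale = np.abs(emb).max() / 127.0
    q = np.clip(np.round(emb / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


np.random.seed(0)
emb = np.random.randn(768).astype(np.float32)
q, scale = quantize_int8(emb)
recovered = dequantize(q, scale)

cos = np.dot(emb, recovered) / (np.linalg.norm(emb) * np.linalg.norm(recovered))
print(f"Cosine(original, dequantized): {cos:.4f}")
print(f"Storage: {emb.nbytes} bytes -> {q.nbytes} bytes (4x, plus one scale)")
```

The near-unit cosine between original and dequantized vectors illustrates why int8 quantization typically costs little retrieval quality.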
10.8 Session and Behavioral Embeddings
User behavior unfolds as sequences of actions over time: browsing products, adding to cart, searching, checking out. These sequences contain valuable signals—“browse many, buy one” looks different from “search, add to cart, checkout.” Session embeddings capture these behavioral patterns as fixed-length vectors.
Why embed sessions:
Recommendation: Find users with similar browsing patterns to suggest relevant products
Intent prediction: Predict whether a session will end in purchase, cart abandonment, or bounce
User segmentation: Cluster users by behavioral patterns for targeted campaigns
The challenge: Sessions have variable length (3 actions vs 30 actions), temporal dependencies matter (order of actions carries meaning), and rare action combinations need to generalize. Simple bag-of-actions loses temporal structure; pure sequence models (LSTMs) can overfit.
Approach:
Learn embeddings for atomic actions (clicks, views, purchases)
Combine action sequences using:
Weighted averaging (recent actions weighted more heavily)
RNN/LSTM encoding for temporal dependencies
Transformer self-attention for long sequences
Use session embeddings for recommendations, anomaly detection, or user segmentation
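The weighted-averaging option can be sketched as follows; the action vocabulary, embedding dimension, and decay factor are illustrative stand-ins for learned values:

```python
"""Recency-weighted session embedding sketch.

Action embeddings are random stand-ins for learned nn.Embedding rows;
`decay` controls how strongly recent actions dominate."""
import numpy as np

np.random.seed(0)

actions = ['view_product', 'add_to_cart', 'search', 'checkout', 'remove_from_cart']
action_embs = {a: np.random.randn(32) for a in actions}  # would be learned


def session_embedding(session: list, decay: float = 0.8) -> np.ndarray:
    """Weighted average of action embeddings; the most recent action gets weight 1."""
    n = len(session)
    weights = np.array([decay ** (n - 1 - i) for i in range(n)])  # oldest smallest
    vecs = np.stack([action_embs[a] for a in session])
    emb = (weights[:, None] * vecs).sum(axis=0) / weights.sum()
    return emb / np.linalg.norm(emb)


buyer = session_embedding(['search', 'view_product', 'add_to_cart', 'checkout'])
browser = session_embedding(['view_product', 'view_product', 'search', 'view_product'])
print(f"buyer ↔ browser cosine: {np.dot(buyer, browser):.3f}")
```

Swapping the weighted average for an LSTM or Transformer encoder changes only the combination step; the action-embedding lookup stays the same.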
Libraries:
Merlin (NVIDIA), RecBole, and session-based recommendation frameworks provide implementations.
Training session embeddings: Action embeddings are the foundation of session embeddings and must be learned from your behavioral data. Unlike text or image embeddings where pre-trained models transfer well, behavioral patterns are highly platform-specific.
Training approach:
Action embeddings: Start with nn.Embedding for each action type (view_product, add_to_cart, etc.). Initialize randomly with dimension 32-128 based on vocabulary size.
Training objectives: Multiple objectives work better than one:
Next-action prediction: Given first N actions, predict action N+1 (like language modeling)
Session outcome prediction: Predict whether session ends in purchase, bounce, or cart abandonment
Session-session similarity: Contrastive loss—sessions from same user should be similar, sessions from different user behaviors (browsers vs. buyers) should differ
Sequence encoding: Choose based on your data:
Weighted average: Fast, works for short sessions (3-10 actions), recent actions weighted more
LSTM/GRU: Captures temporal dependencies, good for medium sessions (10-50 actions)
Transformer: Best for long sessions (50+ actions) with complex dependencies, but requires more data
Cold start handling: New actions not seen during training need embeddings—use mean of existing embeddings or train a separate encoder that maps action features (category, price tier) to embeddings
Data requirements: 100K-1M sessions for general behavioral modeling, 10K-100K if you have strong labels (purchases, conversions). Include negative examples (bounces, abandoned carts) to learn what not to recommend.
See Chapter 14 for multi-objective training strategies and Chapter 20 for handling large action vocabularies and user bases.
10.9 Domain-Specific Embeddings
Some domains require specialized embedding approaches.
10.9.1 Security Log Embeddings
Security logs exemplify the challenge of multi-modal, structured data. A single log event contains semantic text (“Failed login attempt”), categorical metadata (event type, severity), numerical metrics (bytes transferred, duration), and domain-specific features (IP addresses, ports). Effective security log embeddings must combine all of these meaningfully.
Why security logs need hybrid embeddings:
Semantic similarity alone fails: Two “login” events with different IPs may be unrelated (one internal, one external attack)
Structured features alone fail: Metadata without message text loses critical context
Scale mismatch: Text embeddings (384 dims), categorical (12 dims), numerical (3 dims), network (5 dims) have wildly different scales
The hybrid approach: This is the weighted normalized concatenation pattern from Section 10.2.2 applied to a real-world use case. Each feature type gets its own embedding pipeline, then all are normalized and weighted before combination. Weights (50% text, 20% categorical, 15% numerical, 15% network) are hyperparameters tuned for your specific detection task.
Real-world extensions: Use actual text encoders (sentence-transformers), train categorical embeddings on your log corpus, add temporal features (time-of-day embeddings), and tune weights on labeled anomaly data.
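As a concrete sketch of the pattern (random vectors simulate the text encoder and learned categorical embeddings; the IP encoder is a deliberately simplified stand-in):

```python
"""Security log hybrid embedding sketch (simulated encoders)."""
import numpy as np
from sklearn.preprocessing import normalize

np.random.seed(0)


def encode_ip(ip: str) -> np.ndarray:
    """Encode an IPv4 address as scaled octets plus a private-range flag.

    Simplified for illustration: omits the 172.16.0.0/12 private range.
    """
    octets = [int(o) / 255.0 for o in ip.split('.')]
    is_private = 1.0 if ip.startswith(('10.', '192.168.')) else 0.0
    return np.array(octets + [is_private])


def embed_log(message_emb, cat_emb, num_feat, src_ip, weights):
    """Normalize each component, weight it, and concatenate."""
    parts = [message_emb, cat_emb, num_feat, encode_ip(src_ip)]
    keys = ['text', 'categorical', 'numerical', 'network']
    return np.concatenate([
        normalize(p.reshape(1, -1))[0] * weights[k]
        for p, k in zip(parts, keys)
    ])


weights = {'text': 0.50, 'categorical': 0.20, 'numerical': 0.15, 'network': 0.15}
hybrid = embed_log(
    message_emb=np.random.randn(384),  # would come from a sentence transformer
    cat_emb=np.random.randn(12),       # learned event_type/severity embeddings
    num_feat=np.random.randn(3),       # scaled bytes_in, bytes_out, duration
    src_ip='192.168.1.10',
    weights=weights,
)
print(f"Hybrid security log embedding: {hybrid.shape}")  # 384 + 12 + 3 + 5 dims
```

The helper names (`encode_ip`, `embed_log`) and feature dimensions are illustrative; substitute your own encoders and the dimensions from your log schema.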
Training security log embeddings: While the preceding examples use simulated components, production security log embeddings require training both the categorical embeddings and the combination weights on your security data.
End-to-end training approach:
Categorical embeddings: Train nn.Embedding layers for event_type, severity, and other categorical fields on your log corpus. Use either:
Supervised: Train on labeled anomaly detection task—embeddings learn to separate normal from malicious events
Self-supervised: Predict co-occurrence patterns—events that appear together in attack sequences get similar embeddings
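The self-supervised co-occurrence idea can be illustrated with a toy update rule. This NumPy sketch performs one gradient step that pulls two co-occurring event types together; a real system would use `nn.Embedding` with an optimizer rather than hand-written updates:

```python
"""Toy self-supervised step for categorical event-type embeddings:
event types that co-occur in attack sequences get pulled together."""
import numpy as np

np.random.seed(1)
n_types, dim = 8, 16
emb = np.random.randn(n_types, dim)  # one row per event type

def cooccurrence_step(emb: np.ndarray, i: int, j: int, lr: float = 0.1) -> np.ndarray:
    """One SGD step minimizing 0.5 * ||e_i - e_j||^2 for co-occurring types i, j."""
    diff = emb[i] - emb[j]
    emb[i] -= lr * diff   # gradient w.r.t. e_i is +diff
    emb[j] += lr * diff   # gradient w.r.t. e_j is -diff
    return emb

before = np.linalg.norm(emb[0] - emb[1])
emb = cooccurrence_step(emb, 0, 1)
after = np.linalg.norm(emb[0] - emb[1])
print(f"{before:.3f} -> {after:.3f}")  # distance shrinks by factor (1 - 2*lr)
```

Repeated over many observed co-occurrences (with negative sampling to push unrelated types apart), this yields the embedding table described above.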
Weight optimization: Make weights trainable parameters instead of fixed hyperparameters:
weights = nn.Parameter(torch.tensor([0.5, 0.2, 0.15, 0.15]))  # text, cat, num, domain
weights = F.softmax(weights, dim=0)  # ensure they sum to 1.0
The model learns optimal weights during training—for threat detection, numerical features may dominate; for compliance search, text may dominate.
Training objective: Choose based on your use case:
Anomaly detection: Contrastive loss—normal logs cluster together, anomalies far from normal cluster
Threat hunting: Triplet loss—logs from same attack campaign close together, different campaigns far apart
Alert triage: Classification loss—predict severity, alert priority, or true positive vs. false positive
Multi-task learning: Train simultaneously on multiple objectives (anomaly detection + severity prediction + campaign clustering) with weighted loss combination. This prevents overfitting to one task.
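The triplet objective for threat hunting reduces to a short loss computation. This NumPy sketch shows the forward pass only, on hypothetical 64-dim log embeddings; a training loop would backpropagate through the encoder:

```python
"""Triplet margin loss on log embeddings: same-campaign logs pull together,
different-campaign logs push apart."""
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin) with Euclidean distance."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

np.random.seed(0)
anchor = np.random.randn(64)                    # log from campaign A
positive = anchor + 0.05 * np.random.randn(64)  # same campaign, near anchor
negative = np.random.randn(64)                  # unrelated campaign, far away

loss = triplet_loss(anchor, positive, negative)
print(loss)  # 0.0: this triplet is already separated by more than the margin
```

Mining "hard" triplets, where the negative lies closer to the anchor than the margin allows, is what makes the loss nonzero and drives learning.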
Data requirements: 10K-100K labeled logs for supervised training (rare events, known attacks), or 100K-1M unlabeled logs for self-supervised approaches. Include temporal features if training for time-series anomaly detection.
Domain specialization: Pre-train on general security logs (OCSF corpus, public datasets) then fine-tune on your environment. This adapts embeddings to your specific threat landscape—healthcare sees different attacks than financial services.
See Chapter 14 for multi-objective training strategies and guidance on when domain-specific training justifies the cost vs. using off-the-shelf text embeddings with simple feature concatenation.
10.10 Choosing the Right Pattern
Advanced embedding pattern selection guide
| Pattern | Best For | Trade-offs |
|---|---|---|
| Hybrid vectors | Multi-faceted entities (logs, products) | Requires weight tuning |
| Multi-vector (ColBERT) | Fine-grained matching | 10-100x storage |
| Matryoshka | Variable quality/latency needs | Requires special training |
| Learned sparse (SPLADE) | Interpretability + performance | More complex indexing |
| ROCKET time-series | Pattern similarity | Fixed representation |
| Binary/quantized | Massive scale | Quality loss |
| Session embeddings | Behavioral patterns | Requires sequence modeling |
10.11 Key Takeaways
Naive concatenation fails when combining embeddings of different sizes—use weighted, normalized concatenation
Entity embeddings for categorical features outperform one-hot encoding by learning relationships between categories
Multi-vector representations (ColBERT) provide fine-grained matching at the cost of storage
Matryoshka embeddings enable quality/latency trade-offs at query time
Learned sparse embeddings (SPLADE) combine interpretability with semantic matching
Time-series patterns can be captured with ROCKET (fast, simple) or learned encoders (more expressive)
Domain-specific embeddings like security logs require thoughtful combination of semantic, categorical, numerical, and specialized features
10.12 Looking Ahead
This completes Part II on embedding types. Chapter 11 begins Part III: Core Applications, showing how to build retrieval-augmented generation systems that put these embeddings to work. For training custom embeddings with these patterns, Chapter 14 in Part IV provides guidance on when to build versus fine-tune.
10.13 Further Reading
Khattab, O. & Zaharia, M. (2020). “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” SIGIR
Kusupati, A., et al. (2022). “Matryoshka Representation Learning.” NeurIPS
Formal, T., et al. (2021). “SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.” SIGIR
Dempster, A., et al. (2020). “ROCKET: Exceptionally Fast and Accurate Time Series Classification Using Random Convolutional Kernels.” Data Mining and Knowledge Discovery
Guo, C., et al. (2016). “Entity Embeddings of Categorical Variables.” arXiv:1604.06737