18  Advanced Embedding Techniques

Note: Chapter Overview

As embedding systems mature, organizations need techniques that go beyond standard vector representations. This chapter explores five advanced approaches that address complex real-world challenges: hierarchical embeddings that preserve taxonomic structure, dynamic embeddings that capture temporal evolution, compositional embeddings for complex entities, uncertainty quantification for trustworthy predictions, and federated learning for privacy-preserving embedding training. These techniques unlock new possibilities for organizations handling structured knowledge graphs, time-varying data, multi-faceted entities, high-stakes decisions, and distributed sensitive data.

18.1 Hierarchical Embeddings for Taxonomies

Many enterprise domains have inherent hierarchical structure: product catalogs with categories and subcategories, organizational charts with departments and teams, medical ontologies with disease classifications, and scientific taxonomies. Standard embeddings treat all items as independent points in space, losing this valuable structural information. Hierarchical embeddings preserve taxonomic relationships while maintaining the benefits of vector representations.

18.1.1 The Hierarchical Challenge

Consider an e-commerce product catalog:

Electronics
├── Computers
│   ├── Laptops
│   │   ├── Gaming Laptops
│   │   └── Business Laptops
│   └── Desktops
└── Mobile Devices
    ├── Smartphones
    └── Tablets

A standard embedding might place “Gaming Laptops” and “Tablets” closer together than “Gaming Laptops” and “Business Laptops”, even though the latter pair shares more hierarchical structure. Hierarchical embeddings ensure that:

  1. Distance reflects hierarchy: Items in the same subtree are closer
  2. Transitivity is preserved: If A is parent of B and B is parent of C, embeddings reflect this chain
  3. Level information is encoded: Embeddings capture depth in the hierarchy

18.1.2 Hyperbolic Embeddings for Hierarchies

Euclidean space has a fundamental limitation: the volume within distance \(d\) of a point grows only polynomially in \(d\). Tree structures, however, grow exponentially: the number of nodes in a balanced tree multiplies by a constant factor at each level. Hyperbolic space has negative curvature, allowing exponential volume growth that naturally matches tree structure.

The Poincaré ball model represents hyperbolic space as the open unit ball in Euclidean space, equipped with the distance \(d(u, v) = \operatorname{arcosh}\left(1 + \frac{2\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)\) (shown here for unit curvature):

Hyperbolic Embedding Implementation
import torch
import torch.nn as nn


class HyperbolicEmbedding(nn.Module):
    """Hyperbolic embeddings in Poincaré ball for hierarchical data."""

    def __init__(self, num_items, embedding_dim, curvature=1.0):
        super().__init__()
        self.curvature = curvature
        self.embeddings = nn.Embedding(num_items, embedding_dim)
        nn.init.uniform_(self.embeddings.weight, -1e-3, 1e-3)

    def poincare_distance(self, u, v):
        """Compute Poincaré distance between points u and v for curvature -c."""
        c = self.curvature
        sqrt_c = c ** 0.5
        diff_norm_sq = torch.sum((u - v) ** 2, dim=-1)
        u_norm_sq = torch.sum(u ** 2, dim=-1)
        v_norm_sq = torch.sum(v ** 2, dim=-1)

        # General-curvature form; reduces to the standard formula when c = 1
        numerator = 2 * c * diff_norm_sq
        denominator = (1 - c * u_norm_sq) * (1 - c * v_norm_sq)
        return torch.acosh(1 + numerator / (denominator + 1e-7)) / sqrt_c

    def project_to_ball(self, x, eps=1e-5):
        """Project points to Poincaré ball (norm < 1)."""
        norm = torch.norm(x, p=2, dim=-1, keepdim=True)
        max_norm = 1 - eps
        return x / torch.clamp(norm / max_norm, min=1.0)

    def forward(self, indices):
        """Get embeddings and project to Poincaré ball."""
        emb = self.embeddings(indices)
        return self.project_to_ball(emb)


# Usage example
model = HyperbolicEmbedding(num_items=1000, embedding_dim=10, curvature=1.0)
indices = torch.tensor([0, 1, 10])
embeddings = model(indices)
distance = model.poincare_distance(embeddings[0], embeddings[1])
print(f"Hyperbolic distance: {distance.item():.4f}")
Hyperbolic distance: 0.0050

18.1.3 Enterprise Applications of Hierarchical Embeddings

1. Product Recommendation with Category Awareness

Hierarchical Product Recommender
import torch
import torch.nn as nn


class HierarchicalProductRecommender:
    """Product recommendation system using hyperbolic embeddings for category-aware recommendations."""

    def __init__(self, product_catalog, embedding_dim=10, curvature=1.0):
        self.catalog = product_catalog
        self.hyperbolic_model = HyperbolicEmbedding(len(product_catalog), embedding_dim, curvature)

    def recommend(self, product_id, top_k=10):
        """Recommend products ranked by hyperbolic distance; category structure is already encoded in the hyperbolic embedding space."""
        query_emb = self.hyperbolic_model(torch.tensor([product_id]))

        distances = []
        for pid in range(len(self.catalog)):
            if pid == product_id:
                continue
            prod_emb = self.hyperbolic_model(torch.tensor([pid]))
            dist = self.hyperbolic_model.poincare_distance(query_emb, prod_emb)
            distances.append((pid, dist.item()))

        distances.sort(key=lambda x: x[1])
        return [pid for pid, _ in distances[:top_k]]


# Usage example
catalog = {"laptop_gaming": 0, "laptop_business": 1, "phone": 2}
recommender = HierarchicalProductRecommender(catalog, embedding_dim=10)
recommendations = recommender.recommend(product_id=0, top_k=5)
print(f"Recommendations for product 0: {recommendations}")
Recommendations for product 0: [1, 2]

2. Knowledge Graph Embeddings

Medical ontologies, scientific taxonomies, and corporate knowledge bases benefit enormously from hyperbolic embeddings:

def embed_medical_ontology():
    """
    Medical ontology example: Disease hierarchies

    ICD-10 codes have 14,000+ diseases organized hierarchically
    Hyperbolic embeddings in 10-20 dimensions outperform
    Euclidean embeddings in 300-500 dimensions
    """
    # Example: Simplified disease taxonomy
    disease_taxonomy = {
        # Cardiovascular diseases
        'myocardial_infarction': 'ischemic_heart_disease',
        'angina': 'ischemic_heart_disease',
        'ischemic_heart_disease': 'cardiovascular_disease',

        'atrial_fibrillation': 'arrhythmia',
        'ventricular_tachycardia': 'arrhythmia',
        'arrhythmia': 'cardiovascular_disease',

        # Respiratory diseases
        'pneumonia': 'lower_respiratory_infection',
        'bronchitis': 'lower_respiratory_infection',
        'lower_respiratory_infection': 'respiratory_disease',

        'asthma': 'chronic_respiratory_disease',
        'copd': 'chronic_respiratory_disease',
        'chronic_respiratory_disease': 'respiratory_disease',
    }

    # HierarchicalEmbeddingTrainer is not defined in this chapter; a minimal
    # sketch of one possible implementation follows this listing.
    trainer = HierarchicalEmbeddingTrainer(
        disease_taxonomy,
        embedding_dim=10,
        curvature=1.0
    )

    trainer.train(num_epochs=2000, verbose=True)

    return trainer
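
The example above calls a HierarchicalEmbeddingTrainer that is not defined in this chapter. A minimal sketch of one possible implementation, assuming the interface used in embed_medical_ontology() and reusing the HyperbolicEmbedding class from Section 18.1.2 with a simple margin loss over (child, parent) pairs:

import torch


class HierarchicalEmbeddingTrainer:
    """Train hyperbolic embeddings so children stay closer to their parents
    than to randomly sampled negatives (a sketch, not a full implementation)."""

    def __init__(self, taxonomy, embedding_dim=10, curvature=1.0, lr=0.01):
        # taxonomy maps child name -> parent name
        self.names = sorted(set(taxonomy) | set(taxonomy.values()))
        self.index = {name: i for i, name in enumerate(self.names)}
        self.pairs = [(self.index[c], self.index[p]) for c, p in taxonomy.items()]
        self.model = HyperbolicEmbedding(len(self.names), embedding_dim, curvature)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)

    def train(self, num_epochs=1000, margin=1.0, verbose=False):
        for epoch in range(num_epochs):
            child_ids = torch.tensor([c for c, _ in self.pairs])
            parent_ids = torch.tensor([p for _, p in self.pairs])
            # Uniform negative sampling; collisions with true parents are rare
            neg_ids = torch.randint(0, len(self.names), (len(self.pairs),))

            child = self.model(child_ids)
            parent = self.model(parent_ids)
            negative = self.model(neg_ids)

            pos_dist = self.model.poincare_distance(child, parent)
            neg_dist = self.model.poincare_distance(child, negative)
            # Hinge loss: child-parent pairs should be at least `margin` closer than negatives
            loss = torch.clamp(margin + pos_dist - neg_dist, min=0).mean()

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            # Keep all embeddings inside the Poincaré ball after each Euclidean step
            with torch.no_grad():
                self.model.embeddings.weight.copy_(
                    self.model.project_to_ball(self.model.embeddings.weight)
                )

            if verbose and epoch % 500 == 0:
                print(f"epoch {epoch}: loss={loss.item():.4f}")
        return self

Under these assumptions, embed_medical_ontology() above runs end to end.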

Tip: Dimensionality Advantage

Hyperbolic embeddings typically achieve better hierarchical preservation in 5-20 dimensions than Euclidean embeddings in 100-500 dimensions. This reduces storage by 20-100x and speeds up similarity search by 10-50x.

Warning: Training Stability

Hyperbolic optimization can be unstable near the boundary of the Poincaré ball. Always use projection after gradient steps and consider adaptive learning rates that decrease when approaching the boundary.
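
One way to obtain such adaptive steps is the Riemannian SGD update of Nickel & Kiela (2017), which rescales the Euclidean gradient by the inverse square of the Poincaré metric's conformal factor, so steps automatically shrink near the boundary. A minimal sketch, assuming the embedding weight tensor already holds gradients from a backward pass:

import torch


def riemannian_sgd_step(embedding_weight, lr=0.1, eps=1e-5):
    """Apply one rescaled gradient step in-place, then re-project into the ball."""
    with torch.no_grad():
        grad = embedding_weight.grad
        norm_sq = torch.sum(embedding_weight ** 2, dim=-1, keepdim=True)
        scale = ((1 - norm_sq) ** 2) / 4           # inverse squared conformal factor
        embedding_weight -= lr * scale * grad      # smaller steps near the boundary
        # Project back inside the unit ball (norm < 1)
        norm = torch.norm(embedding_weight, p=2, dim=-1, keepdim=True)
        embedding_weight *= torch.clamp((1 - eps) / norm, max=1.0)


# Usage example: one manual step on the HyperbolicEmbedding weights
model = HyperbolicEmbedding(num_items=100, embedding_dim=10)
loss = model.poincare_distance(model(torch.tensor([0])), model(torch.tensor([1]))).mean()
loss.backward()
riemannian_sgd_step(model.embeddings.weight, lr=0.1)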

18.2 Dynamic Embeddings for Temporal Data

Most embedding systems assume data is static: a document has one embedding, a product has one representation. But real-world entities evolve: user interests shift, document relevance decays, product popularity cycles, and word meanings drift. Dynamic embeddings capture this temporal dimension.

18.2.1 The Temporal Challenge

Consider a news article about “AI”:

  • 2015: “AI” meant primarily machine learning and narrow applications
  • 2020: “AI” included transformers, GPT models, and broader capabilities
  • 2025: “AI” encompasses multimodal models, agents, and reasoning systems

A static embedding averages these meanings, losing temporal context. A dynamic embedding maintains separate representations for each time period or evolves continuously.

18.2.2 Approaches to Dynamic Embeddings

  1. Discrete Time Slices: Separate embeddings per time window
  2. Continuous Evolution: Embeddings as functions of time (a minimal sketch follows the implementation below)
  3. Recurrent Updates: Update embeddings based on new observations

Dynamic Embedding
import torch
import torch.nn as nn


class DynamicEmbedding(nn.Module):
    """Dynamic embeddings that evolve over time based on user interactions."""

    def __init__(self, num_items, embedding_dim, num_time_slices=10):
        super().__init__()
        self.num_time_slices = num_time_slices
        self.base_embeddings = nn.Embedding(num_items, embedding_dim)
        self.temporal_adjustment = nn.Embedding(num_time_slices, embedding_dim)

    def forward(self, item_ids, time_slice_ids):
        """Get time-aware embeddings."""
        base_emb = self.base_embeddings(item_ids)
        temporal_adj = self.temporal_adjustment(time_slice_ids)
        return base_emb + 0.1 * temporal_adj  # small fixed scale keeps temporal drift bounded

    def update_from_interactions(self, item_id, interaction_embedding, learning_rate=0.01):
        """Incrementally update embeddings based on new interactions."""
        with torch.no_grad():
            current = self.base_embeddings.weight[item_id]
            self.base_embeddings.weight[item_id] = current + learning_rate * (interaction_embedding - current)


# Usage example
model = DynamicEmbedding(num_items=1000, embedding_dim=128, num_time_slices=24)
item_ids = torch.tensor([10, 20, 30])
time_ids = torch.tensor([5, 5, 10])
embeddings = model(item_ids, time_ids)
print(f"Dynamic embeddings shape: {embeddings.shape}")
Dynamic embeddings shape: torch.Size([3, 128])
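
The DynamicEmbedding class above covers approaches 1 and 3 (discrete slices plus recurrent updates). A minimal sketch of approach 2, modeling embeddings as a continuous function of time via a sinusoidal time encoding fed through a small MLP (the encoding and layer sizes here are illustrative assumptions):

import math

import torch
import torch.nn as nn


class ContinuousTimeEmbedding(nn.Module):
    """Embeddings as a function of continuous time: a base vector plus a
    learned offset computed from a sinusoidal encoding of the timestamp."""

    def __init__(self, num_items, embedding_dim, time_features=16):
        super().__init__()
        self.base_embeddings = nn.Embedding(num_items, embedding_dim)
        self.time_mlp = nn.Sequential(
            nn.Linear(2 * time_features, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim),
        )
        # Log-spaced frequencies for the sinusoidal time encoding
        self.register_buffer(
            "frequencies", torch.exp(torch.linspace(0, math.log(1000.0), time_features))
        )

    def forward(self, item_ids, timestamps):
        """timestamps: float tensor of shape (batch,) in any consistent unit."""
        base = self.base_embeddings(item_ids)
        phases = timestamps.unsqueeze(-1) * self.frequencies     # (batch, time_features)
        time_enc = torch.cat([torch.sin(phases), torch.cos(phases)], dim=-1)
        return base + self.time_mlp(time_enc)


# Usage example
model = ContinuousTimeEmbedding(num_items=1000, embedding_dim=128)
items = torch.tensor([10, 20, 30])
times = torch.tensor([0.5, 0.5, 2.0])
print(model(items, times).shape)  # torch.Size([3, 128])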

18.2.3 Production Deployment of Dynamic Embeddings

Tip: Streaming Updates at Scale

For systems with millions of users and billions of interactions:

  1. Batch updates: Accumulate interactions over 5-15 minute windows, update in batch
  2. Incremental training: Update only affected embeddings, not full model
  3. Asynchronous updates: Background process updates embeddings while serving layer uses stale (but recent) versions
  4. Versioned embeddings: Maintain multiple versions (current, 5min old, 1hr old) for consistency

Streaming Embedding Service
import asyncio
from collections import deque
from datetime import datetime
import torch


class StreamingEmbeddingService:
    """Real-time embedding service with streaming updates."""

    def __init__(self, model, update_interval_seconds=60):
        self.model = model
        self.update_interval = update_interval_seconds
        self.pending_updates = deque()
        self.last_update = datetime.now()

    async def queue_interaction(self, item_id, interaction_data):
        """Queue interaction for batch update."""
        self.pending_updates.append((item_id, interaction_data))
        if len(self.pending_updates) >= 100 or (datetime.now() - self.last_update).total_seconds() > self.update_interval:
            await self.flush_updates()

    async def flush_updates(self):
        """Apply pending updates in batch."""
        if not self.pending_updates:
            return

        updates = list(self.pending_updates)
        self.pending_updates.clear()

        for item_id, data in updates:
            # Placeholder update: in practice, derive the interaction embedding from `data`
            self.model.update_from_interactions(item_id, torch.randn(128), learning_rate=0.01)

        self.last_update = datetime.now()
        print(f"Flushed {len(updates)} updates")

# Usage example
model = DynamicEmbedding(num_items=1000, embedding_dim=128)
service = StreamingEmbeddingService(model, update_interval_seconds=60)
print("Streaming service initialized for real-time updates")
Streaming service initialized for real-time updates

Warning: Temporal Leakage

When training dynamic embeddings, never use future information to create past embeddings. This temporal leakage leads to unrealistically high accuracy in backtesting but fails in production. Always train with strict time-based splits.
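
A minimal sketch of such a split, assuming interactions are stored as (item_id, timestamp) tuples; any richer record works the same way as long as the split key is the timestamp:

def time_based_split(interactions, cutoff_timestamp):
    """Everything at or before the cutoff is training data; everything after is
    evaluation data. No future interaction ever influences a past embedding."""
    train = [x for x in interactions if x[1] <= cutoff_timestamp]
    test = [x for x in interactions if x[1] > cutoff_timestamp]
    return train, test


# Usage example
events = [(10, 100.0), (20, 250.0), (10, 900.0)]
train, test = time_based_split(events, cutoff_timestamp=500.0)
print(len(train), len(test))  # 2 1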

18.3 Compositional Embeddings for Complex Entities

Real-world entities are rarely atomic—they’re compositions of multiple components:

  • Documents: Title + body + metadata + author + date
  • Products: Category + brand + attributes + reviews + images
  • Users: Demographics + behavior + preferences + context
  • Transactions: Buyer + seller + item + time + location + amount

Compositional embeddings explicitly model these structures, learning how to combine component embeddings into coherent entity representations.

18.3.1 Why Composition Matters

A naive approach: concatenate or average component embeddings. This fails because:

  1. Components have different importance: Product brand matters more than box color
  2. Interactions exist: Laptop + Gaming Category ≠ Laptop + Business Category
  3. Context varies: User embedding should weight differently for recommendations vs. fraud detection

Compositional embeddings learn how to combine components, not just what the components are.

18.3.2 Approaches to Composition

Compositional Embedding
import torch
import torch.nn as nn


class CompositionalEmbedding(nn.Module):
    """Learn to compose embeddings from multiple components using attention."""

    def __init__(self, component_dims, output_dim=128):
        super().__init__()
        self.component_encoders = nn.ModuleList([
            nn.Linear(dim, output_dim) for dim in component_dims
        ])
        self.attention = nn.MultiheadAttention(output_dim, num_heads=4, batch_first=True)
        self.output_proj = nn.Linear(output_dim, output_dim)

    def forward(self, components, component_mask=None):
        """Compose embeddings from multiple components.

        Args:
            components: List of tensors, one per component
            component_mask: Boolean mask for missing components
        """
        # Infer the batch size from the first provided component
        batch_size = next(c.size(0) for c in components if c is not None)
        encoded = []
        for i, comp in enumerate(components):
            if comp is not None:
                encoded.append(self.component_encoders[i](comp))
            else:
                # Missing component: zero placeholder of the projected size
                encoded.append(torch.zeros(batch_size, self.component_encoders[i].out_features))

        stacked = torch.stack(encoded, dim=1)
        attended, _ = self.attention(stacked, stacked, stacked, key_padding_mask=component_mask)
        return self.output_proj(attended.mean(dim=1))


# Usage example
model = CompositionalEmbedding(component_dims=[64, 128, 32], output_dim=128)
components = [torch.randn(16, 64), torch.randn(16, 128), torch.randn(16, 32)]
composed = model(components)
print(f"Composed embedding shape: {composed.shape}")
Composed embedding shape: torch.Size([16, 128])

18.3.3 Task-Specific Composition Weights

A powerful extension: learn different composition weights for different tasks.

Task-Adaptive Composition
import torch
import torch.nn as nn


class TaskAdaptiveComposition(nn.Module):
    """Learn task-specific composition weights for multi-component entities."""

    def __init__(self, num_components, embedding_dim, num_tasks=3):
        super().__init__()
        self.component_embeddings = nn.ModuleList([
            nn.Embedding(1000, embedding_dim) for _ in range(num_components)
        ])
        self.task_weights = nn.Embedding(num_tasks, num_components)
        nn.init.uniform_(self.task_weights.weight, 0, 1)

    def forward(self, component_ids, task_id):
        """Compose embeddings with task-specific weights."""
        component_embs = [enc(ids) for enc, ids in zip(self.component_embeddings, component_ids)]
        stacked = torch.stack(component_embs, dim=1)

        weights = torch.softmax(self.task_weights(task_id), dim=-1)
        weighted = stacked * weights.unsqueeze(-1)
        return weighted.sum(dim=1)


# Usage example
model = TaskAdaptiveComposition(num_components=3, embedding_dim=64, num_tasks=3)
comp_ids = [torch.tensor([10]), torch.tensor([20]), torch.tensor([30])]
task_id = torch.tensor([1])
composed = model(comp_ids, task_id)
print(f"Task-adaptive composed embedding: {composed.shape}")
Task-adaptive composed embedding: torch.Size([1, 64])

Tip: Handling Missing Components

Real-world data often has missing components (products without images, documents without abstracts). Use attention with component masks to handle missing data gracefully—the model automatically re-weights remaining components.
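
A minimal usage sketch with the CompositionalEmbedding class above, marking one component as missing for part of the batch via the attention's key padding mask (True means the component is ignored as an attention key):

import torch

model = CompositionalEmbedding(component_dims=[64, 128, 32], output_dim=128)
components = [torch.randn(16, 64), torch.randn(16, 128), torch.randn(16, 32)]

# Suppose the third component (e.g. an image embedding) is missing for the
# first four entities in the batch.
mask = torch.zeros(16, 3, dtype=torch.bool)
mask[:4, 2] = True

composed = model(components, component_mask=mask)
print(composed.shape)  # torch.Size([16, 128])

Masked components are excluded from attention as keys, so the remaining components are automatically re-weighted for those rows.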

18.4 Uncertainty Quantification in Embeddings

Embedding systems make high-stakes decisions: loan approvals, medical diagnoses, autonomous vehicle navigation. A confidence score is as important as the prediction itself. Uncertainty quantification tells us when to trust an embedding-based decision and when to defer to human judgment or request more information.

18.4.1 Sources of Uncertainty

  1. Aleatoric uncertainty: Inherent noise in data (e.g., blurry images, ambiguous text)
  2. Epistemic uncertainty: Model’s lack of knowledge (e.g., never seen this type of input before)
  3. Distribution shift: Input differs from training distribution

Standard embeddings provide point estimates with no uncertainty. We need probabilistic embeddings that capture confidence.

18.4.2 Approaches to Uncertainty Quantification

Probabilistic Embedding
import torch
import torch.nn as nn


class ProbabilisticEmbedding(nn.Module):
    """Embeddings with uncertainty quantification using variational approach."""

    def __init__(self, num_items, embedding_dim):
        super().__init__()
        self.mean_embeddings = nn.Embedding(num_items, embedding_dim)
        self.logvar_embeddings = nn.Embedding(num_items, embedding_dim)

    def forward(self, item_ids, num_samples=1):
        """Sample from embedding distribution."""
        mean = self.mean_embeddings(item_ids)
        logvar = self.logvar_embeddings(item_ids)
        std = torch.exp(0.5 * logvar)

        if num_samples == 1:
            eps = torch.randn_like(std)
            return mean + eps * std, std
        else:
            samples = []
            for _ in range(num_samples):
                eps = torch.randn_like(std)
                samples.append(mean + eps * std)
            return torch.stack(samples), std

    def uncertainty(self, item_ids):
        """Get uncertainty scores."""
        logvar = self.logvar_embeddings(item_ids)
        return torch.exp(0.5 * logvar).mean(dim=-1)


# Usage example
model = ProbabilisticEmbedding(num_items=1000, embedding_dim=128)
items = torch.tensor([10, 20, 30])
embeddings, uncertainty = model(items)
uncertainty_scores = model.uncertainty(items)
print(f"Embeddings: {embeddings.shape}, Uncertainty: {uncertainty_scores}")
Embeddings: torch.Size([3, 128]), Uncertainty: tensor([1.1180, 1.0743, 1.0795], grad_fn=<MeanBackward1>)

Warning: Calibration is Critical

Uncertainty estimates must be calibrated: if the model says 80% confidence, it should be correct 80% of the time. Uncalibrated uncertainty is misleading and dangerous. Always validate on held-out test set and use temperature scaling or Platt scaling to calibrate.
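
A minimal temperature-scaling sketch, assuming a downstream classifier built on the embeddings produces logits and labels on a held-out validation set (the variable names below are placeholders):

import torch
import torch.nn as nn


def fit_temperature(val_logits, val_labels, lr=0.01, steps=200):
    """Learn a single temperature T minimizing NLL on the validation set;
    calibrated probabilities are softmax(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()


# Usage example (random stand-ins for real validation outputs)
val_logits = torch.randn(256, 5)
val_labels = torch.randint(0, 5, (256,))
temperature = fit_temperature(val_logits, val_labels)
print(f"Fitted temperature: {temperature:.3f}")

Temperature scaling only rescales confidence; it never changes which class is ranked highest.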

Tip: When to Use Uncertainty Quantification

Essential for:

  • High-stakes decisions: Healthcare, finance, autonomous systems, legal
  • Out-of-distribution detection: Detect when input differs from training data
  • Active learning: Select most informative examples to label next
  • Trustworthy AI: Provide confidence scores to users

Not necessary for:

  • Low-stakes applications (music recommendations, article suggestions)
  • Internal R&D where errors are acceptable
  • Applications with human-in-the-loop review anyway

18.5 Federated Embedding Learning

Many organizations have valuable data they cannot share: medical records, financial transactions, personal communications. Federated learning enables training embeddings across multiple data silos without centralizing the data. Each participant trains locally and shares only model updates, preserving privacy.

18.5.1 The Federated Learning Paradigm

Traditional centralized training:

  1. Collect all data in one place
  2. Train embedding model
  3. Deploy to all clients

Problem: Data cannot be centralized due to privacy, regulations (GDPR, HIPAA), competitive concerns, or data volume.

Federated training:

  1. Each client trains on local data
  2. Clients share model updates (gradients, embeddings)
  3. Central server aggregates updates
  4. Repeat until convergence

Federated Embedding Server
import torch
import torch.nn as nn


class FederatedEmbeddingServer:
    """Central server for federated embedding learning."""

    def __init__(self, global_model, num_clients=10):
        self.global_model = global_model
        self.num_clients = num_clients
        self.client_weights = [1.0 / num_clients] * num_clients

    def aggregate_updates(self, client_models):
        """Aggregate model updates from clients using weighted average."""
        global_dict = self.global_model.state_dict()

        for key in global_dict.keys():
            global_dict[key] = torch.zeros_like(global_dict[key])
            for i, client_model in enumerate(client_models):
                client_dict = client_model.state_dict()
                global_dict[key] += self.client_weights[i] * client_dict[key]

        self.global_model.load_state_dict(global_dict)

    def distribute_model(self):
        """Send updated global model to clients."""
        return self.global_model.state_dict()


# Usage example
global_model = nn.Embedding(1000, 128)
server = FederatedEmbeddingServer(global_model, num_clients=5)
print("Federated server initialized for distributed training")
Federated server initialized for distributed training
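
The server above only aggregates; a minimal sketch of the client side and one full round, assuming each client holds an iterable of (item_ids, targets) batches and a task-specific loss function (both placeholders):

import copy

import torch
import torch.nn as nn


def local_training(local_model, local_batches, local_loss, lr=0.01, epochs=1):
    """Train a local copy of the global model on private data; only the updated
    weights leave the client, never the raw data."""
    optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)
    for _ in range(epochs):
        for item_ids, targets in local_batches:
            optimizer.zero_grad()
            loss = local_loss(local_model(item_ids), targets)
            loss.backward()
            optimizer.step()
    return local_model


def federated_round(server, clients_data, local_loss):
    """One round: distribute the global model, train locally on each client,
    then aggregate the updated client models on the server."""
    client_models = []
    for batches in clients_data:
        local_model = copy.deepcopy(server.global_model)
        local_model.load_state_dict(server.distribute_model())  # explicit distribute step
        client_models.append(local_training(local_model, batches, local_loss))
    server.aggregate_updates(client_models)


# Usage example with a toy squared-error objective on random private batches
server = FederatedEmbeddingServer(nn.Embedding(1000, 128), num_clients=2)
toy_loss = lambda emb, target: ((emb - target) ** 2).mean()
clients_data = [
    [(torch.randint(0, 1000, (8,)), torch.randn(8, 128))] for _ in range(2)
]
federated_round(server, clients_data, toy_loss)
print("Completed one federated round")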

18.5.2 Privacy-Preserving Techniques

1. Differential Privacy: Add calibrated noise to updates

Differentially Private Embedding
import torch
import torch.nn as nn


class DifferentiallyPrivateEmbedding:
    """Add differential privacy noise to embeddings for privacy preservation."""

    def __init__(self, model, epsilon=1.0, delta=1e-5):
        self.model = model
        self.epsilon = epsilon
        self.delta = delta
        self.sensitivity = 1.0

    def add_noise(self, gradients):
        """Add calibrated Gaussian noise for differential privacy."""
        sigma = (self.sensitivity * torch.sqrt(2 * torch.log(torch.tensor(1.25 / self.delta)))) / self.epsilon
        noisy_gradients = {}
        for key, grad in gradients.items():
            noise = torch.randn_like(grad) * sigma
            noisy_gradients[key] = grad + noise
        return noisy_gradients

    def private_train_step(self, batch, optimizer):
        """Training step with differential privacy."""
        optimizer.zero_grad()
        loss = self.model(batch)  # assumes the wrapped model returns a scalar loss for the batch
        loss.backward()

        # Bound sensitivity by clipping the gradient norm before adding noise
        # (formal DP-SGD guarantees require per-example clipping; batch-level
        # clipping is shown here for simplicity)
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.sensitivity)

        gradients = {name: param.grad.clone() for name, param in self.model.named_parameters() if param.grad is not None}
        noisy_grads = self.add_noise(gradients)

        for name, param in self.model.named_parameters():
            if name in noisy_grads:
                param.grad = noisy_grads[name]

        optimizer.step()
        return loss.item()


# Usage example
model = nn.Embedding(1000, 128)
dp_trainer = DifferentiallyPrivateEmbedding(model, epsilon=1.0)
print(f"DP training with epsilon={dp_trainer.epsilon}")
DP training with epsilon=1.0

2. Secure Aggregation: Mask updates with additive secret shares so the server only sees the aggregate, never an individual client's update

Secure Aggregation
import torch


class SecureAggregation:
    """Secure aggregation using secret sharing for federated learning."""

    def __init__(self, num_clients):
        self.num_clients = num_clients

    def add_secret_shares(self, model_update):
        """Add secret shares to model update for secure aggregation."""
        shares = []
        for _ in range(self.num_clients - 1):
            share = {k: torch.randn_like(v) for k, v in model_update.items()}
            shares.append(share)

        final_share = {}
        for key in model_update.keys():
            final_share[key] = model_update[key] - sum(s[key] for s in shares)

        shares.append(final_share)
        return shares

    def aggregate_shares(self, client_shares):
        """Aggregate secret shares to recover sum without revealing individual updates."""
        aggregated = {}
        first_client = client_shares[0]

        for key in first_client.keys():
            aggregated[key] = sum(client[key] for client in client_shares)

        return aggregated


# Usage example
secure_agg = SecureAggregation(num_clients=5)
update = {'embeddings': torch.randn(100, 128)}
shares = secure_agg.add_secret_shares(update)
reconstructed = secure_agg.aggregate_shares(shares)
print(f"Secure aggregation with {secure_agg.num_clients} clients")
Secure aggregation with 5 clients

Tip: Federated Learning vs. Centralized

Use federated learning when:

  • Data cannot be centralized (privacy, regulations, size)
  • Multiple organizations want to collaborate without sharing data
  • Data is naturally distributed (mobile devices, edge servers)

Use centralized learning when:

  • Data can be legally and practically centralized
  • Single organization owns all data
  • Communication costs are prohibitive
  • Need fastest possible training

Warning: Communication Bottleneck

Federated learning requires multiple rounds of communication between clients and server. For large models, this can be slower than centralized training even though computation is distributed. Optimize communication:

  1. Model compression: Send compressed updates (quantization, sparsification); a minimal sparsification sketch follows this list
  2. Fewer rounds: More local epochs per round
  3. Client sampling: Not all clients participate each round
  4. Asynchronous updates: Don’t wait for slowest client
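
A minimal sketch of point 1, top-k sparsification of a model update, where each client transmits only the largest-magnitude entries per tensor (the keep_fraction value is illustrative):

import torch


def sparsify_update(update, keep_fraction=0.01):
    """Return {name: (indices, values, shape)} keeping only the largest-magnitude entries."""
    compressed = {}
    for name, tensor in update.items():
        flat = tensor.flatten()
        k = max(1, int(flat.numel() * keep_fraction))
        _, indices = torch.topk(flat.abs(), k)
        compressed[name] = (indices, flat[indices], tensor.shape)
    return compressed


def densify_update(compressed):
    """Rebuild dense tensors from the sparse representation (zeros elsewhere)."""
    dense = {}
    for name, (indices, values, shape) in compressed.items():
        flat = torch.zeros(int(torch.tensor(shape).prod()))
        flat[indices] = values
        dense[name] = flat.reshape(shape)
    return dense


# Usage example: roughly 99% fewer values transmitted for this update
update = {"embeddings": torch.randn(1000, 128)}
compressed = sparsify_update(update, keep_fraction=0.01)
print(compressed["embeddings"][1].numel(), "of", update["embeddings"].numel(), "values sent")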

18.6 Key Takeaways

  • Hierarchical embeddings in hyperbolic space preserve taxonomic structure with 20-100x lower dimensionality than Euclidean embeddings, essential for product catalogs, knowledge graphs, and organizational structures

  • Dynamic embeddings capture temporal evolution of entities, critical for user preferences, document relevance, and any domain where meanings shift over time

  • Compositional embeddings explicitly model multi-component entities (products with categories/brands/reviews, documents with title/body/metadata), learning task-specific combination strategies

  • Uncertainty quantification provides confidence scores for embedding-based decisions, essential for high-stakes applications in healthcare, finance, and autonomous systems where knowing when not to trust a prediction is as important as the prediction itself

  • Federated learning enables training embeddings across data silos without centralizing data, crucial for privacy-sensitive domains like healthcare, finance, and cross-organizational collaboration

  • Advanced techniques are not always necessary—use them when your application has specific requirements (hierarchy, temporal dynamics, privacy constraints) that standard embeddings cannot address

  • Production deployment requires careful engineering: streaming updates for dynamic embeddings, calibration for uncertainty, secure communication for federated learning

18.7 Looking Ahead

This concludes Part II on Custom Embedding Development. We’ve progressed from basic custom embeddings (Chapter 14) through sophisticated training techniques (contrastive learning, Siamese networks, self-supervised learning) to advanced methods for specialized scenarios.

Part III begins with Chapter 19, shifting focus from developing embeddings to deploying them in production. We’ll explore MLOps practices, real-time vs. batch processing, versioning strategies, and monitoring embedding systems at scale.

18.8 Further Reading

18.8.1 Hierarchical Embeddings

  • Nickel & Kiela (2017). “Poincaré Embeddings for Learning Hierarchical Representations.” NeurIPS.
  • Sala et al. (2018). “Representation Tradeoffs for Hyperbolic Embeddings.” ICML.
  • Dhingra et al. (2018). “Embedding Text in Hyperbolic Spaces.” Workshop on Structured Prediction for NLP.

18.8.2 Dynamic Embeddings

  • Rudolph & Blei (2018). “Dynamic Embeddings for Language Evolution.” WWW.
  • Yao et al. (2018). “Dynamic Word Embeddings for Evolving Semantic Discovery.” WSDM.
  • Trivedi et al. (2019). “DyRep: Learning Representations over Dynamic Graphs.” ICLR.

18.8.3 Compositional Embeddings

  • Mitchell & Lapata (2010). “Composition in Distributional Models of Semantics.” Cognitive Science.
  • Socher et al. (2013). “Recursive Deep Models for Semantic Compositionality.” EMNLP.
  • Yu & Dredze (2015). “Learning Composition Models for Phrase Embeddings.” TACL.

18.8.4 Uncertainty Quantification

  • Kendall & Gal (2017). “What Uncertainties Do We Need in Bayesian Deep Learning?” NeurIPS.
  • Lakshminarayanan et al. (2017). “Simple and Scalable Predictive Uncertainty Estimation.” NeurIPS.
  • Malinin & Gales (2018). “Predictive Uncertainty Estimation via Prior Networks.” NeurIPS.

18.8.5 Federated Learning

  • McMahan et al. (2017). “Communication-Efficient Learning of Deep Networks from Decentralized Data.” AISTATS.
  • Li et al. (2020). “Federated Optimization in Heterogeneous Networks.” MLSys.
  • Kairouz et al. (2021). “Advances and Open Problems in Federated Learning.” Foundations and Trends in Machine Learning.
  • Abadi et al. (2016). “Deep Learning with Differential Privacy.” CCS.