Embedding-Based Anomaly Detection for Observability

A comprehensive 9-part tutorial series on building production-ready anomaly detection systems using ResNet embeddings for OCSF (Open Cybersecurity Schema Framework) observability data.

What you’ll learn: How to build, train, and deploy a custom embedding model (TabularResNet) specifically designed for OCSF observability data. This model transforms observability logs and system metrics into vector representations. Anomaly detection happens entirely through vector database similarity search—no separate detection model needed. The system processes streaming OCSF events in near real-time to automatically identify unusual behavior.

Series Overview

This tutorial series takes you from ResNet fundamentals to deploying and monitoring a complete anomaly detection system in production. You’ll learn how to:

Build and train a custom TabularResNet embedding model using self-supervised learning on unlabeled OCSF logs
Deploy the custom embedding model as a FastAPI service for near real-time inference
Store embeddings in a vector database for fast k-NN similarity search
Detect anomalies purely through vector DB operations (k-NN distance scoring, with no separate deep learning detection model)
Monitor embedding quality and trigger automated retraining of the embedding model when drift is detected

Target Audience: ML engineers, operations engineers, and data scientists working with observability data

Applicability: While this series uses OCSF observability logs as the running example, the TabularResNet embedding approach applies to any structured observability data:

Telemetry/Metrics: Time-series data (CPU%, memory, latency) with metadata (host, service, region) → convert to tabular rows
Configuration data: Key-value pairs, settings, deployment configs → naturally tabular
Distributed traces: Span attributes (service, duration, status_code, error) → tabular features per span
Application logs: JSON logs, syslog, custom formats → any structured schema works

The key requirement: Your data can be represented as rows with categorical and numerical features. If you can create a pandas DataFrame from your data, you can use this approach.
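
As a minimal sketch of what "rows with categorical and numerical features" can look like, the snippet below flattens one nested event into a single flat row using only the standard library. The event contents, the `duration_ms` field, and the `CATEGORICAL` / `NUMERICAL` lists are hypothetical and chosen only for illustration.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with dots."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical event: the values are made up for illustration.
event = {
    "class_uid": 3002,
    "activity_id": 1,
    "actor": {"user": {"name": "alice"}},
    "duration_ms": 42.0,
}

row = flatten(event)

# Assumed split of columns; a real pipeline would define these per schema.
CATEGORICAL = ["class_uid", "activity_id", "actor.user.name"]
NUMERICAL = ["duration_ms"]

categorical_values = [row[name] for name in CATEGORICAL]
numerical_values = [row[name] for name in NUMERICAL]
```

A list of such flat rows can be passed to `pandas.DataFrame(...)` to get one column per feature.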

Prerequisites:

Basic Python and PyTorch
Understanding of neural networks (or complete our Neural Networks From Scratch series first)

Key Terms (explained in detail throughout the series):

Embeddings: Dense numerical vectors that capture the essence of complex data (like converting an observability event into a list of numbers)
Self-supervised learning: Training a model without labeled data by creating learning tasks from the data itself
Vector database: A specialized database for storing and quickly searching through embeddings based on similarity
ResNet: A deep learning architecture that uses “residual connections” to train very deep networks effectively
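
To make the self-supervised learning term concrete, here is a small sketch of how a learning task can be built from unlabeled rows: hide some features and keep their original values as prediction targets. The `MASK` token, the `make_masked_example` helper, and the example row are assumptions for illustration, not the series' actual implementation.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder token for a hidden feature


def make_masked_example(row: dict, mask_prob: float, rng: random.Random):
    """Return (masked_row, targets): hidden features become the labels."""
    masked, targets = {}, {}
    for name, value in row.items():
        if rng.random() < mask_prob:
            masked[name] = MASK      # the model sees only the mask token
            targets[name] = value    # and must predict the original value
        else:
            masked[name] = value
    return masked, targets


rng = random.Random(0)  # seeded so the sketch is reproducible
row = {"class_uid": 3002, "activity_id": 1, "actor.user.name": "alice", "status": "Success"}
masked_row, targets = make_masked_example(row, mask_prob=0.5, rng=rng)
```

No human labels are involved: the supervision signal comes entirely from the values that were hidden.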

Why OCSF?

Without OCSF, you would need separate models for each log format:

AWS CloudTrail: eventSource, eventName, userIdentity.arn
Okta: actor.displayName, outcome.result, target[].type
Linux auditd: syscall, exe, auid, comm

With OCSF, all sources map to the same schema (class_uid, activity_id, actor.user.name), enabling one embedding model to work across all OCSF-compliant sources.
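
The idea can be sketched as a pair of source-specific normalizers feeding one shared feature extractor. The raw field names come from the lists above; the numeric IDs, the sample payloads, and all function names are placeholders for illustration, so check the OCSF schema for the real class and activity values.

```python
# Placeholder IDs for illustration only; look up real values in the OCSF schema.
CLASS_API_ACTIVITY = 6003
CLASS_AUTHENTICATION = 3002
ACTIVITY_OTHER = 99


def from_cloudtrail(raw: dict) -> dict:
    """Map a CloudTrail-style record onto the shared OCSF-like shape."""
    return {
        "class_uid": CLASS_API_ACTIVITY,
        "activity_id": ACTIVITY_OTHER,
        "actor": {"user": {"name": raw["userIdentity"]["arn"]}},
    }


def from_okta(raw: dict) -> dict:
    """Map an Okta-style record onto the same shape."""
    return {
        "class_uid": CLASS_AUTHENTICATION,
        "activity_id": ACTIVITY_OTHER,
        "actor": {"user": {"name": raw["actor"]["displayName"]}},
    }


def extract_features(event: dict) -> tuple:
    """One extractor works for every source once events share a schema."""
    return (event["class_uid"], event["activity_id"], event["actor"]["user"]["name"])


# Made-up sample payloads.
cloudtrail_raw = {"eventName": "GetObject", "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"}}
okta_raw = {"actor": {"displayName": "Alice"}, "outcome": {"result": "SUCCESS"}}

features = [
    extract_features(from_cloudtrail(cloudtrail_raw)),
    extract_features(from_okta(okta_raw)),
]
```

Only the normalizers know about vendor formats; everything downstream, including the embedding model, sees a single schema.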

Tutorial Series

Part 1: Understanding ResNet Architecture

Learn the core concepts behind Residual Networks:

The degradation problem in deep networks
Skip connections and why they work
Gradient flow visualization
Architecture patterns (basic and bottleneck blocks)

Foundation · 35 min read

Part 2: Adapting ResNet for Tabular Data

Adapt ResNet for observability data:

Replace convolutions with linear layers
Categorical embeddings for high-cardinality features
Complete TabularResNet implementation
Design considerations for OCSF data

Architecture · 30 min read

Part 3: Feature Engineering for OCSF Data

Transform OCSF JSON to model input:

Flattening nested JSON structures
Temporal and derived features
Aggregation and rolling windows
High cardinality handling
End-to-end feature pipeline

Data Engineering · 40 min read

Part 4: Self-Supervised Training

Train on unlabeled data:

Masked Feature Prediction (MFP)
Contrastive learning with augmentation
Complete training pipeline
Hyperparameter tuning strategies

Training · 35 min read

Part 5: Evaluating Embedding Quality

Validate embedding quality before deployment:

t-SNE and UMAP visualization
Cluster quality metrics (Silhouette, Davies-Bouldin)
Embedding robustness testing
Production readiness checklist

Verification · 30 min read

Part 6: Anomaly Detection Methods

Apply detection algorithms:

Local Outlier Factor (LOF)
Isolation Forest
Distance-based methods
Sequence anomaly detection (LSTMs)
Method comparison framework

Detection · 40 min read

Part 7: Production Deployment

Deploy to production:

REST API with FastAPI
Docker containerization
Model versioning with MLflow
A/B testing framework
Real-time vs batch inference

Deployment · 45 min read

Part 8: Production Monitoring

Monitor and maintain the system:

Embedding drift detection
Alert quality metrics
Automated retraining triggers
Incident response tools
Cost optimization

Monitoring · 35 min read

Part 9: Multi-Source Correlation

Extend to multiple data sources for root cause analysis:

Training separate models for logs, metrics, traces, config
Unified vector database with metadata tags
Temporal correlation across sources
Causal graph construction
Automated root cause ranking

Advanced · 50 min read

Complete Series

Total: ~6 hours of comprehensive, hands-on content

All code examples are executable and production-ready.

Appendices

Appendix: Notebooks & Sample Data

Run the tutorial hands-on:

Pre-generated OCSF sample data
Jupyter notebooks for Parts 3-7
Docker one-liner for Jupyter environment
Minimal setup: just download and run

Hands-on · 5 min setup

Appendix: Generating Training Data

Generate your own data:

Docker Compose stack with web-api, auth, payment services
OpenTelemetry for unified telemetry collection
Load generator with anomaly scenarios
OCSF converter for logs, traces, metrics

Data Generation · 15 min setup

Notebook: Feature Engineering

Load OCSF data and extract features:

Parse parquet files with pandas
Build categorical and numerical feature sets
Create training-ready datasets

Hands-on · Part 3

Notebook: Self-Supervised Training

Train TabularResNet with contrastive learning:

Implement data augmentation strategies
Configure training hyperparameters
Monitor training progress

Hands-on · Part 4

Notebook: Embedding Evaluation

Evaluate embedding quality before deployment:

t-SNE/UMAP visualization
Cluster quality metrics (Silhouette, Davies-Bouldin)
Nearest neighbor inspection
Production readiness report

Hands-on · Part 5

Notebook: Model Inference

Load trained model and generate embeddings:

Save and load model checkpoints
Run inference on new data
Extract embedding vectors

Hands-on · Part 7

Notebook: Anomaly Detection

Compare anomaly detection methods:

k-NN distance scoring
Local Outlier Factor (LOF)
Isolation Forest

Hands-on · Part 6

What You’ll Build

By the end of this series, you’ll have:

Custom TabularResNet Embedding Model: Trained from scratch on your OCSF data using self-supervised learning
Embedding Service: FastAPI REST API that serves the custom TabularResNet model, generating embeddings for OCSF events via HTTP requests
Vector Database: Stores embeddings and performs k-NN similarity search at scale
Vector-Based Anomaly Detection: Detection through pure vector DB operations (k-NN distance, density), with no separate deep learning detection model
Monitoring & Alerting: Track embedding drift, detection quality, and system health
Automated Retraining: Triggers retraining of the custom embedding model based on drift and performance degradation

Optional Extension (Part 9): For advanced production deployments, extend the system to correlate anomalies across multiple observability data sources (logs, metrics, traces, configuration changes) for automated root cause analysis.

System Architecture

This diagram shows the complete end-to-end system you’ll build. OCSF events stream in near real-time through the following pipeline:

Preprocessing: Extract and normalize features from each OCSF event
Embedding generation: TabularResNet (the only ML model) generates a vector for each event
Vector DB storage: Embeddings are indexed for fast k-NN similarity search
Anomaly scoring: Simple code logic computes scores using vector DB distances—NOT a separate ML model, just threshold-based calculations
Alerting: Trigger alerts for high-scoring anomalies
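
The anomaly scoring step in the pipeline above can be sketched as plain code with no trained model involved. This example uses a brute-force search over a small in-memory list as a stand-in for the vector database; the `knn_score` helper, the sample vectors, and the `THRESHOLD` value are assumptions for illustration.

```python
import math


def knn_score(query, reference, k=3):
    """Mean Euclidean distance from `query` to its k nearest reference vectors."""
    distances = sorted(math.dist(query, ref) for ref in reference)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Stand-in for embeddings of known-normal events stored in the vector DB.
normal = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]

THRESHOLD = 1.0  # assumed value; in practice it would be tuned on real data

typical = knn_score([0.05, 0.05], normal)   # sits inside the normal cluster
unusual = knn_score([3.0, 4.0], normal)     # far from every stored embedding

is_anomaly = unusual > THRESHOLD
```

An event whose embedding is far from its nearest neighbors gets a high score, and the alerting step only needs to compare that score with the threshold.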

The monitoring components (shown in red/purple) continuously track embedding drift and system health, triggering automatic retraining of the embedding model when needed.
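
One simple way such a drift check could work, shown here only as a sketch, is to compare the centroid of recent embeddings against a baseline centroid and flag drift when they diverge. The series does not specify this particular metric; the cosine-distance choice, the function names, and the `threshold` default are all assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def drift_detected(baseline, recent, threshold=0.1):
    """Flag drift when the recent centroid moves away from the baseline."""
    return cosine_distance(centroid(baseline), centroid(recent)) > threshold


# Made-up embeddings for illustration.
baseline = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
stable = [[0.95, 0.05], [1.0, 0.0]]
shifted = [[0.1, 1.0], [0.0, 0.9]]

retrain_on_stable = drift_detected(baseline, stable)
retrain_on_shifted = drift_detected(baseline, shifted)
```

A periodic job could run a check like this on a sliding window of recent embeddings and start a retraining run when it returns `True`.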

Key architectural point:

What we deploy: A custom TabularResNet embedding model trained on your OCSF data
What we DON’T deploy: A separate deep learning model for anomaly detection (no classifier, predictor, or scoring model)
How detection works: Pure vector database operations (k-NN distance calculations, density estimation)

Diagram legend:

Solid arrows (→): Near real-time data flow for each OCSF event
Dotted arrows (⇢): Monitoring and feedback loops (periodic checks)
Colors: Blue=Data input, Green=Embedding model (only ML model), Yellow=Vector storage, Orange=Scoring logic (not a model), Red/Purple=Monitoring

Key Concepts

Why ResNet for Tabular Data?

Research by Gorishniy et al. (2021) found that ResNet:

Competes with Transformers on tabular benchmarks
Simpler architecture: No attention mechanism
Better efficiency: Cost grows linearly with the number of input features, whereas self-attention over features grows quadratically
Strong baseline: Try before complex models
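
For readers new to the idea, the core of a tabular residual block is small enough to sketch in plain Python: two linear layers with a ReLU in between, plus a skip connection that adds the input back. This is a simplified illustration that omits the normalization and dropout a practical block would typically include, and the helper names are hypothetical.

```python
def linear(x, weights, bias):
    """y = W x + b for one input vector; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, b1, w2, b2):
    """Compute x + F(x), where F is linear -> ReLU -> linear."""
    hidden = relu(linear(x, w1, b1))
    update = linear(hidden, w2, b2)
    return [xi + ui for xi, ui in zip(x, update)]


x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
no_bias = [0.0, 0.0]

# With all-zero weights F(x) is zero, so the block passes x through unchanged.
identity_out = residual_block(x, zeros, no_bias, zeros, no_bias)

# With non-zero weights the block adds a learned correction on top of x.
w = [[1.0, 0.0], [0.0, 1.0]]
adjusted_out = residual_block(x, w, no_bias, w, no_bias)
```

The skip connection means each block only has to learn a correction to its input, which is what keeps deep stacks of these blocks trainable.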

Why Embeddings for Anomaly Detection?

Embeddings compress high-dimensional OCSF data (300+ fields) into dense vectors that:

Capture semantic relationships
Enable efficient distance calculations
Support multiple detection algorithms
Generalize to new anomaly types

Why a Vector Database?

A vector database makes similarity search the central mechanism for anomaly detection by:

Storing and indexing embeddings for fast nearest-neighbor queries
Enabling k-NN distance scoring, density estimation, and thresholding at scale
Supporting incremental updates as new normal behavior arrives
Providing consistent retrieval for both batch and near real-time pipelines
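
The role of incremental updates can be illustrated with a toy in-memory index. A real vector database adds persistence and approximate nearest-neighbor indexing on top of this; the `InMemoryVectorIndex` class and its `add` / `query` methods are hypothetical stand-ins and do not mirror any specific product's API.

```python
import math


class InMemoryVectorIndex:
    """Toy brute-force index: stores vectors and returns nearest neighbors."""

    def __init__(self):
        self._items = []  # list of (item_id, vector) pairs

    def add(self, item_id, vector):
        self._items.append((item_id, list(vector)))

    def query(self, vector, k=1):
        """Return up to k (item_id, distance) pairs, nearest first."""
        scored = sorted((math.dist(vector, v), item_id) for item_id, v in self._items)
        return [(item_id, distance) for distance, item_id in scored[:k]]


index = InMemoryVectorIndex()
index.add("evt-1", [0.0, 0.0])
index.add("evt-2", [0.0, 1.0])

probe = [5.0, 5.0]
before = index.query(probe, k=1)[0][1]   # far from everything seen so far

# New normal behavior arrives and is indexed without retraining anything.
index.add("evt-3", [5.0, 5.1])
nearest_id, after = index.query(probe, k=1)[0]
```

Once similar behavior has been stored as normal, the same probe is close to a neighbor and no longer stands out.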

Code Repository

All code from this series is available in executable notebooks. Each part includes:

Runnable code cells: Test concepts immediately
Visualizations: Understand embeddings and anomalies
Production examples: Real-world deployment patterns

Prerequisites

Neural Networks From Scratch - Learn NN fundamentals

Related Tutorials

Alternating Least Squares (ALS) - Matrix factorization
Latent Factors - Understanding embeddings
Softmax - From scores to probabilities

Get Started

Ready to build your anomaly detection system? Start with Part 1: Understanding ResNet Architecture!

Want to jump straight to hands-on code? See Appendix: Notebooks & Sample Data to download notebooks and sample data.
