Problem: You want to follow along with hands-on code without reading through all the tutorial text.
Solution: This appendix provides Jupyter notebooks and sample OCSF data that you can run immediately.
Quick Start¶
1. Download the notebooks and sample data below.
2. Install dependencies:
   ```bash
   pip install pandas numpy torch scikit-learn matplotlib pyarrow
   ```
3. Open the notebooks in Jupyter and run the cells.
Download Sample Data¶
Pre-generated OCSF data (~3 hours of synthetic observability events):
Sample data files are in the data/ directory; a quick loading check is sketched after the lists below.
Contents:
- `ocsf_logs.parquet`: ~27,000 application log events (59 columns)
- `ocsf_traces.parquet`: ~2,800 distributed trace spans (17 columns)
- `ocsf_metrics.parquet`: ~7,000 metric data points (33 columns)
- `ocsf_eval_subset.parquet`: 1,000 labeled events for evaluation (~2% anomaly rate)
Anomaly types in the data:
Cache miss storms
Database timeouts
Memory leaks
Slow queries
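To sanity-check the download, the files load directly with pandas. A minimal sketch; the label column in the eval subset is not documented above, so inspect the schema rather than assuming a name:

```python
import pandas as pd

# Paths assume the data/ directory layout listed above.
logs = pd.read_parquet("data/ocsf_logs.parquet")
traces = pd.read_parquet("data/ocsf_traces.parquet")
metrics = pd.read_parquet("data/ocsf_metrics.parquet")
eval_df = pd.read_parquet("data/ocsf_eval_subset.parquet")

print(logs.shape)     # expect roughly (27000, 59)
print(traces.shape)   # expect roughly (2800, 17)
print(metrics.shape)  # expect roughly (7000, 33)

# The eval subset carries anomaly labels (~2% positive); the label column
# name is not documented above, so inspect the schema before using it.
print(eval_df.columns.tolist())
```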
Notebooks¶
View the executed notebooks with output, or download to run yourself:
| Notebook | Description | Prerequisites |
|---|---|---|
| Feature Engineering | Load OCSF data, extract temporal features, encode for ML | Sample data |
| Self-Supervised Training | Train TabularResNet with contrastive learning | Part 3 output |
| Embedding Evaluation | Evaluate embedding quality with metrics and visualization | Part 4 output |
| Anomaly Detection | Compare k-NN, LOF, Isolation Forest detection | Part 4 output |
| Model Inference | Load trained model and generate embeddings for new data | Part 4 output |
Notebook source files are in the notebooks/ directory.
Notebook Workflow¶
What Each Notebook Covers¶
03-feature-engineering.md¶
Goal: Transform raw OCSF data into feature arrays for TabularResNet.
Key steps:
Load OCSF parquet data
Explore schema (59 columns with nested objects flattened)
Extract temporal features (hour, day, cyclical sin/cos encoding)
Select categorical and numerical feature subsets
Handle missing values
Encode with LabelEncoder + StandardScaler (sketched below)
Output: numerical_features.npy, categorical_features.npy, feature_artifacts.pkl
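For reference, here is a minimal sketch of the temporal and encoding steps above. The column names (`time`, `severity`, `duration`) are placeholders, not the notebook's exact schema:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_parquet("data/ocsf_logs.parquet")

# Temporal features: cyclical sin/cos encoding maps hour-of-day onto a
# circle, so 23:00 and 00:00 end up close together.
# Treating "time" as epoch milliseconds is an assumption about the schema.
ts = pd.to_datetime(df["time"], unit="ms")
df["hour"] = ts.dt.hour
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Categoricals -> integer codes, numericals -> zero mean / unit variance.
# "severity" and "duration" are placeholder column names.
cat = LabelEncoder().fit_transform(df["severity"].fillna("unknown").astype(str))
num = StandardScaler().fit_transform(df[["duration", "hour_sin", "hour_cos"]].fillna(0.0))

np.save("categorical_features.npy", cat)
np.save("numerical_features.npy", num)
```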
04-self-supervised-training.md¶
Goal: Train embeddings on unlabeled OCSF data using contrastive learning.
Key steps:
Load processed features from Part 3
Build TabularResNet model (categorical embeddings + residual blocks)
Implement SimCLR-style contrastive loss with data augmentation (sketched below)
Train for 20 epochs
Extract embeddings for all records
Visualize with t-SNE
Output: embeddings.npy, tabular_resnet.pt
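The SimCLR-style objective treats two augmented views of the same record as a positive pair and every other record in the batch as a negative. A minimal NT-Xent sketch under those assumptions, not the notebook's exact implementation:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss for two batches of embeddings (two views of the same records)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2n, d), unit norm
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    # Row i's positive is its other view: i+n for the first half, i-n for the second.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Hypothetical augmentation for tabular features: small Gaussian noise.
def augment(x: torch.Tensor) -> torch.Tensor:
    return x + 0.05 * torch.randn_like(x)
```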
05-embedding-evaluation.md¶
Goal: Evaluate embedding quality before using them for anomaly detection.
Key steps:
Load embeddings from Part 4
Visualize with t-SNE (project 128-dim → 2D)
Compute cluster quality metrics (Silhouette, Davies-Bouldin, Calinski-Harabasz; sketched below)
Find optimal number of clusters (k=2 to k=7)
Inspect nearest neighbors to verify semantic similarity
Generate comprehensive quality report
Output: Quality metrics, visualizations, production readiness verdict
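All three metrics come straight from scikit-learn. A minimal sketch of the k=2..7 sweep over the Part 4 embeddings; using KMeans as the clusterer is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

embeddings = np.load("embeddings.npy")  # (n_records, 128) from Part 4

# Sweep k=2..7 and score each clustering.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(embeddings)
    print(
        f"k={k}: "
        f"silhouette={silhouette_score(embeddings, labels):.3f} (higher is better), "
        f"davies_bouldin={davies_bouldin_score(embeddings, labels):.3f} (lower is better), "
        f"calinski_harabasz={calinski_harabasz_score(embeddings, labels):.1f}"
    )
```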
06-anomaly-detection.md¶
Goal: Detect anomalies using multiple algorithms and compare performance.
Key steps:
Load embeddings from Part 4
k-NN distance-based detection (average distance to neighbors)
Local Outlier Factor (density-based)
Isolation Forest (tree-based)
Ensemble voting (2/3 agreement; sketched below)
Evaluate on labeled subset (if available)
Inspect top anomalies
Output: anomaly_predictions.parquet
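A minimal sketch of the three detectors and the 2-of-3 vote; the neighbor counts and the 2% contamination rate are illustrative choices, not the notebook's tuned values:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

embeddings = np.load("embeddings.npy")
contamination = 0.02  # assumed to match the ~2% labeled anomaly rate

# k-NN: flag points whose average distance to their neighbors is largest.
nn = NearestNeighbors(n_neighbors=10).fit(embeddings)
dists, _ = nn.kneighbors(embeddings)
knn_score = dists[:, 1:].mean(axis=1)  # skip column 0 (self, distance 0)
knn_flag = knn_score > np.quantile(knn_score, 1 - contamination)

# LOF: density-based; fit_predict returns -1 for outliers.
lof_flag = (
    LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    .fit_predict(embeddings) == -1
)

# Isolation Forest: tree-based; predict returns -1 for outliers.
iso_flag = (
    IsolationForest(contamination=contamination, random_state=42)
    .fit(embeddings).predict(embeddings) == -1
)

# Ensemble: anomalous if at least 2 of 3 detectors agree.
votes = knn_flag.astype(int) + lof_flag.astype(int) + iso_flag.astype(int)
is_anomaly = votes >= 2
```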
07-model-inference.md¶
Goal: Load the trained model and generate embeddings for new OCSF data.
Key steps:
Load saved model weights and feature artifacts
Create inference pipeline for new data
Preprocess new OCSF events (same encoding as training)
Generate embeddings using trained TabularResNet
Package model for production deployment (sketched below)
Output: inference_package.pt (model + preprocessing in one file)
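One common pattern for "model + preprocessing in one file" is a torch.save checkpoint that bundles the weights with the fitted feature artifacts. A minimal sketch, assuming tabular_resnet.pt holds a state dict and feature_artifacts.pkl the fitted encoders:

```python
import pickle
import torch

# Fitted encoders/scaler saved by the feature-engineering notebook.
with open("feature_artifacts.pkl", "rb") as f:
    artifacts = pickle.load(f)

# Trained weights from Part 4 (assumed to be a state_dict checkpoint).
state_dict = torch.load("tabular_resnet.pt", map_location="cpu")

# Bundle weights and preprocessing state into one deployable file.
torch.save({"state_dict": state_dict, "artifacts": artifacts}, "inference_package.pt")

# At inference time: load the package (weights_only=False because the
# artifacts contain pickled sklearn objects), rebuild TabularResNet with
# the same hyperparameters, load the state dict, apply the saved encoders
# to new OCSF events, and embed under torch.no_grad().
package = torch.load("inference_package.pt", map_location="cpu", weights_only=False)
```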
Requirements¶
```text
pandas>=1.5.0
numpy>=1.21.0
torch>=2.0.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
pyarrow>=10.0.0
```

Install with:

```bash
pip install pandas numpy torch scikit-learn matplotlib pyarrow
```

Converting MyST Markdown to Jupyter Notebooks¶
The notebooks in this repository are written in MyST Markdown format (.md files with executable code cells). To convert them to standard Jupyter notebooks (.ipynb):
Option 1: Using jupytext (recommended)
```bash
# Install jupytext
pip install jupytext

# Convert a single notebook
jupytext --to ipynb notebooks/03-feature-engineering.md

# Convert all notebooks
jupytext --to ipynb notebooks/*.md
```

Option 2: Using MyST CLI

```bash
# Install mystmd
npm install -g mystmd

# Build notebooks (creates .ipynb in _build/)
myst build --execute
```

Why MyST Markdown?
Version control friendly (clean diffs)
Renders directly on GitHub
Supports rich content (admonitions, cross-references, citations)
Executes during site build for always-fresh output
Alternative: Run with Docker¶
Don’t want to install Python locally? Use the official PyTorch Jupyter image:
```bash
# From the directory containing notebooks/ and data/
docker run -it -p 8889:8888 -v "${PWD}":/home/jovyan/work quay.io/jupyter/pytorch-notebook
```

Then open http://localhost:8889 in your browser and navigate to work/notebooks/.
Generating Your Own Data¶
Want more data or different anomaly scenarios? See Appendix: Generating Training Data for a Docker Compose stack that generates realistic observability data.
Summary¶
This appendix provides everything needed to run the tutorial hands-on:
Sample data: Pre-generated OCSF parquet files with ~27K events
Notebooks: Five Jupyter notebooks covering the core workflow
Minimal setup: just download, install the dependencies, and run
Workflow: Feature Engineering → Self-Supervised Training → Embedding Evaluation → Anomaly Detection → Model Inference