Problem: You want to follow along with hands-on code without reading through all the tutorial text.
Solution: This appendix provides Jupyter notebooks and sample OCSF data that you can run immediately.
Quick Start¶
1. Download the notebooks and sample data below.
2. Install dependencies:
   ```bash
   pip install pandas numpy torch scikit-learn matplotlib pyarrow
   ```
3. Open the notebooks in Jupyter and run the cells.
Download Sample Data¶
Pre-generated OCSF data (~3 hours of synthetic observability events):
Sample data files are in the data/ directory; a quick loading check is sketched after the lists below.
Contents:
- `ocsf_logs.parquet`: ~27,000 application log events (59 columns)
- `ocsf_traces.parquet`: ~2,800 distributed trace spans (17 columns)
- `ocsf_metrics.parquet`: ~7,000 metric data points (33 columns)
- `ocsf_eval_subset.parquet`: 1,000 labeled events for evaluation (~2% anomaly rate)
Anomaly types in the data:
Cache miss storms
Database timeouts
Memory leaks
Slow queries
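To sanity-check the download, the files load directly with pandas. A minimal sketch; the label column in the eval subset is not documented above, so inspect the schema rather than assuming a name:

```python
import pandas as pd

# Paths assume the data/ directory layout listed above.
logs = pd.read_parquet("data/ocsf_logs.parquet")
traces = pd.read_parquet("data/ocsf_traces.parquet")
metrics = pd.read_parquet("data/ocsf_metrics.parquet")
eval_df = pd.read_parquet("data/ocsf_eval_subset.parquet")

print(logs.shape)     # expect roughly (27000, 59)
print(traces.shape)   # expect roughly (2800, 17)
print(metrics.shape)  # expect roughly (7000, 33)

# The eval subset carries anomaly labels (~2% positive); the label column
# name is not documented above, so inspect the schema before using it.
print(eval_df.columns.tolist())
```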
Notebooks¶
View the executed notebooks with output, or download to run yourself:
| Notebook | Description | Prerequisites |
|---|---|---|
| Feature Engineering | Load OCSF data, extract temporal features, encode for ML | Sample data |
| Self-Supervised Training | Train TabularResNet with contrastive learning | Part 3 output |
| Embedding Evaluation | Evaluate embedding quality with metrics and visualization | Part 4 output |
| Anomaly Detection | Compare k-NN, LOF, Isolation Forest detection | Part 4 output |
| Model Inference | Load trained model and generate embeddings for new data | Part 4 output |
Notebook source files are in the notebooks/ directory.
Notebook Workflow¶
What Each Notebook Covers¶
03-feature-engineering.md¶
Goal: Transform raw OCSF data into feature arrays for TabularResNet.
Key steps:
Load OCSF parquet data
Explore schema (59 columns with nested objects flattened)
Extract temporal features (hour, day, cyclical sin/cos encoding)
Select categorical and numerical feature subsets
Handle missing values
Encode with LabelEncoder + StandardScaler (sketched below)
Output: numerical_features.npy, categorical_features.npy, feature_artifacts.pkl
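For reference, here is a minimal sketch of the temporal and encoding steps above. The column names (`time`, `severity`, `duration`) are placeholders, not the notebook's exact schema:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_parquet("data/ocsf_logs.parquet")

# Temporal features: cyclical sin/cos encoding maps hour-of-day onto a
# circle, so 23:00 and 00:00 end up close together.
# Treating "time" as epoch milliseconds is an assumption about the schema.
ts = pd.to_datetime(df["time"], unit="ms")
df["hour"] = ts.dt.hour
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Categoricals -> integer codes, numericals -> zero mean / unit variance.
# "severity" and "duration" are placeholder column names.
cat = LabelEncoder().fit_transform(df["severity"].fillna("unknown").astype(str))
num = StandardScaler().fit_transform(df[["duration", "hour_sin", "hour_cos"]].fillna(0.0))

np.save("categorical_features.npy", cat)
np.save("numerical_features.npy", num)
```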
04-self-supervised-training.md¶
Goal: Train embeddings on unlabeled OCSF data using contrastive learning.
Key steps:
Load processed features from Part 3
Build TabularResNet model (categorical embeddings + residual blocks)
Implement SimCLR-style contrastive loss with data augmentation (sketched below)
Train for 20 epochs
Extract embeddings for all records
Visualize with t-SNE
Output: embeddings.npy, tabular_resnet.pt
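The SimCLR-style objective treats two augmented views of the same record as a positive pair and every other record in the batch as a negative. A minimal NT-Xent sketch under those assumptions, not the notebook's exact implementation:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss for two batches of embeddings (two views of the same records)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2n, d), unit norm
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    # Row i's positive is its other view: i+n for the first half, i-n for the second.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Hypothetical augmentation for tabular features: small Gaussian noise.
def augment(x: torch.Tensor) -> torch.Tensor:
    return x + 0.05 * torch.randn_like(x)
```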
05-embedding-evaluation.md¶
Goal: Evaluate embedding quality before using them for anomaly detection.
Key steps:
Load embeddings from Part 4
Visualize with t-SNE (project 128-dim → 2D)
Compute cluster quality metrics (Silhouette, Davies-Bouldin, Calinski-Harabasz; sketched below)
Find optimal number of clusters (k=2 to k=7)
Inspect nearest neighbors to verify semantic similarity
Generate comprehensive quality report
Output: Quality metrics, visualizations, production readiness verdict
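All three metrics come straight from scikit-learn. A minimal sketch of the k=2..7 sweep over the Part 4 embeddings; using KMeans as the clusterer is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

embeddings = np.load("embeddings.npy")  # (n_records, 128) from Part 4

# Sweep k=2..7 and score each clustering.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(embeddings)
    print(
        f"k={k}: "
        f"silhouette={silhouette_score(embeddings, labels):.3f} (higher is better), "
        f"davies_bouldin={davies_bouldin_score(embeddings, labels):.3f} (lower is better), "
        f"calinski_harabasz={calinski_harabasz_score(embeddings, labels):.1f}"
    )
```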
06-anomaly-detection.md¶
Goal: Detect anomalies using multiple algorithms and compare performance.
Key steps:
Load embeddings from Part 4
k-NN distance-based detection (average distance to neighbors)
Local Outlier Factor (density-based)
Isolation Forest (tree-based)
Ensemble voting (2/3 agreement; sketched below)
Evaluate on labeled subset (if available)
Inspect top anomalies
Output: anomaly_predictions.parquet
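A minimal sketch of the three detectors and the 2-of-3 vote; the neighbor counts and the 2% contamination rate are illustrative choices, not the notebook's tuned values:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

embeddings = np.load("embeddings.npy")
contamination = 0.02  # assumed to match the ~2% labeled anomaly rate

# k-NN: flag points whose average distance to their neighbors is largest.
nn = NearestNeighbors(n_neighbors=10).fit(embeddings)
dists, _ = nn.kneighbors(embeddings)
knn_score = dists[:, 1:].mean(axis=1)  # skip column 0 (self, distance 0)
knn_flag = knn_score > np.quantile(knn_score, 1 - contamination)

# LOF: density-based; fit_predict returns -1 for outliers.
lof_flag = (
    LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    .fit_predict(embeddings) == -1
)

# Isolation Forest: tree-based; predict returns -1 for outliers.
iso_flag = (
    IsolationForest(contamination=contamination, random_state=42)
    .fit(embeddings).predict(embeddings) == -1
)

# Ensemble: anomalous if at least 2 of 3 detectors agree.
votes = knn_flag.astype(int) + lof_flag.astype(int) + iso_flag.astype(int)
is_anomaly = votes >= 2
```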
07-model-inference.md¶
Goal: Load the trained model and generate embeddings for new OCSF data.
Key steps:
Load saved model weights and feature artifacts
Create inference pipeline for new data
Preprocess new OCSF events (same encoding as training)
Generate embeddings using trained TabularResNet
Package model for production deployment (sketched below)
Output: inference_package.pt (model + preprocessing in one file)
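One common pattern for "model + preprocessing in one file" is a torch.save checkpoint that bundles the weights with the fitted feature artifacts. A minimal sketch, assuming tabular_resnet.pt holds a state dict and feature_artifacts.pkl the fitted encoders:

```python
import pickle
import torch

# Fitted encoders/scaler saved by the feature-engineering notebook.
with open("feature_artifacts.pkl", "rb") as f:
    artifacts = pickle.load(f)

# Trained weights from Part 4 (assumed to be a state_dict checkpoint).
state_dict = torch.load("tabular_resnet.pt", map_location="cpu")

# Bundle weights and preprocessing state into one deployable file.
torch.save({"state_dict": state_dict, "artifacts": artifacts}, "inference_package.pt")

# At inference time: load the package (weights_only=False because the
# artifacts contain pickled sklearn objects), rebuild TabularResNet with
# the same hyperparameters, load the state dict, apply the saved encoders
# to new OCSF events, and embed under torch.no_grad().
package = torch.load("inference_package.pt", map_location="cpu", weights_only=False)
```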
Requirements¶
```text
pandas>=1.5.0
numpy>=1.21.0
torch>=2.0.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
pyarrow>=10.0.0
```

Install with:

```bash
pip install pandas numpy torch scikit-learn matplotlib pyarrow
```

Converting MyST Markdown to Jupyter Notebooks¶
The notebooks in this repository are written in MyST Markdown format (.md files with executable code cells). To convert them to standard Jupyter notebooks (.ipynb):
Option 1: Using jupytext (recommended)
```bash
# Install jupytext
pip install jupytext

# Convert a single notebook
jupytext --to ipynb notebooks/03-feature-engineering.md

# Convert all notebooks
jupytext --to ipynb notebooks/*.md
```

Option 2: Using MyST CLI

```bash
# Install mystmd
npm install -g mystmd

# Build notebooks (creates .ipynb in _build/)
myst build --execute
```

Why MyST Markdown?
Version control friendly (clean diffs)
Renders directly on GitHub
Supports rich content (admonitions, cross-references, citations)
Executes during site build for always-fresh output
Alternative: Run with Docker¶
Don’t want to install Python locally? Use the official PyTorch Jupyter image:
```bash
# From the directory containing notebooks/ and data/
docker run -it -p 8889:8888 -v "${PWD}":/home/jovyan/work quay.io/jupyter/pytorch-notebook
```

Then open http://localhost:8889 in your browser and navigate to work/notebooks/.
Generating Your Own Data¶
Want more data or different anomaly scenarios? See Appendix: Generating Training Data for a Docker Compose stack that generates realistic observability data.
Summary¶
This appendix provides everything needed to run the tutorial hands-on:
Sample data: Pre-generated OCSF parquet files with ~27K events
Notebooks: Five Jupyter notebooks covering the core workflow
Minimal setup: just download, install the dependencies, and run
Workflow: Feature Engineering → Self-Supervised Training → Embedding Evaluation → Anomaly Detection → Model Inference