
Problem: You want to follow along with hands-on code without reading through all the tutorial text.

Solution: This appendix provides Jupyter notebooks and sample OCSF data that you can run immediately.


Quick Start

  1. Download the notebooks and sample data below

  2. Install dependencies: pip install pandas numpy torch scikit-learn matplotlib pyarrow

  3. Open the notebooks in Jupyter and run cells


Download Sample Data

Pre-generated OCSF data covers roughly three hours (~27K events) of synthetic observability activity and includes several injected anomaly types. Sample data files are in the data/ directory.
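
A quick sanity check after downloading is to load one of the parquet files with pandas. The file name below is hypothetical; use whichever files you find under data/:

```python
import pandas as pd

# Hypothetical file name -- substitute any parquet file from data/.
df = pd.read_parquet("data/ocsf_events.parquet")

print(df.shape)          # on the order of ~27K rows
print(df.dtypes.head())  # flattened OCSF schema (~59 columns)
```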


Notebooks

View the executed notebooks with output, or download to run yourself:

| Notebook | Description | Prerequisites |
| --- | --- | --- |
| Feature Engineering | Load OCSF data, extract temporal features, encode for ML | Sample data |
| Self-Supervised Training | Train TabularResNet with contrastive learning | Part 3 output |
| Embedding Evaluation | Evaluate embedding quality with metrics and visualization | Part 4 output |
| Anomaly Detection | Compare k-NN, LOF, Isolation Forest detection | Part 4 output |
| Model Inference | Load trained model and generate embeddings for new data | Part 4 output |

Notebook source files are in the notebooks/ directory.


Notebook Workflow

Feature Engineering → Self-Supervised Training → Embedding Evaluation → Anomaly Detection → Model Inference

What Each Notebook Covers

03-feature-engineering.md

Goal: Transform raw OCSF data into feature arrays for TabularResNet.

Key steps:

  1. Load OCSF parquet data

  2. Explore schema (59 columns with nested objects flattened)

  3. Extract temporal features (hour, day, cyclical sin/cos encoding)

  4. Select categorical and numerical feature subsets

  5. Handle missing values

  6. Encode with LabelEncoder + StandardScaler (steps 3 and 6 are sketched below)

Output: numerical_features.npy, categorical_features.npy, feature_artifacts.pkl
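
For reference, here is a minimal sketch of steps 3 and 6. The column names are hypothetical placeholders; the notebook selects its own categorical and numerical subsets from the flattened OCSF schema.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# df is the flattened OCSF DataFrame loaded from parquet.
hour = pd.to_datetime(df["time"]).dt.hour

# Step 3: cyclical encoding places hour 23 next to hour 0 on the unit
# circle, which a raw 0-23 integer feature would not capture.
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)

# Step 6: integer-encode categoricals, standardize numericals.
categorical_cols = ["activity_name", "severity"]   # hypothetical subset
numerical_cols = ["hour_sin", "hour_cos"]          # hypothetical subset

encoders = {c: LabelEncoder().fit(df[c].astype(str)) for c in categorical_cols}
categorical_features = np.stack(
    [encoders[c].transform(df[c].astype(str)) for c in categorical_cols], axis=1)

numerical_features = StandardScaler().fit_transform(df[numerical_cols])

np.save("categorical_features.npy", categorical_features)
np.save("numerical_features.npy", numerical_features)
```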


04-self-supervised-training.md

Goal: Train embeddings on unlabeled OCSF data using contrastive learning.

Key steps:

  1. Load processed features from Part 3

  2. Build TabularResNet model (categorical embeddings + residual blocks)

  3. Implement SimCLR-style contrastive loss with data augmentation (sketched below)

  4. Train for 20 epochs

  5. Extract embeddings for all records

  6. Visualize with t-SNE

Output: embeddings.npy, tabular_resnet.pt
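
The core of step 3 is an NT-Xent (SimCLR-style) loss over two augmented views of each batch. The sketch below shows the general shape, not the notebook's exact implementation; the Gaussian-noise augmentation and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def augment(x: torch.Tensor, noise_std: float = 0.05) -> torch.Tensor:
    """Simple tabular augmentation (assumed): add small Gaussian noise."""
    return x + noise_std * torch.randn_like(x)

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor,
                 temperature: float = 0.5) -> torch.Tensor:
    """SimCLR-style NT-Xent loss for two views z1, z2 of shape (N, d)."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d) unit vectors
    sim = z @ z.T / temperature                         # cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # a row is not its own pair
    # The positive for row i is row i+N, and vice versa.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(sim.device)
    return F.cross_entropy(sim, targets)

# Per batch: z1, z2 = model(augment(x)), model(augment(x))
#            loss = nt_xent_loss(z1, z2)
```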


05-embedding-evaluation.md

Goal: Evaluate embedding quality before using them for anomaly detection.

Key steps:

  1. Load embeddings from Part 4

  2. Visualize with t-SNE (project 128-dim → 2D)

  3. Compute cluster quality metrics (Silhouette, Davies-Bouldin, Calinski-Harabasz; sketched below)

  4. Find optimal number of clusters (k=2 to k=7)

  5. Inspect nearest neighbors to verify semantic similarity

  6. Generate comprehensive quality report

Output: Quality metrics, visualizations, production readiness verdict
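
Steps 3 and 4 boil down to a k sweep with scikit-learn's clustering metrics. A minimal sketch, assuming the embeddings file written by Part 4:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

embeddings = np.load("embeddings.npy")  # 128-dim embeddings from Part 4

# Higher silhouette and Calinski-Harabasz are better; lower Davies-Bouldin is better.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(embeddings)
    print(f"k={k}  silhouette={silhouette_score(embeddings, labels):.3f}  "
          f"DB={davies_bouldin_score(embeddings, labels):.3f}  "
          f"CH={calinski_harabasz_score(embeddings, labels):.0f}")
```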


06-anomaly-detection.md

Goal: Detect anomalies using multiple algorithms and compare performance.

Key steps:

  1. Load embeddings from Part 4

  2. k-NN distance-based detection (average distance to neighbors)

  3. Local Outlier Factor (density-based)

  4. Isolation Forest (tree-based)

  5. Ensemble voting (2/3 agreement; sketched below together with the three detectors)

  6. Evaluate on labeled subset (if available)

  7. Inspect top anomalies

Output: anomaly_predictions.parquet
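
A condensed sketch of steps 2 through 5. The neighbor counts, contamination rate, and thresholding are assumptions; the notebook may tune these differently.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

embeddings = np.load("embeddings.npy")
contamination = 0.01  # assumed anomaly rate

# Step 2: k-NN -- flag points with a large mean distance to their neighbors.
dists, _ = NearestNeighbors(n_neighbors=10).fit(embeddings).kneighbors(embeddings)
knn_scores = dists[:, 1:].mean(axis=1)  # index 0 is the point itself
knn_flags = knn_scores > np.quantile(knn_scores, 1 - contamination)

# Step 3: Local Outlier Factor (density-based); fit_predict returns -1 for outliers.
lof_flags = LocalOutlierFactor(
    n_neighbors=20, contamination=contamination).fit_predict(embeddings) == -1

# Step 4: Isolation Forest (tree-based).
iso_flags = IsolationForest(
    contamination=contamination, random_state=42).fit_predict(embeddings) == -1

# Step 5: ensemble vote -- anomalous when at least 2 of 3 detectors agree.
ensemble_flags = (knn_flags.astype(int) + lof_flags.astype(int)
                  + iso_flags.astype(int)) >= 2
```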


07-model-inference.md

Goal: Load the trained model and generate embeddings for new OCSF data.

Key steps:

  1. Load saved model weights and feature artifacts

  2. Create inference pipeline for new data

  3. Preprocess new OCSF events (same encoding as training)

  4. Generate embeddings using trained TabularResNet

  5. Package model for production deployment (sketched below)

Output: inference_package.pt (model + preprocessing in one file)
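
A minimal sketch of the packaging idea in step 5, using the artifact file names produced by the earlier notebooks:

```python
import pickle
import torch

# Artifacts from Parts 3-4. tabular_resnet.pt may hold a state_dict or a
# full model object, depending on how Part 4 saved it.
weights = torch.load("tabular_resnet.pt", map_location="cpu")
with open("feature_artifacts.pkl", "rb") as f:
    artifacts = pickle.load(f)  # fitted encoders + scaler from Part 3

# Bundle weights and preprocessing state into one deployable file so
# inference code needs only a single artifact.
torch.save({"weights": weights, "artifacts": artifacts}, "inference_package.pt")
```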


Requirements

```text
pandas>=1.5.0
numpy>=1.21.0
torch>=2.0.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
pyarrow>=10.0.0
```

Install with:

```bash
pip install pandas numpy torch scikit-learn matplotlib pyarrow
```

Converting MyST Markdown to Jupyter Notebooks

The notebooks in this repository are written in MyST Markdown format (.md files with executable code cells). To convert them to standard Jupyter notebooks (.ipynb):

Option 1: Using jupytext (recommended)

```bash
# Install jupytext
pip install jupytext

# Convert a single notebook
jupytext --to ipynb notebooks/03-feature-engineering.md

# Convert all notebooks
jupytext --to ipynb notebooks/*.md
```

Option 2: Using MyST CLI

```bash
# Install mystmd
npm install -g mystmd

# Build notebooks (creates .ipynb in _build/)
myst build --execute
```

Why MyST Markdown?

MyST Markdown notebooks are plain-text .md files with executable code cells, so they diff cleanly in version control while still running like regular Jupyter notebooks.

Alternative: Run with Docker

Don’t want to install Python locally? Use the official PyTorch Jupyter image:

```bash
# From the directory containing notebooks/ and data/
docker run -it -p 8889:8888 -v "${PWD}":/home/jovyan/work quay.io/jupyter/pytorch-notebook
```

Then open http://localhost:8889 in your browser and navigate to work/notebooks/.


Generating Your Own Data

Want more data or different anomaly scenarios? See Appendix: Generating Training Data for a Docker Compose stack that generates realistic observability data.


Summary

This appendix provides everything needed to run the tutorial hands-on:

  1. Sample data: Pre-generated OCSF parquet files with ~27K events

  2. Notebooks: Five Jupyter notebooks covering the core workflow

  3. Minimal setup: just download the files, install the dependencies, and run

Workflow: Feature Engineering → Self-Supervised Training → Embedding Evaluation → Anomaly Detection