
This tutorial series builds a production-ready anomaly detection system using ResNet embeddings for observability data.

Introduction: Why ResNet for Anomaly Detection?

This tutorial explores Residual Networks (ResNet), a breakthrough architecture that enables training of very deep neural networks. While ResNet was originally designed for computer vision, it has emerged as a surprisingly strong baseline for tabular data and embedding models.

Motivation: Tabular Data and Anomaly Detection

Recent research has shown that while Transformer-based models (TabTransformer, FT-Transformer) achieve state-of-the-art results on tabular data, ResNet-like architectures provide a simpler, more efficient baseline that often performs comparably well (Gorishniy et al., 2021). For applications like the observability and anomaly-detection use case in this series, ResNet offers several advantages:

  1. Simpler architecture than Transformers (no attention mechanism overhead)

  2. Linear complexity vs. quadratic attention complexity for high-dimensional tabular data (300+ features)

  3. Strong empirical performance on heterogeneous tabular datasets

  4. Efficient embedding extraction for downstream clustering and anomaly detection

The Use Case: OCSF Observability Data

Consider an observability scenario built around OCSF-formatted event records describing system activity.

The approach (Huang et al., 2020):

  1. Pre-train a ResNet to create embeddings from individual records

  2. Extract fixed-dimensional vectors that capture “normal” system behavior

  3. Detect anomalies as records/sequences that deviate from learned patterns

This tutorial series will build your understanding of ResNet from first principles, then show how to deploy it in production.
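As a rough preview of steps 2 and 3, here is a minimal sketch of the anomaly-scoring idea. The `embed` function is a stand-in for the pre-trained ResNet encoder built later in the series, the synthetic data assumes records have already been converted to numeric feature vectors, and the centroid-based scoring with a 99th-percentile threshold is an illustrative choice, not the series' final design.

import torch

# Hypothetical stand-in for a pre-trained ResNet encoder (built later in the
# series): maps each record to a fixed-dimensional embedding vector.
def embed(records: torch.Tensor) -> torch.Tensor:
    return records  # identity placeholder for illustration only

# Synthetic "normal" training records and a handful of new records to score.
train_records = torch.randn(1000, 32)
new_records = torch.randn(5, 32)

# Step 2: characterize "normal" behavior by the centroid of training embeddings.
normal_emb = embed(train_records)
centroid = normal_emb.mean(dim=0)

# Step 3: score new records by distance to the centroid; flag anything beyond
# the 99th percentile of training distances as anomalous.
train_dist = torch.linalg.norm(normal_emb - centroid, dim=1)
threshold = train_dist.quantile(0.99)
scores = torch.linalg.norm(embed(new_records) - centroid, dim=1)
print(scores > threshold)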

Prerequisites

Required Background:

Recommended (but not required):

New to neural networks? Start with our Neural Networks From Scratch series:


The Problem ResNet Solves

The Degradation Problem

Intuitively, deeper neural networks should be more powerful: a deeper network can always match a shallower one, since its extra layers could simply learn identity mappings and pass the data through unchanged.

What is an identity mapping? A transformation where the output equals the input: $f(x) = x$. Think of it like a “do nothing” operation - data passes through unchanged. For example, if a layer receives a vector [1, 2, 3], an identity mapping would output exactly [1, 2, 3]. In theory, deeper networks could use identity mappings in extra layers to match shallower networks, but in practice they fail to learn even this simple operation.

But in practice, this doesn’t happen.

[Figure: training and test error for plain networks of increasing depth]

Key Observation: Beyond a certain depth (~20-30 layers), plain networks start to perform worse on both training and test sets. This isn’t overfitting (training error increases too) — it’s optimization difficulty.

Why Plain Networks Fail

Two main issues:

  1. Vanishing Gradients: As gradients backpropagate through many layers, they get multiplied by small weight matrices repeatedly, shrinking exponentially. Deep layers learn very slowly or not at all.

  2. Degraded Optimization Landscape: Very deep networks create complex, non-convex loss surfaces that are hard for SGD to navigate. Even though a solution exists (copy the shallower network and make extra layers just pass data through unchanged), the optimizer can’t find it.

What We Need

An architecture where extra layers can easily fall back to doing nothing (passing their input through unchanged), and where gradients can flow back to early layers without vanishing.

This is exactly what ResNet provides.


The Core Innovation — Residual Connections

The Residual Block

The Key Idea:

In a traditional neural network, layers learn to transform the input $\mathbf{x}$ into the output $H(\mathbf{x})$ directly.

ResNet changes this by learning the residual $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$ (the difference between the desired output and the input), so each block computes:

$H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$

Where $\mathbf{x}$ is the block's input (passed along unchanged by the skip connection), $F(\mathbf{x})$ is what the block's layers actually learn, and $H(\mathbf{x})$ is the block's output.

Why this helps: if the best a block can do is leave its input unchanged, it only has to drive $F(\mathbf{x})$ toward zero, which is far easier than learning an exact identity through stacked nonlinear layers; and the added $\mathbf{x}$ term gives gradients a direct path back to earlier layers.

Visual comparison: The diagrams below show the key architectural difference:

[Diagrams: a plain network block (stacked layers only) vs. a residual block (the same layers plus an identity skip connection that adds the input to the output)]
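In code, the difference is a single addition in the forward pass. The sketch below is a minimal residual block using Linear layers for simplicity (the original ResNet uses convolutions); the class and variable names are illustrative, not any particular library's API.

import torch
import torch.nn as nn

class MinimalResidualBlock(nn.Module):
    """Computes H(x) = F(x) + x, with F(x) given by two Linear layers."""
    def __init__(self, dim=128):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return self.residual(x) + x  # the skip connection: add the input back

block = MinimalResidualBlock(dim=128)
x = torch.randn(4, 128)
print(block(x).shape)  # torch.Size([4, 128]): same shape in, same shape out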

Why This Works: Intuition

Learning Identity is Easy: to behave as an identity mapping, a residual block only needs to drive $F(\mathbf{x})$ to zero (for example by shrinking its weights toward zero); a plain block would have to learn the identity exactly through its nonlinear layers.

Gradient Flow: because the output is $F(\mathbf{x}) + \mathbf{x}$, gradients flow back through the skip connection unchanged, in addition to flowing through the block's layers. This “gradient highway” keeps early-layer gradients from vanishing, as the experiment below shows.

import logging

logging.getLogger("matplotlib.font_manager").setLevel(logging.ERROR)

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

# Simple plain network (no skip connections)
class PlainNetwork(nn.Module):
    def __init__(self, num_layers=10):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(128, 128) for _ in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

# Simple residual network (with skip connections)
class ResidualNetwork(nn.Module):
    def __init__(self, num_layers=10):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(128, 128) for _ in range(num_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x)) + x  # Skip connection!
        return x

# Create networks
plain_net = PlainNetwork(num_layers=10)
resnet = ResidualNetwork(num_layers=10)

# Dummy input and target
x = torch.randn(32, 128)
target = torch.randn(32, 128)

# Forward + backward for plain network
plain_output = plain_net(x)
plain_loss = ((plain_output - target) ** 2).mean()
plain_loss.backward()

# Collect gradient magnitudes for each layer
plain_grads = []
for layer in plain_net.layers:
    if layer.weight.grad is not None:
        plain_grads.append(layer.weight.grad.abs().mean().item())

# Reset and do the same for ResNet
resnet.zero_grad()
resnet_output = resnet(x)
resnet_loss = ((resnet_output - target) ** 2).mean()
resnet_loss.backward()

resnet_grads = []
for layer in resnet.layers:
    if layer.weight.grad is not None:
        resnet_grads.append(layer.weight.grad.abs().mean().item())

# Plot comparison
fig, ax = plt.subplots(figsize=(10, 6))

layers = list(range(1, len(plain_grads) + 1))
ax.plot(layers, plain_grads, 'o-', color='red', linewidth=2,
        markersize=8, label='Plain Network')
ax.plot(layers, resnet_grads, 's-', color='blue', linewidth=2,
        markersize=8, label='ResNet (with skip connections)')

ax.set_xlabel('Layer Depth (1 = earliest layer)', fontsize=12, fontweight='bold')
ax.set_ylabel('Gradient Magnitude', fontsize=12, fontweight='bold')
ax.set_title('Gradient Flow Comparison: Plain vs ResNet\n(Lower layers = earlier in network)',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_yscale('log')  # Log scale to show exponential decay

# Add annotations
ax.annotate('Vanishing gradients\nin plain network',
            xy=(2, plain_grads[1]), xytext=(3, plain_grads[1]*10),
            arrowprops=dict(arrowstyle='->', color='red', lw=1.5),
            fontsize=10, color='red')
ax.annotate('Strong gradients maintained\nvia skip connections',
            xy=(2, resnet_grads[1]), xytext=(5, resnet_grads[1]*0.1),
            arrowprops=dict(arrowstyle='->', color='blue', lw=1.5),
            fontsize=10, color='blue')

plt.tight_layout()
plt.show()

print("Observation:")
print(f"  Plain Network - Layer 1 gradient: {plain_grads[0]:.6f}")
print(f"  Plain Network - Layer 10 gradient: {plain_grads[-1]:.6f}")
print(f"  Ratio (layer 10 / layer 1): {plain_grads[-1] / plain_grads[0]:.6f}")
print()
print(f"  ResNet - Layer 1 gradient: {resnet_grads[0]:.6f}")
print(f"  ResNet - Layer 10 gradient: {resnet_grads[-1]:.6f}")
print(f"  Ratio (layer 10 / layer 1): {resnet_grads[-1] / resnet_grads[0]:.6f}")
print()
print("Key insight: ResNet maintains much stronger gradients in early layers,")
print("enabling effective training of deep networks.")
[Figure: gradient magnitude per layer (log scale), plain network vs. ResNet]
Observation:
  Plain Network - Layer 1 gradient: 0.000000
  Plain Network - Layer 10 gradient: 0.000022
  Ratio (layer 10 / layer 1): 49.664504

  ResNet - Layer 1 gradient: 0.084232
  ResNet - Layer 10 gradient: 0.204147
  Ratio (layer 10 / layer 1): 2.423623

Key insight: ResNet maintains much stronger gradients in early layers,
enabling effective training of deep networks.

What you’re seeing: in the plain network, gradient magnitudes shrink sharply toward the earliest layers (layer 1’s gradient is roughly 50× smaller than layer 10’s in the run above), while in the ResNet every layer keeps gradients of a similar order of magnitude.

Why this matters for training: Without strong gradients in early layers, those layers barely update during training, making deep plain networks fail to learn effectively. ResNets solve this.


Building Blocks: Basic vs Bottleneck

ResNet comes in several standard variants with different depths: ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. The number indicates the total layer count (e.g., ResNet-50 has 50 layers total). These variants use two different types of residual blocks: the basic block (used in ResNet-18/34) and the bottleneck block (used in ResNet-50 and deeper).

Let’s understand the difference between these two block types.

Basic Block (2 Layers)

Used in ResNet-18 and ResNet-34. Simple structure with 2 layers:

Architecture: Input → Layer 1 → Layer 2 → Add skip connection → Output

Skip connection: $H(x) = F(x) + x$, where $F(x)$ is computed through the two layers

When to use: Shallower networks (18-34 layers) where parameter count isn’t a concern
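A minimal sketch of a convolutional basic block in this style, assuming the usual 3×3 convolution + batch norm + ReLU layout and a same-shape skip connection (real implementations also add a projection on the skip path when a block changes width or resolution, which is omitted here):

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """2-layer basic block (ResNet-18/34 style) for a fixed channel width."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # layer 1
        out = self.bn2(self.conv2(out))           # layer 2
        return self.relu(out + x)                 # add skip connection, then activate

block = BasicBlock(channels=64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])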

Bottleneck Block (3 Layers)

Used in ResNet-50, ResNet-101, and ResNet-152. Optimized structure using reduce-compute-expand:

Architecture: Input → Reduce → Compute → Expand → Add skip connection → Output

The pattern:

  1. Reduce dimensions (e.g., $256 \rightarrow 64$ features) - cheap, small transformation

  2. Compute on reduced dimensions - expensive operations, but on fewer features

  3. Expand back to original size (e.g., $64 \rightarrow 256$ features) - cheap, small transformation

Parameter savings: For 256-dimensional features, two full-width 256 → 256 layers use roughly 131K weights, while the 256 → 64 → 64 → 256 bottleneck uses roughly 37K, about 3.5× fewer (counting Linear-layer weights; the 3×3 convolutions of the original ResNet save considerably more). A worked count appears below.

When to use: Deeper networks (50+ layers) where parameter efficiency is critical

Why This Works: Intuition

The bottleneck design is based on a key insight: most of the useful computation can happen in a lower-dimensional space.

Analogy: Think of data compression. A high-resolution image can be compressed to a smaller size, processed efficiently, then decompressed back. The compressed version still captures the essential information needed for processing.

What’s happening:

  1. Reduce (1×1 conv/linear): Projects high-dimensional features into a compact representation that captures the essential patterns. Like compressing an image before editing it.

  2. Compute (3×3 conv/linear): Performs the expensive transformations on this compact representation. Since we’re working with fewer dimensions (64 instead of 256), this is much cheaper.

  3. Expand (1×1 conv/linear): Projects back to the original high-dimensional space. Like decompressing after processing.

Why it doesn’t hurt performance: The network learns to project into a lower-dimensional subspace where the meaningful transformations happen. The high dimensionality at input/output provides representational capacity, but the actual computation happens efficiently in the bottleneck.

Concrete example (256-dim features): the sketch below counts the weights in a basic-style body (256 → 256 → 256) against a bottleneck-style body (256 → 64 → 64 → 256).

The bottleneck achieves similar representational power with far fewer parameters by concentrating computation in a lower-dimensional space.
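To make the savings concrete, this sketch counts weights in both block bodies built from Linear layers (matching the 256 → 64 → 256 example above; the original ResNet uses 1×1 and 3×3 convolutions, where the savings are even larger). The exact figures depend on the layer types, so treat them as a rough illustration.

import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Basic-style body: two full-width transformations (256 -> 256 -> 256).
basic_body = nn.Sequential(
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256),
)

# Bottleneck-style body: reduce (256 -> 64), compute (64 -> 64), expand (64 -> 256).
bottleneck_body = nn.Sequential(
    nn.Linear(256, 64), nn.ReLU(),   # reduce
    nn.Linear(64, 64), nn.ReLU(),    # compute in the low-dimensional space
    nn.Linear(64, 256),              # expand
)

print(f"basic body:      {count_params(basic_body):,} parameters")      # ~131.6K
print(f"bottleneck body: {count_params(bottleneck_body):,} parameters")  # ~37.2K, about 3.5x fewer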

Visual Comparison

[Diagrams: Basic Block (2 layers): two same-width transformations, then the identity skip connection is added. Bottleneck Block (3 layers): reduce (1×1), compute (3×3), expand (1×1), then the identity skip connection is added.]


ResNet Architecture Overview

A full ResNet stacks these blocks into a multi-stage architecture:

Architecture components:

  1. Initial feature extraction: Transform raw input into initial feature representation

  2. Residual stages: Groups of residual blocks, typically 4 stages with increasing feature dimensions (64→128→256→512)

  3. Aggregation: Pool/reduce features to fixed size

  4. Output head: Final layer(s) for the task (classification, embedding, etc.)

Standard architectures: ResNet-18 (basic blocks, 2-2-2-2 per stage), ResNet-34 (basic, 3-4-6-3), ResNet-50 (bottleneck, 3-4-6-3), ResNet-101 (bottleneck, 3-4-23-3), and ResNet-152 (bottleneck, 3-8-36-3).

Key pattern: Each stage typically starts by projecting to a wider feature dimension (roughly doubling it) and then stacks several residual blocks at that width.

Why this works: The deep stack of residual stages allows the network to learn increasingly abstract representations (from simple patterns to complex concepts), while skip connections ensure gradient flow to all layers.
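A compact sketch of this stage pattern, assuming Linear residual blocks and a simple projection between stages; the stage widths, block counts, and class names here are illustrative, not a faithful reproduction of any published ResNet variant.

import torch
import torch.nn as nn

class LinearResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)  # residual block: F(x) + x

class TinyResNet(nn.Module):
    """Illustrative multi-stage ResNet-style encoder for fixed-size inputs."""
    def __init__(self, in_features, stage_dims=(64, 128, 256, 512), blocks_per_stage=2):
        super().__init__()
        layers = [nn.Linear(in_features, stage_dims[0]), nn.ReLU()]  # 1. initial feature extraction
        for i, dim in enumerate(stage_dims):                         # 2. residual stages
            if i > 0:
                layers.append(nn.Linear(stage_dims[i - 1], dim))     #    widen features between stages
            layers += [LinearResidualBlock(dim) for _ in range(blocks_per_stage)]
        self.stages = nn.Sequential(*layers)
        self.head = nn.Linear(stage_dims[-1], 128)                   # 4. output head (e.g., an embedding)

    def forward(self, x):
        # 3. aggregation is trivial here because inputs are already fixed-size vectors
        return self.head(self.stages(x))

model = TinyResNet(in_features=300)
print(model(torch.randn(8, 300)).shape)  # torch.Size([8, 128])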


Summary

In this part, you learned the core concepts behind Residual Networks:

  1. The degradation problem: Why plain deep networks fail to train effectively

  2. Residual connections ($H(x) = F(x) + x$): The skip connection that solves the problem

  3. Why it works:

    • Makes learning identity mappings easy ($F(x) = 0$)

    • Enables gradient flow through skip connections (the “gradient highway”)

    • Empirically verified with gradient magnitude visualization

  4. Architecture patterns:

    • Basic blocks (2 layers) for shallower networks (ResNet-18/34)

    • Bottleneck blocks (3 layers) for deeper networks (ResNet-50+)

    • Multi-stage design with increasing feature dimensions

Key takeaway: The skip connection is a simple but powerful innovation that makes deep networks trainable. The concept works across all domains—images, text, tabular data—making ResNet one of the most versatile architectures in deep learning.

Next: In Part 2: TabularResNet, you’ll see how to apply these concepts to tabular OCSF data using Linear layers instead of convolutions, plus categorical embeddings for high-cardinality features.


References
  1. Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2021). Revisiting Deep Learning Models for Tabular Data. Advances in Neural Information Processing Systems (NeurIPS), 34, 18932–18943.
  2. Huang, X., Khetan, A., Cvitkovic, M., & Karnin, Z. (2020). TabTransformer: Tabular Data Modeling Using Contextual Embeddings. arXiv Preprint arXiv:2012.06678.