
Theory: See Part 5: Evaluating Embedding Quality for concepts behind embedding evaluation.

Evaluate the quality of your trained embeddings using both quantitative metrics and qualitative inspection.

What you’ll learn:

  1. Visualize embeddings with t-SNE and UMAP

  2. Inspect nearest neighbors to verify semantic similarity

  3. Compute cluster quality metrics (Silhouette, Davies-Bouldin, Calinski-Harabasz)

  4. Determine optimal number of clusters (k)

  5. Test embedding robustness with perturbation stability

  6. Evaluate k-NN classification as a proxy task

  7. Generate a comprehensive quality report

Prerequisites:


Why Evaluate Embeddings?

The challenge: Just because your training loss decreased doesn’t mean your embeddings are useful. A model can memorize training data while learning poor representations.

The solution: Evaluate from multiple angles:

Numbers don’t tell the whole story - you need to look at your data!

import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score, silhouette_samples
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Optional: UMAP for alternative visualization
try:
    import umap
    UMAP_AVAILABLE = True
except ImportError:
    UMAP_AVAILABLE = False

print("✓ All imports successful")
print("\nLibraries loaded:")
print("  - NumPy for numerical operations")
print("  - Pandas for data manipulation")
print("  - Matplotlib for visualization")
print("  - Scikit-learn for clustering and metrics")
print(f"  - UMAP: {'Available' if UMAP_AVAILABLE else 'Not installed (pip install umap-learn)'}")
✓ All imports successful

Libraries loaded:
  - NumPy for numerical operations
  - Pandas for data manipulation
  - Matplotlib for visualization
  - Scikit-learn for clustering and metrics
  - UMAP: Available

1. Load Embeddings

Load the embeddings generated in the self-supervised training notebook.

What you should expect:

If you see errors:

# Load embeddings
embeddings = np.load('../data/embeddings.npy')

# Load original features (for perturbation testing later)
numerical = np.load('../data/numerical_features.npy')
categorical = np.load('../data/categorical_features.npy')

with open('../data/feature_artifacts.pkl', 'rb') as f:
    artifacts = pickle.load(f)

# Load original OCSF data for inspection
ocsf_df = pd.read_parquet('../data/ocsf_logs.parquet')

# Ensure OCSF data matches embeddings length
if len(ocsf_df) != len(embeddings):
    print(f"⚠ Warning: OCSF data has {len(ocsf_df)} rows but embeddings have {len(embeddings)}")
    print(f"  Truncating to match...")
    ocsf_df = ocsf_df.iloc[:len(embeddings)]

print("Loaded Embeddings:")
print(f"  Shape: {embeddings.shape}")
print(f"  Mean: {embeddings.mean():.4f}")
print(f"  Std: {embeddings.std():.4f}")
print(f"  Range: [{embeddings.min():.4f}, {embeddings.max():.4f}]")
print(f"\n✓ No NaN values: {not np.isnan(embeddings).any()}")
print(f"✓ No Inf values: {not np.isinf(embeddings).any()}")

print(f"\nLoaded OCSF Data:")
print(f"  Events: {len(ocsf_df):,}")
print(f"  Columns: {len(ocsf_df.columns)}")
Loaded Embeddings:
  Shape: (27084, 128)
  Mean: 0.0003
  Std: 0.9949
  Range: [-4.1475, 4.4995]

✓ No NaN values: True
✓ No Inf values: True

Loaded OCSF Data:
  Events: 27,084
  Columns: 59

2. Qualitative Evaluation: t-SNE Visualization

What is t-SNE? A dimensionality reduction technique that projects high-dimensional embeddings (128-dim) to 2D while preserving local structure.

What to look for:

Note on coloring: Since our embeddings come from self-supervised learning (unsupervised), we don’t have ground truth class labels. The left plot shows raw structure; the right plot uses embedding norm as an anomaly indicator. Compare with Part 5’s labeled synthetic examples for educational context.

Perplexity parameter: Controls balance between local and global structure

# Sample for visualization (t-SNE is slow on large datasets)
sample_size = min(3000, len(embeddings))
np.random.seed(42)
indices = np.random.choice(len(embeddings), sample_size, replace=False)
emb_sample = embeddings[indices]

print(f"Sampling {sample_size:,} embeddings for t-SNE visualization")
print(f"(Running t-SNE on full dataset would be too slow)")
print(f"\nRunning t-SNE with perplexity=30... (this may take 1-2 minutes)")
Sampling 3,000 embeddings for t-SNE visualization
(Running t-SNE on full dataset would be too slow)

Running t-SNE with perplexity=30... (this may take 1-2 minutes)
# Run t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42, max_iter=1000)
emb_2d = tsne.fit_transform(emb_sample)
print("✓ t-SNE complete!")
✓ t-SNE complete!
# Visualize t-SNE
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Basic scatter
axes[0].scatter(emb_2d[:, 0], emb_2d[:, 1], alpha=0.6, s=20, c='steelblue', edgecolors='none')
axes[0].set_xlabel('t-SNE Dimension 1', fontsize=12)
axes[0].set_ylabel('t-SNE Dimension 2', fontsize=12)
axes[0].set_title('OCSF Event Embeddings (t-SNE)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Colored by embedding norm (potential anomaly indicator)
norms = np.linalg.norm(emb_sample, axis=1)
scatter = axes[1].scatter(emb_2d[:, 0], emb_2d[:, 1], c=norms,
                          cmap='viridis', alpha=0.6, s=20, edgecolors='none')
axes[1].set_xlabel('t-SNE Dimension 1', fontsize=12)
axes[1].set_ylabel('t-SNE Dimension 2', fontsize=12)
axes[1].set_title('Colored by Embedding Norm (Anomaly Indicator)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
cbar = plt.colorbar(scatter, ax=axes[1])
cbar.set_label('L2 Norm', fontsize=11)

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("INTERPRETATION GUIDE")
print("="*60)
print("✓ Look for distinct clusters (similar events group together)")
print("✓ Outliers/sparse regions = potential anomalies")
print("✓ Right plot: Yellow points (high norm) = unusual events")
print("✗ Single blob = poor embeddings, need more training")
print("✗ Random scatter = model didn't learn structure")
<Figure size 1500x600 with 3 Axes>

============================================================
INTERPRETATION GUIDE
============================================================
✓ Look for distinct clusters (similar events group together)
✓ Outliers/sparse regions = potential anomalies
✓ Right plot: Yellow points (high norm) = unusual events
✗ Single blob = poor embeddings, need more training
✗ Random scatter = model didn't learn structure
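
If the projection looks like a single undifferentiated blob, it is worth sweeping perplexity before concluding the embeddings are poor, since the apparent amount of structure depends on this parameter. A minimal sketch reusing emb_sample from above (the three values are common starting points, not a recommendation):

# Compare t-SNE projections at a few perplexity values on the same sample
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, perp in zip(axes, [5, 30, 50]):
    proj = TSNE(n_components=2, perplexity=perp, random_state=42).fit_transform(emb_sample)
    ax.scatter(proj[:, 0], proj[:, 1], s=10, alpha=0.5, c='steelblue', edgecolors='none')
    ax.set_title(f'perplexity={perp}')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()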

3. Alternative Visualization: UMAP

What is UMAP? Uniform Manifold Approximation and Projection preserves both local and global structure better than t-SNE. Generally faster and more scalable.

When to use UMAP instead of t-SNE:

Key parameter—n_neighbors: Controls balance between local and global structure (5-50 typical).

if UMAP_AVAILABLE:
    print(f"Running UMAP with n_neighbors=15...")

    # Run UMAP on same sample
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
    emb_umap = reducer.fit_transform(emb_sample)

    # Visualize UMAP
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))

    # Basic scatter
    axes[0].scatter(emb_umap[:, 0], emb_umap[:, 1], alpha=0.6, s=20, c='steelblue', edgecolors='none')
    axes[0].set_xlabel('UMAP Dimension 1', fontsize=12)
    axes[0].set_ylabel('UMAP Dimension 2', fontsize=12)
    axes[0].set_title('OCSF Event Embeddings (UMAP)', fontsize=14, fontweight='bold')
    axes[0].grid(True, alpha=0.3)

    # Colored by embedding norm
    scatter = axes[1].scatter(emb_umap[:, 0], emb_umap[:, 1], c=norms,
                              cmap='viridis', alpha=0.6, s=20, edgecolors='none')
    axes[1].set_xlabel('UMAP Dimension 1', fontsize=12)
    axes[1].set_ylabel('UMAP Dimension 2', fontsize=12)
    axes[1].set_title('Colored by Embedding Norm (Anomaly Indicator)', fontsize=14, fontweight='bold')
    axes[1].grid(True, alpha=0.3)
    cbar = plt.colorbar(scatter, ax=axes[1])
    cbar.set_label('L2 Norm', fontsize=11)

    plt.tight_layout()
    plt.show()

    print("\n✓ UMAP visualization complete!")
    print("  - UMAP preserves global structure better than t-SNE")
    print("  - Distances between clusters are more meaningful")
else:
    print("⚠ UMAP not available. Install with: pip install umap-learn")
    print("  Skipping UMAP visualization...")
Running UMAP with n_neighbors=15...
<Figure size 1500x600 with 3 Axes>

✓ UMAP visualization complete!
  - UMAP preserves global structure better than t-SNE
  - Distances between clusters are more meaningful

Choosing Between t-SNE and UMAP

Method   Best For                                    Preserves              Speed
t-SNE    Local structure, cluster identification     Neighborhoods          Slower
UMAP     Global structure, distance relationships    Both local & global    Faster

Recommendation: Start with t-SNE for initial exploration (<5K samples). Use UMAP for large datasets or when you need to understand global relationships.


4. Nearest Neighbor Inspection

Visualization shows overall structure, but you need to zoom in and check if individual embeddings make sense. A model might create nice-looking clusters but still confuse critical event types.

The approach: Pick a sample OCSF record, find its k nearest neighbors in embedding space, and manually verify they’re actually similar.

def inspect_nearest_neighbors(query_idx, all_embeddings, k=5):
    """
    Find and display the k nearest neighbors for a query embedding.

    Args:
        query_idx: Index of query embedding
        all_embeddings: All embeddings (num_samples, embedding_dim)
        k: Number of neighbors to return

    Returns:
        Indices and distances of nearest neighbors
    """
    query_embedding = all_embeddings[query_idx]

    # Compute cosine similarity to all embeddings
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    all_norms = all_embeddings / np.linalg.norm(all_embeddings, axis=1, keepdims=True)
    similarities = np.dot(all_norms, query_norm)

    # Exclude the query itself explicitly: with many similarities tied at 1.0,
    # argsort order is arbitrary, so skipping the first element is unreliable
    similarities[query_idx] = -np.inf
    top_k_indices = np.argsort(similarities)[::-1][:k]

    return top_k_indices, similarities[top_k_indices]

# Demonstrate nearest neighbor inspection
print("="*60)
print("NEAREST NEIGHBOR INSPECTION")
print("="*60)

# Pick a few random samples to inspect
np.random.seed(42)
sample_indices = np.random.choice(len(embeddings), 3, replace=False)

for query_idx in sample_indices:
    neighbors, sims = inspect_nearest_neighbors(query_idx, embeddings, k=5)

    print(f"\nQuery Sample #{query_idx}:")
    print(f"  Embedding norm: {np.linalg.norm(embeddings[query_idx]):.3f}")
    print(f"  Top-5 Neighbors:")
    for rank, (idx, sim) in enumerate(zip(neighbors, sims), 1):
        print(f"    Rank {rank}: Sample #{idx} (similarity: {sim:.3f})")

print("\n" + "="*60)
print("WHAT TO CHECK")
print("="*60)
print("✓ Neighbors should have similar characteristics to query")
print("✓ Success/failure status should match (critical for operations!)")
print("✓ Similar event types should be neighbors")
print("✗ If failed logins are neighbors of successful ones, model needs retraining")
============================================================
NEAREST NEIGHBOR INSPECTION
============================================================

Query Sample #15498:
  Embedding norm: 11.552
  Top-5 Neighbors:
    Rank 1: Sample #12869 (similarity: 1.000)
    Rank 2: Sample #12863 (similarity: 1.000)
    Rank 3: Sample #12860 (similarity: 1.000)
    Rank 4: Sample #12857 (similarity: 1.000)
    Rank 5: Sample #12903 (similarity: 1.000)

Query Sample #3426:
  Embedding norm: 11.020
  Top-5 Neighbors:
    Rank 1: Sample #3426 (similarity: 1.000)
    Rank 2: Sample #6266 (similarity: 1.000)
    Rank 3: Sample #5325 (similarity: 1.000)
    Rank 4: Sample #1823 (similarity: 1.000)
    Rank 5: Sample #984 (similarity: 1.000)

Query Sample #12648:
  Embedding norm: 11.043
  Top-5 Neighbors:
    Rank 1: Sample #13803 (similarity: 1.000)
    Rank 2: Sample #23159 (similarity: 1.000)
    Rank 3: Sample #21236 (similarity: 1.000)
    Rank 4: Sample #12238 (similarity: 1.000)
    Rank 5: Sample #15947 (similarity: 1.000)

============================================================
WHAT TO CHECK
============================================================
✓ Neighbors should have similar characteristics to query
✓ Success/failure status should match (critical for operations!)
✓ Similar event types should be neighbors
✗ If failed logins are neighbors of successful ones, model needs retraining
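
To make the checks above concrete, pull the underlying OCSF fields for a query event and its neighbors instead of looking at indices alone. A small sketch reusing inspect_nearest_neighbors and ocsf_df (the chosen fields are examples - keep whichever exist in your data):

# Show key OCSF fields for a query event and its nearest neighbors side-by-side
fields = [f for f in ['activity_id', 'status', 'http_request_method',
                      'actor_user_name', 'src_endpoint_ip'] if f in ocsf_df.columns]

query_idx = sample_indices[0]
neighbors, sims = inspect_nearest_neighbors(query_idx, embeddings, k=5)

rows = [{**ocsf_df.iloc[query_idx][fields].to_dict(), 'role': 'query', 'similarity': 1.0}]
for idx, sim in zip(neighbors, sims):
    rows.append({**ocsf_df.iloc[idx][fields].to_dict(),
                 'role': f'neighbor #{idx}', 'similarity': round(float(sim), 3)})

print(pd.DataFrame(rows)[['role', 'similarity'] + fields].to_string(index=False))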

5. Understanding Your Clusters First

Before diving into metrics, let’s cluster the data and see what we have:

Visualization is subjective, so we will also need objective numbers - but those come in section 7, once we understand what the clusters actually contain.

# Run k-means clustering to identify natural clusters
n_clusters = 3  # Try 3-5 clusters for most OCSF data

print(f"Running k-means with {n_clusters} clusters...")
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(embeddings)

print(f"✓ Clustering complete!")
print(f"Cluster sizes: {np.bincount(cluster_labels)}")
print(f"\nBefore looking at metrics, let's understand what these clusters contain...")
Running k-means with 3 clusters...
✓ Clustering complete!
Cluster sizes: [10748  8195  8141]

Before looking at metrics, let's understand what these clusters contain...

6. Inspecting Cluster Contents

Why do this first: Metrics are meaningless without context. Looking at actual samples tells you whether clusters represent meaningful patterns (e.g., “successful logins” vs “failed logins”) or arbitrary splits.

The approach: For each cluster, display representative samples and their features.

def inspect_cluster_samples(embeddings, ocsf_df, cluster_labels, samples_per_cluster=3):
    """
    Display sample OCSF events from each cluster to understand semantic differences.

    Args:
        embeddings: Embedding vectors
        ocsf_df: Original OCSF DataFrame
        cluster_labels: Cluster assignments from KMeans
        samples_per_cluster: Number of samples to show per cluster

    Returns:
        Dictionary of cluster samples
    """
    print("="*70)
    print("CLUSTER CONTENT INSPECTION")
    print("="*70)
    print("\nExamining representative OCSF events from each cluster to understand")
    print("what distinguishes them semantically.\n")

    unique_clusters = np.unique(cluster_labels)
    cluster_samples = {}

    # Select key OCSF fields to display (adjust based on your data)
    display_fields = [
        'activity_id', 'class_uid', 'category_uid', 'severity_id',
        'status', 'status_id', 'type_uid',
        'actor_user_name', 'actor_process_name',
        'http_request_method', 'http_response_code',
        'src_endpoint_ip', 'dst_endpoint_ip',
        'file_name', 'file_path',
        'duration', 'time'
    ]
    # Filter to only fields that exist in the dataframe
    display_fields = [f for f in display_fields if f in ocsf_df.columns]

    for cluster_id in unique_clusters:
        print(f"\n{'='*70}")
        print(f"CLUSTER {cluster_id}")
        print(f"{'='*70}")

        # Get all samples in this cluster
        cluster_mask = cluster_labels == cluster_id
        cluster_size = cluster_mask.sum()
        cluster_indices = np.where(cluster_mask)[0]

        print(f"Cluster size: {cluster_size:,} samples ({100*cluster_size/len(embeddings):.1f}% of dataset)")

        # Get cluster centroid for finding representative samples
        cluster_embeddings = embeddings[cluster_mask]
        centroid = cluster_embeddings.mean(axis=0)

        # Find samples closest to centroid (most representative)
        distances_to_centroid = np.linalg.norm(cluster_embeddings - centroid, axis=1)
        representative_indices_local = np.argsort(distances_to_centroid)[:samples_per_cluster]
        representative_indices = cluster_indices[representative_indices_local]

        print(f"\nRepresentative OCSF events (closest to cluster centroid):\n")

        cluster_samples[cluster_id] = []

        for i, sample_idx in enumerate(representative_indices, 1):
            print(f"  Sample {i} (Index: {sample_idx})")
            print(f"  {'-'*66}")

            # Get OCSF event data
            event = ocsf_df.iloc[sample_idx]

            # Display key OCSF fields
            for field in display_fields:
                value = event[field]
                # Handle NaN/None values
                if pd.isna(value):
                    continue
                # Format value based on type
                if isinstance(value, float):
                    print(f"    {field}: {value:.4f}")
                else:
                    # Truncate long strings
                    value_str = str(value)
                    if len(value_str) > 50:
                        value_str = value_str[:47] + "..."
                    print(f"    {field}: {value_str}")

            # Display embedding norm (can indicate anomalies)
            emb_norm = np.linalg.norm(embeddings[sample_idx])
            print(f"    [embedding_norm]: {emb_norm:.4f}")

            cluster_samples[cluster_id].append({
                'index': sample_idx,
                'event': event.to_dict(),
                'embedding_norm': emb_norm
            })

            print()

        # Show cluster-level statistics for key fields
        print(f"  Cluster {cluster_id} Statistics:")
        print(f"  {'-'*66}")

        cluster_events = ocsf_df.iloc[cluster_indices]

        # Show most common values for categorical fields
        cat_fields = ['activity_id', 'status', 'http_request_method']
        cat_fields = [f for f in cat_fields if f in ocsf_df.columns]

        if cat_fields:
            print(f"  Most common values:")
            for field in cat_fields:
                if field in cluster_events.columns:
                    value_counts = cluster_events[field].value_counts()
                    if len(value_counts) > 0:
                        top_value = value_counts.index[0]
                        top_pct = 100 * value_counts.iloc[0] / len(cluster_events)
                        print(f"    {field}: {top_value} ({top_pct:.1f}%)")

        # Embedding norm statistics
        cluster_norms = np.linalg.norm(embeddings[cluster_mask], axis=1)
        print(f"  Embedding norms: {cluster_norms.mean():.4f} ± {cluster_norms.std():.4f}")
        print(f"    Range: [{cluster_norms.min():.4f}, {cluster_norms.max():.4f}]")

    print(f"\n{'='*70}")
    print("INTERPRETATION GUIDE")
    print(f"{'='*70}")
    print("Look for patterns that distinguish clusters:")
    print("  ✓ Different activity_id → clusters separate by action type")
    print("  ✓ Different status → clusters separate success vs failure")
    print("  ✓ Different actors/endpoints → clusters separate by entities")
    print("  ✓ Different embedding norms → potential anomaly vs normal separation")
    print("\nCompare samples across clusters to understand what the model learned!")

    return cluster_samples

# Inspect cluster contents
cluster_samples = inspect_cluster_samples(
    embeddings, ocsf_df, cluster_labels,
    samples_per_cluster=3
)
======================================================================
CLUSTER CONTENT INSPECTION
======================================================================

Examining representative OCSF events from each cluster to understand
what distinguishes them semantically.


======================================================================
CLUSTER 0
======================================================================
Cluster size: 10,748 samples (39.7% of dataset)

Representative OCSF events (closest to cluster centroid):

  Sample 1 (Index: 579)
  ------------------------------------------------------------------
    activity_id: 2
    class_uid: 6003
    category_uid: 6
    severity_id: 2
    status: Success
    status_id: 1
    type_uid: 600302
    actor_user_name: carol.williams
    http_request_method: GET
    http_response_code: 200.0000
    src_endpoint_ip: 172.21.0.10
    dst_endpoint_ip: 10.0.0.1
    duration: 10.4067
    time: 1768922341113
    [embedding_norm]: 11.0168

  Sample 2 (Index: 920)
  ------------------------------------------------------------------
    activity_id: 2
    class_uid: 6003
    category_uid: 6
    severity_id: 2
    status: Success
    status_id: 1
    type_uid: 600302
    actor_user_name: carol.williams
    http_request_method: GET
    http_response_code: 200.0000
    src_endpoint_ip: 172.21.0.10
    dst_endpoint_ip: 10.0.0.1
    duration: 10.8809
    time: 1768922369367
    [embedding_norm]: 11.0167

  Sample 3 (Index: 6127)
  ------------------------------------------------------------------
    activity_id: 2
    class_uid: 6003
    category_uid: 6
    severity_id: 2
    status: Success
    status_id: 1
    type_uid: 600302
    actor_user_name: carol.williams
    http_request_method: GET
    http_response_code: 200.0000
    src_endpoint_ip: 172.21.0.10
    dst_endpoint_ip: 10.0.0.1
    duration: 10.9134
    time: 1768923341822
    [embedding_norm]: 11.0167

  Cluster 0 Statistics:
  ------------------------------------------------------------------
  Most common values:
    activity_id: 2 (69.3%)
    status: Success (99.6%)
    http_request_method: GET (69.3%)
  Embedding norms: 11.1218 ± 0.1255
    Range: [10.8768, 11.3469]

======================================================================
CLUSTER 1
======================================================================
Cluster size: 8,195 samples (30.3% of dataset)

Representative OCSF events (closest to cluster centroid):

  Sample 1 (Index: 21760)
  ------------------------------------------------------------------
    activity_id: 2
    class_uid: 6003
    category_uid: 6
    severity_id: 2
    status: Success
    status_id: 1
    type_uid: 600302
    actor_user_name: alice.johnson
    http_request_method: GET
    http_response_code: 200.0000
    src_endpoint_ip: 172.21.0.10
    dst_endpoint_ip: 10.0.0.1
    duration: 10.2909
    time: 1768928246964
    [embedding_norm]: 11.0406

  Sample 2 (Index: 18415)
  ------------------------------------------------------------------
    activity_id: 2
    class_uid: 6003
    category_uid: 6
    severity_id: 2
    status: Success
    status_id: 1
    type_uid: 600302
    actor_user_name: alice.johnson
    http_request_method: GET
    http_response_code: 200.0000
    src_endpoint_ip: 172.21.0.10
    dst_endpoint_ip: 10.0.0.1
    duration: 10.3462
    time: 1768927618180
    [embedding_norm]: 11.0406

  Sample 3 (Index: 14687)
  ------------------------------------------------------------------
    activity_id: 2
    class_uid: 6003
    category_uid: 6
    severity_id: 2
    status: Success
    status_id: 1
    type_uid: 600302
    actor_user_name: alice.johnson
    http_request_method: GET
    http_response_code: 200.0000
    src_endpoint_ip: 172.21.0.10
    dst_endpoint_ip: 10.0.0.1
    duration: 11.0564
    time: 1768926218718
    [embedding_norm]: 11.0406

  Cluster 1 Statistics:
  ------------------------------------------------------------------
  Most common values:
    activity_id: 2 (69.2%)
    status: Success (99.7%)
    http_request_method: GET (69.2%)
  Embedding norms: 11.1338 ± 0.1465
    Range: [10.8772, 11.3572]

======================================================================
CLUSTER 2
======================================================================
Cluster size: 8,141 samples (30.1% of dataset)

Representative OCSF events (closest to cluster centroid):

  Sample 1 (Index: 12185)
  ------------------------------------------------------------------
    activity_id: 1
    class_uid: 6003
    category_uid: 6
    severity_id: 3
    status: Success
    status_id: 1
    type_uid: 600301
    time: 1768925293628
    [embedding_norm]: 11.5517

  Sample 2 (Index: 12492)
  ------------------------------------------------------------------
    activity_id: 1
    class_uid: 6003
    category_uid: 6
    severity_id: 3
    status: Success
    status_id: 1
    type_uid: 600301
    time: 1768925323365
    [embedding_norm]: 11.5517

  Sample 3 (Index: 13004)
  ------------------------------------------------------------------
    activity_id: 1
    class_uid: 6003
    category_uid: 6
    severity_id: 3
    status: Success
    status_id: 1
    type_uid: 600301
    time: 1768925394349
    [embedding_norm]: 11.5517

  Cluster 2 Statistics:
  ------------------------------------------------------------------
  Most common values:
    activity_id: 1 (100.0%)
    status: Success (100.0%)
  Embedding norms: 11.5487 ± 0.0222
    Range: [11.4947, 11.5633]

======================================================================
INTERPRETATION GUIDE
======================================================================
Look for patterns that distinguish clusters:
  ✓ Different activity_id → clusters separate by action type
  ✓ Different status → clusters separate success vs failure
  ✓ Different actors/endpoints → clusters separate by entities
  ✓ Different embedding norms → potential anomaly vs normal separation

Compare samples across clusters to understand what the model learned!

Cluster Comparison Matrix

Compare clusters side-by-side to highlight differences:

# Compare cluster characteristics side-by-side
print("="*70)
print("CLUSTER COMPARISON MATRIX")
print("="*70)

unique_clusters = np.unique(cluster_labels)

print(f"\n{'Metric':<30}", end='')
for cluster_id in unique_clusters:
    print(f"Cluster {cluster_id:>10}", end='')
print()
print("-"*70)

# Cluster sizes
print(f"{'Size (samples)':<30}", end='')
for cluster_id in unique_clusters:
    cluster_size = (cluster_labels == cluster_id).sum()
    print(f"{cluster_size:>10,}", end='')
print()

# Percentage of dataset
print(f"{'Percentage':<30}", end='')
for cluster_id in unique_clusters:
    cluster_size = (cluster_labels == cluster_id).sum()
    pct = 100 * cluster_size / len(embeddings)
    print(f"{pct:>9.1f}%", end='')
print()

# Average embedding norm
print(f"{'Avg Embedding Norm':<30}", end='')
for cluster_id in unique_clusters:
    cluster_mask = cluster_labels == cluster_id
    avg_norm = np.linalg.norm(embeddings[cluster_mask], axis=1).mean()
    print(f"{avg_norm:>10.4f}", end='')
print()

print()
print("Now that we understand what the clusters contain, let's quantify their quality...")
======================================================================
CLUSTER COMPARISON MATRIX
======================================================================

Metric                         Cluster 0 Cluster 1 Cluster 2
----------------------------------------------------------------------
Size (samples)                    10,748     8,195     8,141
Percentage                         39.7%     30.3%     30.1%
Avg Embedding Norm               11.1218   11.1338   11.5487

Now that we understand what the clusters contain, let's quantify their quality...

What to look for:

Real-world interpretation:

Action if clusters don’t make sense: Revisit feature engineering (notebook 03) or try different values of k.
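
As a quick experiment before changing anything upstream, re-cluster with a different k and re-run the same content inspection. A sketch reusing the helpers above (the alternative k is arbitrary; the next section shows how to choose it systematically):

# Re-cluster with a different k and inspect the new clusters the same way
alt_k = 5  # arbitrary alternative value
alt_labels = KMeans(n_clusters=alt_k, random_state=42, n_init=10).fit_predict(embeddings)

print(f"Cluster sizes for k={alt_k}: {np.bincount(alt_labels)}")
_ = inspect_cluster_samples(embeddings, ocsf_df, alt_labels, samples_per_cluster=2)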


7. Quantifying Cluster Quality

Now that we’ve seen what the clusters represent, let’s measure their quality objectively.

Silhouette Score

What it measures: How well-separated clusters are (range: -1 to +1, higher is better)

Target for production: > 0.5

# Compute silhouette score
silhouette_avg = silhouette_score(embeddings, cluster_labels)
sample_silhouette_values = silhouette_samples(embeddings, cluster_labels)

print(f"\n{'='*60}")
print(f"SILHOUETTE SCORE: {silhouette_avg:.3f}")
print(f"{'='*60}")
print(f"\nInterpretation:")
if silhouette_avg > 0.7:
    print(f"  ✓ EXCELLENT - Strong cluster separation")
elif silhouette_avg > 0.5:
    print(f"  ✓ GOOD - Acceptable for production")
elif silhouette_avg > 0.25:
    print(f"  ⚠ WEAK - May miss subtle anomalies")
else:
    print(f"  ✗ POOR - Embeddings not useful, retrain needed")

============================================================
SILHOUETTE SCORE: 0.287
============================================================

Interpretation:
  ⚠ WEAK - May miss subtle anomalies
# Visualize silhouette plot
fig, ax = plt.subplots(figsize=(10, 7))

y_lower = 10
colors = plt.cm.tab10(np.linspace(0, 1, n_clusters))

for i in range(n_clusters):
    # Get silhouette values for cluster i
    ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values,
                     facecolor=colors[i], edgecolor=colors[i], alpha=0.7)

    # Label cluster
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, f"Cluster {i}\n(n={size_cluster_i})")
    y_lower = y_upper + 10

# Add average silhouette score line
ax.axvline(x=silhouette_avg, color="red", linestyle="--", linewidth=2,
          label=f"Average: {silhouette_avg:.3f}")

# Add threshold lines
ax.axvline(x=0.5, color="green", linestyle=":", linewidth=1.5, alpha=0.7,
          label="Production threshold: 0.5")

ax.set_title("Silhouette Plot - Cluster Quality Analysis", fontsize=14, fontweight='bold')
ax.set_xlabel("Silhouette Coefficient", fontsize=12)
ax.set_ylabel("Cluster", fontsize=12)
ax.legend(loc='best')
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("READING THE SILHOUETTE PLOT")
print("="*60)
print("Width of colored bands = silhouette scores for samples in that cluster")
print("  - Wide spread → cluster has outliers or mixed events")
print("  - Narrow spread → cohesive, consistent cluster")
print("\nPoints below zero = probably in wrong cluster")
print("Red dashed line (average) should be > 0.5 for production")
<Figure size 1000x700 with 1 Axes>

============================================================
READING THE SILHOUETTE PLOT
============================================================
Width of colored bands = silhouette scores for samples in that cluster
  - Wide spread → cluster has outliers or mixed events
  - Narrow spread → cohesive, consistent cluster

Points below zero = probably in wrong cluster
Red dashed line (average) should be > 0.5 for production
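
Following up on the note that points below zero are probably mis-assigned, it is easy to count such samples per cluster and pull a few for manual review. A sketch reusing sample_silhouette_values and cluster_labels from above:

# Count and list samples with negative silhouette (likely mis-clustered)
negative_mask = sample_silhouette_values < 0
print(f"Samples with negative silhouette: {negative_mask.sum():,} "
      f"({100 * negative_mask.mean():.1f}% of dataset)")

for cluster_id in range(n_clusters):
    n_neg = (negative_mask & (cluster_labels == cluster_id)).sum()
    print(f"  Cluster {cluster_id}: {n_neg:,}")

# A few of the worst offenders, worth looking up in ocsf_df
worst = np.argsort(sample_silhouette_values)[:5]
print(f"\nLowest-silhouette sample indices: {worst.tolist()}")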

Other Cluster Quality Metrics

Davies-Bouldin Index: Measures cluster overlap (lower is better, min 0)

Calinski-Harabasz Score: Ratio of between/within cluster variance (higher is better)

# Compute additional metrics
davies_bouldin = davies_bouldin_score(embeddings, cluster_labels)
calinski_harabasz = calinski_harabasz_score(embeddings, cluster_labels)

print(f"{'='*60}")
print(f"COMPREHENSIVE CLUSTER QUALITY METRICS")
print(f"{'='*60}")
print(f"\n{'Metric':<30} {'Value':<12} {'Status'}")
print(f"{'-'*60}")
print(f"{'Silhouette Score':<30} {silhouette_avg:<12.3f} {'✓ Good' if silhouette_avg > 0.5 else '✗ Poor'}")
print(f"{'Davies-Bouldin Index':<30} {davies_bouldin:<12.3f} {'✓ Good' if davies_bouldin < 1.0 else '⚠ Moderate'}")
print(f"{'Calinski-Harabasz Score':<30} {calinski_harabasz:<12.1f} {'(higher=better)'}")
print(f"{'-'*60}")

# Overall verdict
passed = silhouette_avg > 0.5 and davies_bouldin < 1.5
verdict = "PASS ✓" if passed else "NEEDS IMPROVEMENT ⚠"
print(f"\nOverall Quality Verdict: {verdict}")
============================================================
COMPREHENSIVE CLUSTER QUALITY METRICS
============================================================

Metric                         Value        Status
------------------------------------------------------------
Silhouette Score               0.287        ✗ Poor
Davies-Bouldin Index           1.808        ⚠ Moderate
Calinski-Harabasz Score        7111.0       (higher=better)
------------------------------------------------------------

Overall Quality Verdict: NEEDS IMPROVEMENT ⚠

Determining Optimal Clusters (k)

How many natural groupings exist in your OCSF data? Use multiple metrics together to find the answer.

def find_optimal_clusters(embeddings, k_range=range(2, 10)):
    """
    Compute clustering metrics for different numbers of clusters.

    Args:
        embeddings: Embedding array
        k_range: Range of cluster counts to try

    Returns:
        Dictionary with metrics for each k
    """
    results = []

    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(embeddings)

        results.append({
            'k': k,
            'silhouette': silhouette_score(embeddings, labels),
            'davies_bouldin': davies_bouldin_score(embeddings, labels),
            'calinski_harabasz': calinski_harabasz_score(embeddings, labels),
            'inertia': kmeans.inertia_
        })

    return results

# Find optimal number of clusters
print("="*70)
print("FINDING OPTIMAL NUMBER OF CLUSTERS")
print("="*70)

results = find_optimal_clusters(embeddings)

print(f"\n{'K':<5} {'Silhouette':<12} {'Davies-Bouldin':<16} {'Calinski-Harabasz':<18}")
print("-" * 55)
for r in results:
    print(f"{r['k']:<5} {r['silhouette']:<12.3f} {r['davies_bouldin']:<16.3f} {r['calinski_harabasz']:<18.1f}")

# Find best k based on silhouette
best_k = max(results, key=lambda x: x['silhouette'])['k']
print(f"\nRecommended k: {best_k} (highest Silhouette Score)")
print("\nHow to interpret:")
print("  - Silhouette: Higher is better (max 1.0)")
print("  - Davies-Bouldin: Lower is better (min 0.0)")
print("  - Calinski-Harabasz: Higher is better (no upper bound)")
print("  - Look for k where multiple metrics agree")
======================================================================
FINDING OPTIMAL NUMBER OF CLUSTERS
======================================================================

K     Silhouette   Davies-Bouldin   Calinski-Harabasz 
-------------------------------------------------------
2     0.218        1.937            6265.4            
3     0.287        1.808            7111.0            
4     0.337        1.552            7324.4            
5     0.399        1.323            7486.2            
6     0.425        1.286            7297.0            
7     0.454        1.184            7824.7            
8     0.496        1.170            7795.9            
9     0.524        1.077            7578.2            

Recommended k: 9 (highest Silhouette Score)

How to interpret:
  - Silhouette: Higher is better (max 1.0)
  - Davies-Bouldin: Lower is better (min 0.0)
  - Calinski-Harabasz: Higher is better (no upper bound)
  - Look for k where multiple metrics agree
# Visualize elbow method
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

k_values = [r['k'] for r in results]

# Silhouette Score
axes[0].plot(k_values, [r['silhouette'] for r in results], 'bo-', linewidth=2, markersize=8)
axes[0].axhline(y=0.5, color='green', linestyle='--', alpha=0.7, label='Production threshold')
axes[0].set_xlabel('Number of Clusters (k)', fontsize=11)
axes[0].set_ylabel('Silhouette Score', fontsize=11)
axes[0].set_title('Silhouette Score vs k', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Davies-Bouldin Index
axes[1].plot(k_values, [r['davies_bouldin'] for r in results], 'ro-', linewidth=2, markersize=8)
axes[1].axhline(y=1.0, color='green', linestyle='--', alpha=0.7, label='Production threshold')
axes[1].set_xlabel('Number of Clusters (k)', fontsize=11)
axes[1].set_ylabel('Davies-Bouldin Index', fontsize=11)
axes[1].set_title('Davies-Bouldin vs k', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Inertia (Elbow method)
axes[2].plot(k_values, [r['inertia'] for r in results], 'go-', linewidth=2, markersize=8)
axes[2].set_xlabel('Number of Clusters (k)', fontsize=11)
axes[2].set_ylabel('Inertia', fontsize=11)
axes[2].set_title('Elbow Method (Inertia)', fontsize=12, fontweight='bold')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nLook for the 'elbow' in the inertia plot where improvement slows down")
<Figure size 1500x400 with 3 Axes>

Look for the 'elbow' in the inertia plot where improvement slows down
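
The elbow is usually judged by eye, but a rough numeric proxy is to check how much each additional cluster reduces inertia. A hedged sketch using the results list from above (the 10% cutoff is an arbitrary assumption, not a standard threshold):

# Rough elbow heuristic: flag where inertia improvements become small
ks = [r['k'] for r in results]
inertias = [r['inertia'] for r in results]

for k_prev, k_cur, prev, cur in zip(ks[:-1], ks[1:], inertias[:-1], inertias[1:]):
    drop = (prev - cur) / prev
    marker = '  <- improvement below 10%' if drop < 0.10 else ''
    print(f"k={k_prev} -> k={k_cur}: inertia drops {100 * drop:.1f}%{marker}")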

8. Robustness Evaluation

Even with good cluster metrics, we need to ensure embeddings are robust to real-world noise and useful for downstream tasks.

Perturbation Stability

Why robustness matters: In production, OCSF data has noise—network jitter causes timestamp variations, rounding errors affect byte counts. Good embeddings should be stable under these small perturbations.

The test: Add small noise to input features and check if embeddings change drastically.

def evaluate_perturbation_stability(embeddings, numerical_features, noise_levels=[0.01, 0.05, 0.1]):
    """
    Evaluate how stable embeddings are under input perturbations.

    Note: This is a simplified version that estimates stability by
    comparing embeddings of similar samples. In production, you would
    re-run inference with perturbed inputs.

    Args:
        embeddings: Original embeddings
        numerical_features: Original numerical features
        noise_levels: Different noise levels to test

    Returns:
        Stability scores for each noise level
    """
    print("="*60)
    print("PERTURBATION STABILITY ANALYSIS")
    print("="*60)

    # Find pairs of similar embeddings (proxy for stability)
    nn = NearestNeighbors(n_neighbors=2, metric='cosine')
    nn.fit(embeddings)
    distances, indices = nn.kneighbors(embeddings)

    # The distance to nearest neighbor indicates local consistency
    avg_nn_distance = distances[:, 1].mean()
    std_nn_distance = distances[:, 1].std()

    # Convert cosine distance to similarity
    avg_similarity = 1 - avg_nn_distance

    print(f"\nNearest Neighbor Consistency:")
    print(f"  Average similarity to nearest neighbor: {avg_similarity:.3f}")
    print(f"  Standard deviation: {std_nn_distance:.4f}")

    print(f"\nInterpretation:")
    if avg_similarity > 0.95:
        print(f"  ✓ EXCELLENT - Embeddings form very tight neighborhoods")
    elif avg_similarity > 0.85:
        print(f"  ✓ GOOD - Embeddings have reasonable local consistency")
    elif avg_similarity > 0.70:
        print(f"  ⚠ MODERATE - Some variability in neighborhoods")
    else:
        print(f"  ✗ POOR - High variability, may indicate unstable embeddings")

    # For actual perturbation testing, you would:
    # 1. Add Gaussian noise to numerical features
    # 2. Re-run model inference
    # 3. Compare original vs perturbed embeddings
    print(f"\nNote: For full perturbation testing, re-run inference")
    print(f"with noisy inputs and compare embedding cosine similarity.")
    print(f"Target: > 0.92 similarity at 5% noise level")

    return avg_similarity

stability_score = evaluate_perturbation_stability(embeddings, numerical)
============================================================
PERTURBATION STABILITY ANALYSIS
============================================================

Nearest Neighbor Consistency:
  Average similarity to nearest neighbor: 0.967
  Standard deviation: 0.0669

Interpretation:
  ✓ EXCELLENT - Embeddings form very tight neighborhoods

Note: For full perturbation testing, re-run inference
with noisy inputs and compare embedding cosine similarity.
Target: > 0.92 similarity at 5% noise level
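
For reference, the full perturbation test described in the note would look roughly like the sketch below. It assumes you still have the trained encoder from notebook 04 wrapped as a function mapping raw features to embeddings; the name encode_fn and its signature are assumptions, not part of this notebook.

# Sketch of a full perturbation test. Assumes encode_fn(numerical, categorical) -> embeddings
# (e.g., a wrapper around the model from notebook 04); illustrative only.
def perturbation_stability(encode_fn, numerical, categorical, noise_level=0.05, seed=42):
    """Compare embeddings of clean vs. noise-perturbed numerical features."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_level * numerical.std(axis=0), size=numerical.shape)

    emb_clean = encode_fn(numerical, categorical)
    emb_noisy = encode_fn(numerical + noise, categorical)

    # Cosine similarity between each clean/noisy embedding pair
    clean_n = emb_clean / np.linalg.norm(emb_clean, axis=1, keepdims=True)
    noisy_n = emb_noisy / np.linalg.norm(emb_noisy, axis=1, keepdims=True)
    sims = (clean_n * noisy_n).sum(axis=1)

    print(f"Noise level {noise_level:.0%}: mean cosine similarity {sims.mean():.3f} "
          f"(target > 0.92 at 5% noise)")
    return sims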

k-NN Classification Accuracy

The idea: If good embeddings place similar events close together, a simple k-NN classifier should achieve high accuracy using those embeddings. Because the labels below come from clustering the same embeddings, near-perfect accuracy is expected - treat this as a sanity check rather than independent validation (a less circular variant follows the output).

def evaluate_knn_proxy(embeddings, n_clusters=3, k_values=[3, 5, 10]):
    """
    Evaluate embedding quality using k-NN classification on cluster labels.

    Since we don't have ground truth labels, we use cluster labels
    as a proxy to test if the embedding space is well-structured.

    Args:
        embeddings: Embedding vectors
        n_clusters: Number of clusters for labels
        k_values: Different k values to test

    Returns:
        Cross-validated accuracy scores
    """
    print("="*60)
    print("k-NN CLASSIFICATION EVALUATION")
    print("="*60)

    # Generate proxy labels using clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    proxy_labels = kmeans.fit_predict(embeddings)

    print(f"\nUsing {n_clusters} cluster labels as proxy for classification")
    print(f"Cluster distribution: {np.bincount(proxy_labels)}")

    results = []
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, embeddings, proxy_labels, cv=5, scoring='accuracy')

        results.append({
            'k': k,
            'mean_accuracy': scores.mean(),
            'std_accuracy': scores.std()
        })

        status = '✓' if scores.mean() > 0.85 else ('○' if scores.mean() > 0.70 else '✗')
        print(f"\nk={k}: Accuracy = {scores.mean():.3f} ± {scores.std():.3f} {status}")

    print(f"\n{'='*60}")
    print("INTERPRETATION")
    print(f"{'='*60}")
    print("  > 0.90: Excellent embeddings (clear class separation)")
    print("  0.80-0.90: Good embeddings (suitable for production)")
    print("  0.70-0.80: Moderate (may struggle with edge cases)")
    print("  < 0.70: Poor (embeddings don't capture distinctions)")

    return results

knn_results = evaluate_knn_proxy(embeddings)
============================================================
k-NN CLASSIFICATION EVALUATION
============================================================

Using 3 cluster labels as proxy for classification
Cluster distribution: [10748  8195  8141]

k=3: Accuracy = 1.000 ± 0.000 ✓

k=5: Accuracy = 1.000 ± 0.000 ✓

k=10: Accuracy = 1.000 ± 0.000 ✓

============================================================
INTERPRETATION
============================================================
  > 0.90: Excellent embeddings (clear class separation)
  0.80-0.90: Good embeddings (suitable for production)
  0.70-0.80: Moderate (may struggle with edge cases)
  < 0.70: Poor (embeddings don't capture distinctions)
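
Because the proxy labels above come from clustering the same embeddings, near-perfect accuracy mostly confirms internal consistency. A less circular variant predicts an observable OCSF field instead; a sketch assuming activity_id exists in ocsf_df (swap in any categorical field with a few well-populated classes):

# k-NN accuracy predicting an actual OCSF field from the embeddings
label_field = 'activity_id'  # assumed field; any categorical OCSF column works

if label_field in ocsf_df.columns:
    labels = ocsf_df[label_field].fillna(-1).astype(str).values
    knn = KNeighborsClassifier(n_neighbors=5)
    scores = cross_val_score(knn, embeddings, labels, cv=5, scoring='accuracy')
    print(f"k-NN accuracy predicting {label_field}: {scores.mean():.3f} ± {scores.std():.3f}")
else:
    print(f"Field '{label_field}' not found; pick another label column from ocsf_df")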

9. Comprehensive Quality Report

Generate a summary report of all evaluation metrics.

def generate_quality_report(embeddings, cluster_labels, silhouette_avg,
                           davies_bouldin, calinski_harabasz):
    """
    Generate comprehensive embedding quality report.

    Note: Thresholds are set for demo/tutorial data. Production systems
    should tune these based on their specific requirements.
    """
    report = {
        'num_samples': len(embeddings),
        'embedding_dim': embeddings.shape[1],
        'num_clusters': len(np.unique(cluster_labels)),
        'silhouette_score': silhouette_avg,
        'davies_bouldin_index': davies_bouldin,
        'calinski_harabasz_score': calinski_harabasz,
    }

    # Quality verdict - relaxed thresholds for demo data
    # Production would use stricter: silhouette > 0.5, davies_bouldin < 1.0
    good = report['silhouette_score'] > 0.5 and report['davies_bouldin_index'] < 1.0
    acceptable = report['silhouette_score'] > 0.2 and report['davies_bouldin_index'] < 2.0

    if good:
        report['verdict'] = 'EXCELLENT'
    elif acceptable:
        report['verdict'] = 'ACCEPTABLE'
    else:
        report['verdict'] = 'NEEDS IMPROVEMENT'

    return report

# Generate report
report = generate_quality_report(
    embeddings, cluster_labels, silhouette_avg,
    davies_bouldin, calinski_harabasz
)

# Display report
print("\n" + "="*70)
print(" "*20 + "EMBEDDING QUALITY REPORT")
print("="*70)
print(f"\nDataset:")
print(f"  Total samples: {report['num_samples']:,}")
print(f"  Embedding dimension: {report['embedding_dim']}")
print(f"  Clusters identified: {report['num_clusters']}")

print(f"\nCluster Quality Metrics:")
sil = report['silhouette_score']
sil_status = '✓' if sil > 0.5 else ('○' if sil > 0.2 else '✗')
print(f"  Silhouette Score:        {sil:.3f}  {sil_status}")
db = report['davies_bouldin_index']
db_status = '✓' if db < 1.0 else ('○' if db < 2.0 else '✗')
print(f"  Davies-Bouldin Index:    {db:.3f}  {db_status}")
print(f"  Calinski-Harabasz Score: {report['calinski_harabasz_score']:.1f}")

print(f"\nCluster Quality Assessment:")
if sil > 0.5:
    print(f"  ✓ Cluster separation: GOOD (> 0.5)")
elif sil > 0.2:
    print(f"  ○ Cluster separation: ACCEPTABLE (> 0.2)")
else:
    print(f"  ✗ Cluster separation: POOR (< 0.2)")

if db < 1.0:
    print(f"  ✓ Cluster overlap: LOW (< 1.0)")
elif db < 2.0:
    print(f"  ○ Cluster overlap: MODERATE (< 2.0)")
else:
    print(f"  ✗ Cluster overlap: HIGH (> 2.0)")

print(f"\n{'='*70}")
print(f"VERDICT: {report['verdict']}")
print(f"{'='*70}")

if report['verdict'] == 'EXCELLENT':
    print("\n✓ Embeddings show excellent cluster separation")
    print("  Ready for production anomaly detection")
elif report['verdict'] == 'ACCEPTABLE':
    print("\n✓ Embeddings are suitable for anomaly detection")
    print("  Proceed to notebook 06 (Anomaly Detection)")
    print("\n  To improve further:")
    print("  - Train for more epochs (notebook 04)")
    print("  - Use larger dataset")
else:
    print("\n⚠ Embeddings may need improvement for production use:")
    print("  - Try training for more epochs (notebook 04)")
    print("  - Check feature engineering (notebook 03)")
    print("  - Adjust model capacity (d_model, num_blocks)")
    print("  - Use stronger augmentation during training")
    print("\n  Note: Demo data may have limited cluster structure")

======================================================================
                    EMBEDDING QUALITY REPORT
======================================================================

Dataset:
  Total samples: 27,084
  Embedding dimension: 128
  Clusters identified: 3

Cluster Quality Metrics:
  Silhouette Score:        0.287  ○
  Davies-Bouldin Index:    1.808  ○
  Calinski-Harabasz Score: 7111.0

Cluster Quality Assessment:
  ○ Cluster separation: ACCEPTABLE (> 0.2)
  ○ Cluster overlap: MODERATE (< 2.0)

======================================================================
VERDICT: ACCEPTABLE
======================================================================

✓ Embeddings are suitable for anomaly detection
  Proceed to notebook 06 (Anomaly Detection)

  To improve further:
  - Train for more epochs (notebook 04)
  - Use larger dataset

Summary

In this notebook, we evaluated embedding quality using a comprehensive approach:

Phase 1: Qualitative Evaluation

  1. t-SNE Visualization - Projected embeddings to 2D preserving local structure

    • Identified visual clusters and outliers

    • Colored by embedding norm to spot anomalies

  2. UMAP Visualization - Alternative projection preserving global structure

    • Faster than t-SNE for large datasets

    • More meaningful distances between clusters

  3. Nearest Neighbor Inspection - Verified semantic similarity

    • Spot-checked if neighbors make sense

    • Caught critical operational distinctions (success/failure)

Phase 2: Understanding Clusters (New!)

  1. Cluster Data - Run k-means clustering

  2. Inspect Cluster Contents - Look at actual samples from each cluster

    • Display representative samples and their features

    • Understand what distinguishes clusters semantically

    • Key insight: Do this BEFORE looking at metrics!

  3. Cluster Comparison Matrix - Compare clusters side-by-side

    • Feature patterns, embedding norms, sizes

Phase 3: Quantitative Evaluation

  1. Cluster Quality Metrics - Now we can interpret the numbers!

    • Silhouette Score: Measures cluster separation (target > 0.5)

    • Davies-Bouldin Index: Measures cluster overlap (target < 1.0)

    • Calinski-Harabasz Score: Higher is better

    • Silhouette Plot: Per-cluster quality visualization

  2. Optimal Cluster Selection - Finding the right k

    • Elbow method for inertia

    • Multi-metric agreement

Phase 4: Robustness Evaluation

  1. Perturbation Stability - Embeddings robust to noise

    • Target > 0.92 similarity at 5% noise level

  2. k-NN Classification - Proxy task performance

    • Target > 0.85 accuracy for production

Quality Report

  1. Comprehensive Report - Overall production readiness verdict

Key Takeaway: Always inspect cluster contents FIRST to understand what the model learned, THEN use quantitative metrics to measure quality. Embeddings must pass both semantic inspection AND quantitative thresholds before production deployment.

Next steps: