
L08 - Architecture Design Choices: Choosing Your Hyperparameters

From toy models to production: How to design your GPT architecture


We built a complete GPT in L07 - Assembling the GPT. But when you look at the code, you see parameters like d_model, n_heads, and n_layers. How do you choose these values?

This isn’t arbitrary. There are established patterns, mathematical constraints, and practical trade-offs that guide these decisions. In this lesson, we’ll learn how to design architectures for different use cases—from tiny models that run on your laptop to production-scale systems.

By the end of this post, you’ll understand:

  • The six core hyperparameters and the mathematical constraints that link them
  • How real-world models (GPT-2, GPT-3, Llama 2) are configured
  • How to estimate parameter counts, memory, and compute requirements
  • How to choose an architecture for your own use case


Part 1: The Core Hyperparameters

When designing a GPT architecture, you have six primary hyperparameters to set:

The Six Knobs

| Parameter | Symbol | Description | Typical Values |
|---|---|---|---|
| Model Dimension | d_model | Width of token embeddings | 512, 768, 1024, 2048, 4096 |
| Number of Layers | n_layers | Depth of the transformer stack | 6, 12, 24, 32, 96 |
| Attention Heads | n_heads | Parallel attention mechanisms per layer | 8, 12, 16, 32, 64 |
| FFN Dimension | d_ff | Hidden size in feed-forward network | Usually 4 × d_model |
| Vocabulary Size | vocab_size | Number of unique tokens | 32k, 50k, 100k |
| Context Window | max_len | Maximum sequence length | 512, 2048, 4096, 8192 |

Mathematical Constraints

Not all combinations are valid. Here are the hard rules:

1. Head Dimension Constraint

assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
d_head = d_model // n_heads  # Head dimension

Why? Multi-head attention splits d_model into n_heads equal chunks. If d_model=512 and n_heads=8, each head gets d_head=64 dimensions.
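
To make the split concrete, here is a minimal sketch (assuming a PyTorch setup, with made-up batch and sequence sizes) of how a (batch, seq, d_model) activation is reshaped into heads:

import torch

batch, seq, d_model, n_heads = 2, 10, 512, 8
d_head = d_model // n_heads  # 64

x = torch.randn(batch, seq, d_model)                          # (2, 10, 512)
heads = x.view(batch, seq, n_heads, d_head).transpose(1, 2)   # (2, 8, 10, 64)

Each of the 8 heads then attends over its own 64-dimensional slice of the representation.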

2. Typical Head Dimension

In practice, d_head is almost always 64, with 128 used in the largest models (see the table in Part 2).

This means: d_model = n_heads × 64 (or 128 for very large models)

3. FFN Expansion Factor

The feed-forward network typically expands the hidden dimension by a factor of 4:

d_ff = 4 * d_model

So if d_model=768, then d_ff=3072.
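
As a quick sketch (assuming a PyTorch-style module; GPT-2 uses GELU as the nonlinearity), the feed-forward block for d_model=768 looks like:

import torch.nn as nn

d_model = 768
d_ff = 4 * d_model  # 3072

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # expand: 768 -> 3072
    nn.GELU(),                  # nonlinearity
    nn.Linear(d_ff, d_model),   # project back: 3072 -> 768
)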


Part 2: Real-World Architecture Examples

Let’s look at how actual models are configured:

| Model | d_model | n_layers | n_heads | d_head | Parameters |
|---|---|---|---|---|---|
| Tiny (Custom) | 512 | 6 | 8 | 64 | ~40M |
| GPT-2 Small | 768 | 12 | 12 | 64 | 124M |
| GPT-2 Medium | 1024 | 24 | 16 | 64 | 355M |
| GPT-2 Large | 1280 | 36 | 20 | 64 | 774M |
| GPT-2 XL | 1600 | 48 | 25 | 64 | 1.5B |
| GPT-3 Medium | 1024 | 24 | 16 | 64 | 350M |
| Llama 2 7B | 4096 | 32 | 32 | 128 | 7B |
| Llama 2 13B | 5120 | 40 | 40 | 128 | 13B |
| GPT-3 175B | 12288 | 96 | 96 | 128 | 175B |

Key Observations

  1. Head dimension stays constant: Almost always 64 or 128

  2. Both depth and width scale: Larger models increase both n_layers and d_model

  3. Heads scale with width: More heads as the model gets wider

  4. Layers increase more slowly: Depth grows from 6 → 96, while width grows from 512 → 12288


Part 3: Counting Parameters

Understanding parameter count helps you estimate memory and compute requirements.

Parameter Count Formula

For a transformer with vocabulary size V, maximum context length T, L layers, and model dimension d:

Total parameters ≈:

Token Embeddings:    V × d
Position Embeddings: T × d (if learned, 0 if sinusoidal)

Per Layer (×L):
  - Multi-Head Attention:
    - Q, K, V projections: 3 × (d × d) = 3d²
    - Output projection: d × d = d²
    - Total: 4d²

  - Feed-Forward Network:
    - First linear: d × 4d = 4d²
    - Second linear: 4d × d = 4d²
    - Total: 8d²

  - Layer Norms: 4d (negligible)

Final Output:
  - LM Head: V × d (often tied to embedding, so 0)

Total ≈ V×d + T×d + L×(4d² + 8d²) = V×d + T×d + 12Ld²

Dominant term: For large models, 12Ld² dominates.

Example: GPT-2 Small

Embeddings:  50,257 × 768 ≈ 38.6M
Positions:   1,024 × 768 ≈ 0.8M
Layers:      12 × 12 × 768² ≈ 85.0M
────────────────────────────────
Total:       ≈ 124.4M parameters

This matches the official GPT-2 Small size!
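
To make the formula reusable, here is a small sketch (plain Python; the helper name is my own) that reproduces the estimate:

def count_params(vocab_size, max_len, n_layers, d_model, tied_embeddings=True):
    """Approximate parameter count using the 12·L·d² rule of thumb."""
    embed = vocab_size * d_model           # token embeddings
    pos = max_len * d_model                # learned position embeddings
    per_layer = 12 * d_model ** 2          # 4d² attention + 8d² FFN per layer
    lm_head = 0 if tied_embeddings else vocab_size * d_model
    return embed + pos + n_layers * per_layer + lm_head

print(count_params(50_257, 1_024, 12, 768))      # ~124.3M (GPT-2 Small)
print(count_params(100_000, 4_096, 32, 4_096))   # ~6.9B (7B-scale config from Part 6)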


Part 4: Design Decision Tree

How do you choose these parameters for YOUR use case?

Start with Your Constraint

If you have limited compute (e.g., laptop, small GPU): start with a tiny configuration (d_model=512, n_layers=6, roughly 40M parameters, like tiny_config in Part 6) and scale up only once your training pipeline works.

If you want to fine-tune an existing model: the architecture is already fixed; adopt the pretrained model’s configuration and tokenizer rather than designing your own.

If you’re training from scratch with good compute: start from an established configuration (GPT-2 Small/Medium or a Llama-style setup) and adjust depth and width to your data and budget.

Scaling Strategy

When scaling up, follow these rules:

  1. Scale width and depth together: Don’t just make it wider or deeper

    • Bad: d_model=4096, n_layers=6 (too shallow)

    • Bad: d_model=512, n_layers=96 (too narrow)

    • Good: d_model=1024, n_layers=24

  2. Keep d_head constant at 64 or 128:

    • When increasing d_model, also increase n_heads proportionally

    • Example: d_model=1024, n_heads=16 → d_head=64

  3. Use 4× expansion in FFN:

    • This is the standard default across GPT-style models: d_ff = 4 * d_model

  4. Vocabulary size:

    • 32k-50k for most tasks

    • 100k for multilingual models

  5. Context window:

    • 512-1024 for older/smaller models

    • 2048-4096 for modern models

    • 8192+ for long-context applications
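
Putting these rules together, here is a minimal helper sketch (the function name is my own, not part of the lesson’s codebase) that derives a consistent configuration from just a width and a depth:

def make_config(d_model, n_layers, vocab_size=50_000, max_len=2_048, d_head=64):
    """Derive a consistent config: n_heads from d_head, FFN at 4x expansion."""
    assert d_model % d_head == 0, "d_model must be a multiple of d_head"
    return {
        "d_model": d_model,
        "n_layers": n_layers,
        "n_heads": d_model // d_head,   # keep d_head constant at 64 (or 128)
        "d_ff": 4 * d_model,            # standard 4x expansion
        "vocab_size": vocab_size,
        "max_len": max_len,
    }

print(make_config(1024, 24))  # GPT-2 Medium-like: 16 heads, d_ff=4096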


Part 5: Memory and Compute Estimates

Memory Requirements

Training (most expensive): you must store weights (4 bytes/param in fp32), gradients (4 bytes/param), and Adam optimizer states (8 bytes/param), plus activations for the backward pass.

Total: ~20 × num_params bytes for training

Example: GPT-2 Small (124M params) needs roughly 124M × 20 ≈ 2.5 GB of training memory, and more with large batch sizes, since activations grow with batch and sequence length.

Inference (much cheaper): only the weights are needed, about 2 bytes/param in fp16 (4 in fp32), plus a comparatively small KV cache that grows with sequence length.

Example: GPT-2 Small needs roughly 124M × 2 ≈ 0.25 GB in fp16 (about 0.5 GB in fp32).
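
As a rough back-of-the-envelope helper (a sketch using the approximations above; real usage varies with batch size, precision, and framework overhead):

def estimate_memory_gb(num_params, training=True):
    """Rule of thumb: ~20 bytes/param for training, ~2 bytes/param for fp16 inference."""
    bytes_per_param = 20 if training else 2
    return num_params * bytes_per_param / 1e9

print(estimate_memory_gb(124e6))                 # ~2.5 GB to train GPT-2 Small
print(estimate_memory_gb(7e9, training=False))   # ~14 GB to serve a 7B model in fp16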

Compute Requirements

Compute scales as O(12Ld²) per token: in practice, a forward pass costs roughly 2 × num_params FLOPs per token, and training costs roughly 6 × num_params FLOPs per token (forward plus backward).

Example: Llama 2 7B processing 2048 tokens takes roughly 2 × 7B × 2048 ≈ 2.9 × 10¹³ FLOPs (about 29 TFLOPs) for a single forward pass.

On an A100 GPU (312 TFLOPS peak in bf16), that is on the order of 0.1 seconds at full utilization; at a more realistic 30-50% utilization, expect a few tenths of a second.
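
The same estimate as a tiny sketch (the peak TFLOPS and utilization figures are assumptions; adjust them for your hardware):

def forward_flops(num_params, num_tokens):
    """~2 FLOPs per parameter per token for a forward pass (use ~6x for training)."""
    return 2 * num_params * num_tokens

flops = forward_flops(7e9, 2048)
print(f"{flops:.2e} FLOPs")                                              # ~2.9e13
print(f"{flops / (312e12 * 0.4):.2f} s on an A100 at 40% utilization")   # ~0.23 s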


Part 6: Quick Reference Guide

Choose Your Architecture

# Tiny model (for experimentation)
tiny_config = {
    "d_model": 512,
    "n_layers": 6,
    "n_heads": 8,
    "vocab_size": 32_000,
    "max_len": 1024
}
# Parameters: ~40M

# Small model (GPT-2 Small equivalent)
small_config = {
    "d_model": 768,
    "n_layers": 12,
    "n_heads": 12,
    "vocab_size": 50_000,
    "max_len": 2048
}
# Parameters: ~124M

# Medium model (GPT-2 Medium equivalent)
medium_config = {
    "d_model": 1024,
    "n_layers": 24,
    "n_heads": 16,
    "vocab_size": 50_000,
    "max_len": 2048
}
# Parameters: ~355M

# Large model (7B scale)
large_config = {
    "d_model": 4096,
    "n_layers": 32,
    "n_heads": 32,
    "vocab_size": 100_000,
    "max_len": 4096
}
# Parameters: ~7B

Validation Checklist

Before training, verify:

  • d_model % n_heads == 0 (head dimension constraint)
  • d_head = d_model / n_heads is 64 or 128
  • d_ff = 4 × d_model
  • The estimated parameter count (≈ 12 × n_layers × d_model² plus embeddings) fits your memory budget
  • vocab_size matches your tokenizer and max_len covers your longest training sequences
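
A minimal sketch of this checklist in code (the helper name is my own), reusing the config dictionaries defined above:

def validate_config(cfg):
    """Assert the architectural constraints from this lesson for a config dict."""
    assert cfg["d_model"] % cfg["n_heads"] == 0, "d_model must be divisible by n_heads"
    d_head = cfg["d_model"] // cfg["n_heads"]
    assert d_head in (64, 128), f"unusual head dimension: {d_head}"
    assert cfg.get("d_ff", 4 * cfg["d_model"]) == 4 * cfg["d_model"], "expected 4x FFN expansion"
    print(f"OK: d_head={d_head}, d_ff={4 * cfg['d_model']}")

validate_config(small_config)   # OK: d_head=64, d_ff=3072
validate_config(large_config)   # OK: d_head=128, d_ff=16384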


Summary

  1. Mathematical Constraints: d_model must be divisible by n_heads, with d_head typically 64 or 128

  2. Scaling: Increase both width (d_model) and depth (n_layers) together

  3. Real-World Patterns: Follow established configurations (GPT-2, Llama) as starting points

  4. Parameter Count: Dominated by 12Ld² for the transformer layers

  5. Memory: Training requires ~20× parameter count in bytes

  6. Design for Your Use Case: Start small for experimentation, scale up based on performance needs

Next Up: L09 – Training the LLM. Now that you’ve designed your architecture, we’ll learn how to train it with modern optimization techniques, learning rate schedules, and gradient accumulation.