From toy models to production: How to design your GPT architecture
We built a complete GPT in L07 - Assembling the GPT. But when you look at the code, you see parameters like d_model, n_heads, and n_layers. How do you choose these values?
This isn’t arbitrary. There are established patterns, mathematical constraints, and practical trade-offs that guide these decisions. In this lesson, we’ll learn how to design architectures for different use cases—from tiny models that run on your laptop to production-scale systems.
By the end of this post, you’ll understand:
The mathematical constraints between parameters (why d_model must be divisible by n_heads)
Common architecture patterns used in real models (GPT-2, GPT-3, Llama)
Trade-offs between model size, speed, and quality
How to estimate parameter counts and memory requirements
Part 1: The Core Hyperparameters¶
When designing a GPT architecture, you have six primary hyperparameters to choose:
The Six Knobs¶
| Parameter | Symbol | Description | Typical Values |
|---|---|---|---|
| Model Dimension | d_model | Width of token embeddings | 512, 768, 1024, 2048, 4096 |
| Number of Layers | n_layers | Depth of the transformer stack | 6, 12, 24, 32, 96 |
| Attention Heads | n_heads | Parallel attention mechanisms per layer | 8, 12, 16, 32, 64 |
| FFN Dimension | d_ff | Hidden size in feed-forward network | Usually 4 × d_model |
| Vocabulary Size | vocab_size | Number of unique tokens | 32k, 50k, 100k |
| Context Window | max_len | Maximum sequence length | 512, 2048, 4096, 8192 |
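To make the table concrete, here is a minimal sketch of how these six knobs might be grouped into one configuration object; the GPTConfig name and default values are illustrative, not taken from L07:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    """The six primary architecture hyperparameters."""
    d_model: int = 512        # width of token embeddings
    n_layers: int = 6         # depth of the transformer stack
    n_heads: int = 8          # attention heads per layer
    d_ff: int = 2048          # FFN hidden size, usually 4 * d_model
    vocab_size: int = 32_000  # number of unique tokens
    max_len: int = 1024       # context window
```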
Mathematical Constraints¶
Not all combinations are valid. Here are the hard rules:
1. Head Dimension Constraint
assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
d_head = d_model // n_heads  # Head dimension

Why? Multi-head attention splits d_model into n_heads equal chunks. If d_model=512 and n_heads=8, each head gets d_head=64 dimensions.
2. Typical Head Dimension
In practice, d_head is almost always 64, or 128 for very large models:

GPT-2 Small: d_model=768, n_heads=12 → d_head=64
GPT-3 (175B): d_model=12288, n_heads=96 → d_head=128
Llama 2 7B: d_model=4096, n_heads=32 → d_head=128

This means: d_model = n_heads × 64 (or n_heads × 128 for very large models)
3. FFN Expansion Factor
The feed-forward network typically expands by 4×:
d_ff = 4 * d_model

So if d_model=768, then d_ff=3072.
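Putting the three constraints together, here is a minimal sketch for a GPT-2-Small-shaped configuration (the values are just an example):

```python
d_model, n_heads = 768, 12   # example values

assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
d_head = d_model // n_heads  # 64, the typical head dimension
d_ff = 4 * d_model           # 3072, the standard FFN expansion

print(d_head, d_ff)          # 64 3072
```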
Part 2: Real-World Architecture Examples¶
Let’s look at how actual models are configured:
| Model | d_model | n_layers | n_heads | d_head | Parameters |
|---|---|---|---|---|---|
| Tiny (Custom) | 512 | 6 | 8 | 64 | ~40M |
| GPT-2 Small | 768 | 12 | 12 | 64 | 124M |
| GPT-2 Medium | 1024 | 24 | 16 | 64 | 355M |
| GPT-2 Large | 1280 | 36 | 20 | 64 | 774M |
| GPT-2 XL | 1600 | 48 | 25 | 64 | 1.5B |
| GPT-3 Medium | 1024 | 24 | 16 | 64 | 350M |
| Llama 2 7B | 4096 | 32 | 32 | 128 | 7B |
| Llama 2 13B | 5120 | 40 | 40 | 128 | 13B |
| GPT-3 175B | 12288 | 96 | 96 | 128 | 175B |
Key Observations¶
Head dimension stays constant: Almost always 64 or 128
Both depth and width scale: Larger models increase both n_layers and d_model
Heads scale with width: More heads as the model gets wider
Layers increase more slowly: Depth grows from 6 → 96, while width grows from 512 → 12288
Part 3: Counting Parameters¶
Understanding parameter count helps you estimate memory and compute requirements.
Parameter Count Formula¶
For a transformer with:
vocab_size = V
d_model = d
n_layers = L
n_heads = h
max_len = T
The total parameter count is approximately:
Token Embeddings: V × d
Position Embeddings: T × d (if learned, 0 if sinusoidal)
Per Layer (×L):
- Multi-Head Attention:
- Q, K, V projections: 3 × (d × d) = 3d²
- Output projection: d × d = d²
- Total: 4d²
- Feed-Forward Network:
- First linear: d × 4d = 4d²
- Second linear: 4d × d = 4d²
- Total: 8d²
- Layer Norms: 4d (negligible)
Final Output:
- LM Head: V × d (often tied to embedding, so 0)
Total ≈ V×d + T×d + L×(4d² + 8d²) = V×d + T×d + 12Ld²

Dominant term: For large models, 12Ld² dominates.
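This formula translates directly into a short estimator. The sketch below assumes learned position embeddings and a tied LM head, and ignores layer norms and biases; the function name is ours:

```python
def estimate_params(vocab_size, d_model, n_layers, max_len,
                    learned_positions=True, tied_lm_head=True):
    """Approximate parameter count: V*d + T*d + 12*L*d^2."""
    embeddings = vocab_size * d_model                    # token embeddings
    positions = max_len * d_model if learned_positions else 0
    per_layer = 4 * d_model**2 + 8 * d_model**2          # attention + FFN
    lm_head = 0 if tied_lm_head else vocab_size * d_model
    return embeddings + positions + n_layers * per_layer + lm_head
```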
Example: GPT-2 Small¶
V = 50,257
d = 768
L = 12
T = 1024
h = 12
Embeddings: 50,257 × 768 ≈ 38.6M
Positions: 1,024 × 768 ≈ 0.8M
Layers: 12 × 12 × 768² ≈ 85.0M
────────────────────────────────
Total:      ≈ 124.4M parameters

This matches the official GPT-2 Small size!
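Running the estimator sketched above with these numbers reproduces the figure:

```python
n = estimate_params(vocab_size=50_257, d_model=768, n_layers=12, max_len=1024)
print(n / 1e6)  # ≈ 124.3 (million parameters)
```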
Part 4: Design Decision Tree¶
How do you choose these parameters for YOUR use case?
Start with Your Constraint¶
If you have limited compute (e.g., laptop, small GPU):
Start small: d_model=512, n_layers=6, n_heads=8
Parameters: ~40M
Can train on consumer hardware
If you want to fine-tune an existing model:
Use a pretrained model (GPT-2, Llama)
Don’t design from scratch
If you’re training from scratch with good compute:
GPT-2 Small equivalent: d_model=768, n_layers=12, n_heads=12 → 124M
GPT-2 Medium equivalent: d_model=1024, n_layers=24, n_heads=16 → 355M
Scaling Strategy¶
When scaling up, follow these rules:
Scale width and depth together: Don’t just make it wider or deeper
Bad: d_model=4096, n_layers=6 (too shallow)
Bad: d_model=512, n_layers=96 (too narrow)
Good: d_model=1024, n_layers=24

Keep d_head constant at 64 or 128:

When increasing d_model, also increase n_heads proportionally
Example: d_model=1024, n_heads=16 → d_head=64 ✓
Use 4× expansion in FFN:
This is the standard convention:
d_ff = 4 * d_model
Vocabulary size:
32k-50k for most tasks
100k for multilingual models
Context window:
512-1024 for older/smaller models
2048-4096 for modern models
8192+ for long-context applications
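To make these rules concrete, here is a small helper of our own that derives a consistent configuration from a target width and depth, keeping d_head fixed and applying the 4× FFN expansion:

```python
def scale_config(d_model, n_layers, d_head=64):
    """Derive a consistent config from a target width and depth."""
    assert d_model % d_head == 0, "d_model must be a multiple of d_head"
    return {
        "d_model": d_model,
        "n_layers": n_layers,
        "n_heads": d_model // d_head,  # heads scale with width
        "d_ff": 4 * d_model,           # standard 4x FFN expansion
    }

print(scale_config(1024, 24))
# {'d_model': 1024, 'n_layers': 24, 'n_heads': 16, 'd_ff': 4096}
```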
Part 5: Memory and Compute Estimates¶
Memory Requirements¶
Training (most expensive):
Model parameters: 4 × num_params bytes (FP32)
Optimizer states (Adam): 12 × num_params bytes
Gradients: 4 × num_params bytes
Activations: Depends on batch size and sequence length
Total: ~20 × num_params bytes for training
Example: GPT-2 Small (124M params)
Training: ~2.5 GB (just model + optimizer)
Add activations for batch: ~2-4 GB more
Total: ~5-7 GB for training
Inference (much cheaper):
Model: 4 × num_params bytes (FP32) or 2 × num_params bytes (FP16)
Activations: Minimal for a single sequence
Example: GPT-2 Small
Inference: ~500 MB (FP32), ~250 MB (FP16)
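A rough sketch of these byte counts as a helper (the function and its defaults are ours, and it deliberately ignores activations, which depend on batch size and sequence length):

```python
def estimate_memory_gb(num_params, training=True, bytes_per_param=4):
    """Very rough memory estimate in GB, excluding activations."""
    if training:
        # params (4) + gradients (4) + Adam optimizer states (12) ≈ 20 bytes/param
        total_bytes = 20 * num_params
    else:
        total_bytes = bytes_per_param * num_params  # 4 for FP32, 2 for FP16
    return total_bytes / 1e9

print(estimate_memory_gb(124e6))                                     # ≈ 2.5 GB (training)
print(estimate_memory_gb(124e6, training=False, bytes_per_param=2))  # ≈ 0.25 GB (FP16 inference)
```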
Compute Requirements¶
Per token, the forward pass costs roughly 2 FLOPs per parameter (one multiply and one add per weight), so compute scales as ≈ 2 × 12Ld² = 24Ld² FLOPs per token:

Attention (Q, K, V, and output projections): ≈ 2 × 4d² FLOPs per token per layer (the attention-score matmuls add a further seq_len-dependent term, usually small relative to the projections)
Feed-forward network (two linear layers): ≈ 2 × (4d² + 4d²) = 16d² FLOPs per token per layer
Example: Llama 2 7B processing 2048 tokens:

d=4096, L=32
FLOPs ≈ 2 × 12 × 32 × 4096² × 2048 ≈ 26 trillion FLOPs

On an A100 GPU (312 TFLOPS peak):
Time ≈ 26T / 312T ≈ 85 milliseconds (at ideal utilization; real workloads run below peak)
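The same estimate as a few lines of Python (the function name is ours, and it counts only the parameter-dominated matrix multiplies):

```python
def estimate_flops(d_model, n_layers, num_tokens):
    """Forward-pass matmul FLOPs: ~2 FLOPs per parameter per token,
    with parameters dominated by 12 * n_layers * d_model**2."""
    return 2 * 12 * n_layers * d_model**2 * num_tokens

flops = estimate_flops(d_model=4096, n_layers=32, num_tokens=2048)
print(flops / 1e12)           # ≈ 26.4 trillion FLOPs
print(flops / 312e12 * 1000)  # ≈ 85 ms at A100 peak (312 TFLOPS)
```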
Part 6: Quick Reference Guide¶
Choose Your Architecture¶
# Tiny model (for experimentation)
tiny_config = {
"d_model": 512,
"n_layers": 6,
"n_heads": 8,
"vocab_size": 32_000,
"max_len": 1024
}
# Parameters: ~40M
# Small model (GPT-2 Small equivalent)
small_config = {
"d_model": 768,
"n_layers": 12,
"n_heads": 12,
"vocab_size": 50_000,
"max_len": 2048
}
# Parameters: ~124M
# Medium model (GPT-2 Medium equivalent)
medium_config = {
"d_model": 1024,
"n_layers": 24,
"n_heads": 16,
"vocab_size": 50_000,
"max_len": 2048
}
# Parameters: ~355M
# Large model (7B scale)
large_config = {
"d_model": 4096,
"n_layers": 32,
"n_heads": 32,
"vocab_size": 100_000,
"max_len": 4096
}
# Parameters: ~7B

Validation Checklist¶
Before training, verify:
✅ d_model % n_heads == 0
✅ d_head = d_model // n_heads is 64 or 128
✅ d_ff = 4 * d_model (if not specified explicitly)
✅ Memory requirements fit your hardware
✅ Vocabulary size covers your domain
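The mechanical items on this checklist can be wrapped in a small helper that fails fast before you spend compute; this is a sketch of our own, not code from the earlier lessons:

```python
def validate_config(cfg):
    """Sanity-check an architecture config dict before training."""
    d_model, n_heads = cfg["d_model"], cfg["n_heads"]
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"

    d_head = d_model // n_heads
    assert d_head in (64, 128), f"unusual head dimension: {d_head}"

    # Default to the standard 4x expansion if d_ff is not given explicitly.
    d_ff = cfg.get("d_ff", 4 * d_model)
    assert d_ff == 4 * d_model, f"non-standard FFN expansion: {d_ff}"
    return True

validate_config(small_config)  # passes for the GPT-2 Small-style config above
```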
Summary¶
Mathematical Constraints: d_model must be divisible by n_heads, with d_head typically 64 or 128
Scaling: Increase both width (d_model) and depth (n_layers) together
Real-World Patterns: Follow established configurations (GPT-2, Llama) as starting points
Parameter Count: Dominated by 12Ld² for the transformer layers
Memory: Training requires ~20× the parameter count in bytes
Design for Your Use Case: Start small for experimentation, scale up based on performance needs
Next Up: L09 – Training the LLM. Now that you’ve designed your architecture, we’ll learn how to train it with modern optimization techniques, learning rate schedules, and gradient accumulation.