A complete journey from text to tokens to transformers to chat assistants
Welcome to LLM From Scratch, a hands-on series that demystifies Large Language Models by building a GPT-style transformer from the ground up.
This isn’t just theory—by the end of this series, you’ll understand every component that powers systems like ChatGPT, Claude, and Llama. We’ll write the code, visualize the math, and connect the dots between research papers and real implementations.
## What You’ll Build
Over 10 lessons, we’ll construct a complete GPT architecture:
A Byte Pair Encoding tokenizer that handles any text
Embedding layers that give words geometric meaning
The self-attention mechanism that lets words talk to each other
Multi-head attention for parallel relationship processing
The causal mask that prevents cheating during training (see the short sketch after this list)
A complete decoder-only Transformer (the GPT architecture)
A training pipeline with modern optimizers and learning rate schedules
Inference techniques (temperature, top-p, beam search)
Fine-tuning methods (SFT and RLHF) to create chat assistants
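As a preview of the core component, below is a minimal single-head causal self-attention sketch in PyTorch. It is not the code we build in the series; the function name and the randomly initialized weight matrices are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # project into query/key/value spaces
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot-product similarity
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))       # hide future positions (the causal mask)
    return F.softmax(scores, dim=-1) @ v                   # weighted sum of value vectors

# Toy usage: 4 tokens with 8-dimensional embeddings
d = 8
x = torch.randn(4, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```

The upper-triangular mask is the "no cheating" rule from the list above: each position can only attend to itself and earlier tokens.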
## The Series
### Foundation: Text to Tensors

| Lesson | Title | What You’ll Learn |
|---|---|---|
| L01 | Teaching computers to read | Build a BPE tokenizer that converts text into token IDs |
| L02 | Giving numbers meaning | Transform tokens into vectors and add positional information |
### Core Architecture: The Attention Mechanism

| Lesson | Title | What You’ll Learn |
|---|---|---|
| L03 | How words talk to each other | Implement Query-Key-Value attention and understand parallel processing |
| L04 | Why eight brains are better than one | Split attention into multiple heads for richer representations |
| L05 | The plumbing of deep networks | Stabilize training with LayerNorm and residual connections |
| L06 | How to stop cheating | Prevent the model from seeing future tokens during training |
### Assembly and Training

| Lesson | Title | What You’ll Learn |
|---|---|---|
| L07 | The grand finale | Stack all components into a complete decoder-only Transformer |
| L08 | Learning to speak | Implement the training loop with AdamW, learning rate schedules, and gradient accumulation |
### Inference and Fine-tuning

| Lesson | Title | What You’ll Learn |
|---|---|---|
| L09 | Controlling the creativity | Generate text with temperature, top-p sampling, and beam search (previewed briefly below) |
| L10 | Transforming into a chat assistant | Apply SFT and RLHF to create helpful, harmless AI assistants |
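To give a flavour of what L09 covers, here is a minimal sketch of temperature scaling plus top-p (nucleus) sampling over a vector of logits. It is not the lesson’s implementation; the function name, default values, and vocabulary size are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Sample one token id from raw logits with temperature and nucleus (top-p) filtering."""
    probs = F.softmax(logits / temperature, dim=-1)   # temperature reshapes the distribution
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p          # smallest set of tokens covering top_p mass
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    choice = torch.multinomial(filtered / filtered.sum(), num_samples=1)
    return sorted_ids[choice].item()

logits = torch.randn(50_000)   # stand-in for vocabulary-sized model output
print(sample_next_token(logits))
```

Lowering the temperature sharpens the distribution; lowering top_p restricts sampling to fewer high-probability tokens.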
## Prerequisites
This series assumes:
Python fundamentals (functions, classes, basic numpy)
High school math (algebra, basic calculus concepts)
Neural network basics (what a layer is, forward/backward pass)
New to neural networks? Start with our Neural Networks From Scratch series!
We’ll explain everything else from scratch, including:
Matrix operations and why they matter
Backpropagation and gradient flow
PyTorch fundamentals as we go (see the tiny example after this list)
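As a taste of the PyTorch-as-we-go approach, here is a toy example (names and shapes are arbitrary, not taken from any lesson) of a matrix operation followed by a backward pass that populates gradients:

```python
import torch

# A 2x3 weight matrix whose gradients we want; requires_grad tells autograd to track it
W = torch.randn(2, 3, requires_grad=True)
x = torch.randn(3)

y = W @ x              # matrix-vector product: shape (2,)
loss = y.pow(2).sum()  # a scalar "loss" so there is something to differentiate

loss.backward()        # backpropagation: fills W.grad with d(loss)/dW
print(W.grad.shape)    # torch.Size([2, 3])
```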
## Philosophy
Code First, Math Second: We implement every concept in PyTorch before diving into the equations.
Visual Intuition: We use diagrams, plots, and animations to build intuition before formalism.
No Magic: We demystify research papers by showing the gap between “attention is all you need” and “here’s how it actually works.”
Production-Aware: We explain not just what works, but why certain choices (Pre-Norm vs Post-Norm, AdamW vs SGD) became industry standards.
## How to Use This Series
Sequential Reading: Lessons build on each other—start from L01
Run the Code: Each lesson is an executable Jupyter notebook
Pause and Experiment: Modify parameters, break things, rebuild understanding
Skip What You Know: Familiar with embeddings? Jump ahead (but skim for our specific approach)
## Ready to Begin?
Let’s start with the first step in any LLM pipeline: teaching computers to read.
Next: L01 - Tokenization From Scratch →
## Additional Resources
Research Papers: Each lesson links to relevant papers (Attention is All You Need, GPT-2, etc.)
PyTorch Docs: We reference official documentation for implementation details
Further Reading: Suggested deep dives for those who want more mathematical rigor
This series is designed for engineers and researchers who want to understand LLMs deeply enough to modify, debug, and innovate on them.