
LLM From Scratch: Building GPT From First Principles

A complete journey from text to tokens to transformers to chat assistants


Welcome to LLM From Scratch, a hands-on series that demystifies Large Language Models by building a GPT-style transformer from the ground up.

This isn’t just theory—by the end of this series, you’ll understand every component that powers systems like ChatGPT, Claude, and Llama. We’ll write the code, visualize the math, and connect the dots between research papers and real implementations.

What You’ll Build

Over 10 lessons, we'll construct a complete GPT architecture, from raw text and a BPE tokenizer, through a trained decoder-only Transformer, to a fine-tuned chat assistant.

The Series

Foundation: Text to Tensors

| Lesson | Title | What You'll Learn |
| --- | --- | --- |
| L01 | Tokenization From Scratch | Teaching computers to read: build a BPE tokenizer that converts text into token IDs |
| L02 | Embeddings & Positional Encoding | Giving numbers meaning: transform tokens into vectors and add positional information |
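To make the text-to-tensors step concrete, here is a minimal sketch of the L02 idea: token embeddings plus learned positional embeddings. The sizes and token IDs are illustrative placeholders, not the values the lessons actually use.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration; the lessons may choose different values.
vocab_size, d_model, max_len = 50257, 768, 1024

tok_emb = nn.Embedding(vocab_size, d_model)   # token ID -> vector
pos_emb = nn.Embedding(max_len, d_model)      # position index -> vector

# A batch of example token IDs, as produced by a BPE tokenizer (L01).
ids = torch.tensor([[15496, 11, 995]])        # shape (batch=1, seq_len=3)
positions = torch.arange(ids.size(1))         # [0, 1, 2]

x = tok_emb(ids) + pos_emb(positions)         # shape (1, 3, 768)
print(x.shape)
```

Everything downstream of this point operates on tensors like `x`; the tokenizer is the only component that ever sees raw text.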

Core Architecture: The Attention Mechanism

| Lesson | Title | What You'll Learn |
| --- | --- | --- |
| L03 | Self-Attention: The Search Engine of Language | How words talk to each other: implement Query-Key-Value attention and understand parallel processing |
| L04 | Multi-Head Attention | Why eight brains are better than one: split attention into multiple heads for richer representations |
| L05 | Normalization & Residuals | The plumbing of deep networks: stabilize training with LayerNorm and residual connections |
| L06 | The Causal Mask | How to stop cheating: prevent the model from seeing future tokens during training |
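As a preview of L03 and L06, here is a minimal single-head attention sketch with a causal mask. Multi-head attention (L04) and LayerNorm plus residuals (L05) wrap around this core. The function and variable names are illustrative, not the lessons' actual API.

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask.

    x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # queries, keys, values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (batch, seq, seq)

    # Causal mask (L06): position i may only attend to positions <= i.
    seq_len = x.size(1)
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))

    weights = F.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ v                    # weighted sum of values

# Toy usage with made-up dimensions.
d_model, d_head = 8, 8
x = torch.randn(1, 4, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([1, 4, 8])
```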

Assembly and Training

| Lesson | Title | What You'll Learn |
| --- | --- | --- |
| L07 | Assembling the GPT | The grand finale: stack all components into a complete decoder-only Transformer |
| L08 | Training the LLM | Learning to speak: implement the training loop with AdamW, learning rate schedules, and gradient accumulation |
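To give a feel for L08, here is a skeleton of a training loop with AdamW, a cosine learning-rate schedule, and gradient accumulation. `model` and `data_loader` are placeholders, and the hyperparameters are illustrative defaults rather than the lesson's final settings.

```python
import torch
import torch.nn.functional as F

def train(model, data_loader, steps=1000, accum_steps=4, lr=3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps)

    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in zip(range(steps), data_loader):
        logits = model(inputs)                                    # (batch, seq, vocab)
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1))   # next-token prediction
        (loss / accum_steps).backward()                           # accumulate gradients

        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```

Gradient accumulation lets a small GPU simulate a larger batch: gradients from `accum_steps` micro-batches are summed before a single optimizer step.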

Inference and Fine-tuning

| Lesson | Title | What You'll Learn |
| --- | --- | --- |
| L09 | Inference & Sampling | Controlling the creativity: generate text with temperature, top-p sampling, and beam search |
| L10 | Fine-tuning: From Completion to Conversation | Transforming into a chat assistant: apply SFT and RLHF to create helpful, harmless AI assistants |
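As a preview of L09, here is a minimal sketch of temperature plus top-p (nucleus) sampling for a single next-token choice. The function name and default values are illustrative assumptions, not the lesson's exact code.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Sample one token ID from a 1-D logits vector with temperature and top-p."""
    probs = F.softmax(logits / temperature, dim=-1)   # temperature reshapes the distribution

    # Top-p: keep the smallest set of tokens whose cumulative probability reaches top_p.
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p           # the top token is always kept
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    filtered /= filtered.sum()                         # renormalize within the nucleus

    idx = torch.multinomial(filtered, num_samples=1)   # sample from the truncated distribution
    return sorted_ids[idx]

# Toy usage with a 10-token vocabulary.
print(sample_next_token(torch.randn(10)))
```

Lower temperatures and smaller top-p values make generation more deterministic; raising either adds diversity at the cost of coherence.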

Prerequisites

This series assumes:

We’ll explain everything else from scratch, including:

Philosophy

Code First, Math Second: Every concept is implemented in PyTorch before diving into equations.

Visual Intuition: We use diagrams, plots, and animations to build intuition before formalism.

No Magic: We demystify research papers by bridging the gap between "Attention Is All You Need" and "here's how it actually works."

Production-Aware: We explain not just what works, but why certain choices (Pre-Norm vs Post-Norm, AdamW vs SGD) became industry standard.
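To illustrate the Pre-Norm vs Post-Norm choice mentioned above, here is a sketch of the two residual-block orderings around a generic sublayer. The class names and the generic `sublayer` argument are illustrative, not the lessons' actual modules.

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-Norm (used by GPT-2 and most modern LLMs): normalize before the sublayer."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))   # residual wraps norm + sublayer

class PostNormBlock(nn.Module):
    """Post-Norm (original Transformer): normalize after the residual addition."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))
```

The difference looks small, but Pre-Norm keeps an unobstructed residual path from input to output, which is a large part of why it trains more stably in deep stacks.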

How to Use This Series

  1. Sequential Reading: Lessons build on each other—start from L01

  2. Run the Code: Each lesson is an executable Jupyter notebook

  3. Pause and Experiment: Modify parameters, break things, rebuild understanding

  4. Skip What You Know: Familiar with embeddings? Jump ahead (but skim for our specific approach)

Ready to Begin?

Let’s start with the first step in any LLM pipeline: teaching computers to read.

Next: L01 - Tokenization From Scratch →


Who This Series Is For

This series is designed for engineers and researchers who want to understand LLMs deeply enough to modify, debug, and innovate on them.