L07 - Assembling the GPT: The Grand Finale [DRAFT]

Stacking the blocks to build a complete Decoder-only Transformer


We have spent the last six blog posts building the individual components of a Transformer, one piece at a time.

Now, we wrap them all into a single class. A GPT is essentially a stack of “Transformer Blocks” followed by a final linear layer that maps our vectors back into the vocabulary space to predict the next word.

By the end of this post, you'll understand how those pieces stack into a complete decoder-only GPT: from integer token IDs, through a stack of Transformer Blocks, to the logits that predict the next token.


Part 1: The Transformer Block

A single block in a GPT model has two main sections:

  1. Communication: The Multi-Head Attention layer where tokens talk to each other.

  2. Computation: A Feed-Forward Network (an MLP!) where each token processes its new information individually.

Each section is wrapped in a Residual Connection and Layer Normalization.
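In code, both sections follow the same pre-norm pattern: normalize, apply the sub-layer, and add the result back onto the input. Here is a minimal, runnable sketch of that wiring, using nn.Identity as a stand-in for the real sub-layers we assemble in Part 4:

import torch
import torch.nn as nn

d_model = 64
ln1, ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
attn = nn.Identity()   # stand-in for the Multi-Head Attention sub-layer
ffn = nn.Identity()    # stand-in for the Feed-Forward Network (next section)

x = torch.randn(2, 8, d_model)   # (batch, seq_len, d_model)
x = x + attn(ln1(x))             # 1. Communication, wrapped in a residual
x = x + ffn(ln2(x))              # 2. Computation, wrapped in a residual
print(x.shape)                   # torch.Size([2, 8, 64])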

The Feed-Forward Network (FFN) - The “Thinking Step”

After tokens gather context from each other via attention, each token needs to process that information independently. This is where the FFN comes in.

Structure:

nn.Sequential(
    nn.Linear(d_model, 4 * d_model),  # Expand
    nn.ReLU(),                         # Non-linearity
    nn.Linear(4 * d_model, d_model),  # Compress
)

Why the 4× expansion?

Widening the hidden layer gives each token extra capacity to mix and transform its features before the result is compressed back down to d_model. The 4× ratio is a convention inherited from the original Transformer (d_ff = 2048 for d_model = 512) and kept in the GPT family.

Intuition: After attention, each token has updated its representation based on context. The FFN is like saying: “Now that you know what’s around you, think deeply about what you’ve learned and update your representation accordingly.”
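To make the expand-then-compress idea concrete, here is a quick shape check with an illustrative d_model of 64. The linear layers only touch the last dimension, so every token is processed independently and comes out the same size it went in:

import torch
import torch.nn as nn

d_model = 64                            # illustrative size
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),    # expand: 64 -> 256
    nn.ReLU(),                          # non-linearity
    nn.Linear(4 * d_model, d_model),    # compress: 256 -> 64
)

x = torch.randn(2, 10, d_model)         # (batch, seq_len, d_model)
print(ffn(x).shape)                     # torch.Size([2, 10, 64]) -- shape preserved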


Part 2: The Complete Architecture - Layer by Layer

If we look at the model from top to bottom, it looks like a factory assembly line. Here’s the complete data flow:

The Full GPT Architecture

Key Components Explained:

  1. Token Embedding: Converts integer IDs into dense vectors

  2. Positional Encoding: Adds position information (either sinusoidal or learned)

  3. N Transformer Blocks: Each block refines the representation (typical N = 12, 24, or 96)

  4. Final LayerNorm: One last normalization before prediction

  5. LM Head: Projects back to vocabulary space to predict next token
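Tracking tensor shapes makes this assembly line easy to follow. Below is a tensor-level walkthrough with small, illustrative sizes (vocab_size = 1000, d_model = 64, an 8-token sequence) and untrained modules standing in for each stage:

import torch
import torch.nn as nn

vocab_size, d_model, max_len = 1000, 64, 32          # illustrative sizes
idx = torch.randint(0, vocab_size, (1, 8))           # integer token IDs   -> (1, 8)

x = nn.Embedding(vocab_size, d_model)(idx)           # 1. token embedding  -> (1, 8, 64)
x = x + torch.zeros(1, max_len, d_model)[:, :8, :]   # 2. positional enc.  -> (1, 8, 64)
# 3. ... N Transformer Blocks, each preserving the shape (1, 8, 64) ...
x = nn.LayerNorm(d_model)(x)                         # 4. final LayerNorm  -> (1, 8, 64)
logits = nn.Linear(d_model, vocab_size)(x)           # 5. LM head          -> (1, 8, 1000)
print(logits.shape)                                  # torch.Size([1, 8, 1000])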


Part 3: Visualizing the “Hidden States”

As data moves through the blocks, each token’s vector changes. We call these Hidden States.

Figure: token hidden states visualized at four successive points in the network.

Understanding the Magnitude Decrease

Notice in the visualization above how the values become smaller (less variance) as we move through deeper layers. This isn’t a bug—it’s an expected and important phenomenon!

Why does magnitude decrease?

  1. Layer Normalization: Each LayerNorm operation (used twice per block) forces the mean to 0 and standard deviation to 1. Across many layers, this has a dampening effect on extreme values.

  2. Residual Connections: While they help with gradient flow, they also mean that changes accumulate gradually rather than dramatically. Each layer adds a small delta to the input.

  3. Attention Smoothing: The softmax in attention creates weighted averages. Averaging tends to reduce extreme values and create smoother distributions.

Is this good or bad?

Good! This is actually desirable: it keeps activations in a stable, predictable range as the network gets deeper, instead of letting them blow up.

The key insight: The final LayerNorm before the LM head rescales these values appropriately before making predictions. The model learns to work within this normalized regime.

What to watch for: If magnitudes approach zero or if all activations look identical, that could indicate a problem (dead neurons, vanishing gradients). But gradual decrease with maintained structure is healthy.
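If you want to run this check yourself, you can record the spread (standard deviation) of the hidden states after each block. A rough sketch, assuming the GPT class defined in Part 4 below (and the MultiHeadAttention module from the earlier post) are available:

import torch

# Illustrative, untrained model -- weights are random.
model = GPT(vocab_size=1000, d_model=64, n_layers=6, n_heads=4, max_len=32)
idx = torch.randint(0, 1000, (1, 16))

with torch.no_grad():
    x = model.token_embedding(idx) + model.pos_embedding[:, :16, :]
    for i, block in enumerate(model.blocks):
        x = block(x)
        print(f"after block {i}: std = {x.std().item():.3f}")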


Part 4: Implementation in PyTorch

Let’s assemble the GPT class. Note how we use nn.ModuleList to stack our blocks.

import torch
import torch.nn as nn

# MultiHeadAttention is the multi-head attention module built in the earlier post.

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Communication (with Residual)
        x = x + self.attn(self.ln1(x))
        # Computation (with Residual)
        x = x + self.ff(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, max_len):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Parameter(torch.zeros(1, max_len, d_model))
        
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads) for _ in range(n_layers)
        ])
        
        self.ln_final = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        b, t = idx.size()
        x = self.token_embedding(idx) + self.pos_embedding[:, :t, :]
        
        for block in self.blocks:
            x = block(x)
            
        x = self.ln_final(x)
        logits = self.lm_head(x) # Scores for every word in the vocab
        return logits
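As a quick smoke test (illustrative hyperparameters; MultiHeadAttention again comes from the earlier post), the forward pass should produce one score per vocabulary entry for every position, and the last position's scores are the model's guess at the next token:

import torch

model = GPT(vocab_size=1000, d_model=64, n_layers=4, n_heads=4, max_len=32)
idx = torch.randint(0, 1000, (2, 16))        # a batch of 2 sequences, 16 tokens each

logits = model(idx)
print(logits.shape)                          # torch.Size([2, 16, 1000])

# With random weights this guess is meaningless -- training comes in L08.
next_token = logits[:, -1, :].argmax(dim=-1) # most likely next token per sequence
print(next_token)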

Summary

  1. The Stack: We build a deep model by repeating the Transformer Block.

  2. Hidden States: Each layer refines the token’s meaning based on context.

  3. The Head: The final layer is just a classifier that asks: “Based on everything I’ve seen, which token comes next?”

Next Up: L08 – Training the Model. We have the machine, but it’s currently “brain dead” with random weights. We’ll learn how to feed it data, compute the loss, and watch it learn to speak.