
L09 - Inference & Sampling: Controlling the Creativity [DRAFT]

How to talk to a trained brain without it repeating itself


We have a trained GPT model from L08 - Training the LLM. If we give it a prompt, the model outputs a probability for every token in its vocabulary: its guess at what comes next.

This post covers the “magic” of how an LLM actually generates text. Training is about building a probability map; Inference is about walking through that map. We will also explore the “knobs” we turn to make the model more creative or more factual.

But how do we choose the “best” word? If we always pick the most likely word (Greedy Search), the model often gets stuck in a loop, repeating the same phrase over and over. To make the model sound human, we need to introduce a bit of randomness.

By the end of this post, you’ll understand:

  - How the autoregressive loop turns one prediction at a time into whole passages of text
  - What temperature does to the probability distribution before we sample
  - How Top-K and Top-P (nucleus) sampling prune away unlikely words
  - How to implement the generation loop in PyTorch
  - How max_new_tokens, repetition penalties, and beam search give finer control over the output

Part 1: The Autoregressive Loop

Generating text is a loop. The model predicts one token, we append that token to the input, and we feed the new, longer sequence back into the model to get the next token.

  1. Prompt: “The cat”

  2. Model predicts: “sat” (Prob: 0.8)

  3. New Input: “The cat sat”

  4. Model predicts: “on” (Prob: 0.7)

  5. Repeat until a special “End of Sentence” token is generated.
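
In code, the loop looks roughly like this (a minimal sketch; predict_next_token is a hypothetical stand-in for the trained model, and "<eos>" stands in for the end-of-sentence token):

def generate_text(prompt, max_steps=20):
    tokens = prompt.split()                      # e.g. ["The", "cat"]
    for _ in range(max_steps):
        next_token = predict_next_token(tokens)  # hypothetical: returns "sat", then "on", ...
        if next_token == "<eos>":                # the special "End of Sentence" token
            break
        tokens.append(next_token)                # the output becomes part of the next input
    return " ".join(tokens)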


Part 2: Temperature (The “Creativity” Knob)

Before we pick a word, we can “stretch” or “squash” the probability distribution using Temperature (T).

We divide the raw scores (logits) by T before the Softmax:

$$p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$
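
Dividing by T < 1 sharpens the distribution (the most likely word dominates even more), while T > 1 flattens it so unlikely words get a better chance; T = 1 leaves it unchanged. Here is a minimal sketch with made-up logits:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])        # toy raw scores for three words

for T in [0.5, 1.0, 2.0]:
    probs = F.softmax(logits / T, dim=-1)     # divide by T, then softmax
    print(f"T={T}: {[round(p, 2) for p in probs.tolist()]}")

Lower T concentrates almost all of the probability on the top word; higher T spreads it out, which is exactly the "creativity" trade-off.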

Part 3: Top-K & Top-P Sampling

Even with temperature, the model can still occasionally sample a word from the far tail of the distribution, one that is simply wrong in context, because every word keeps some nonzero probability. To prevent this, we filter the distribution before sampling:

Top-K Sampling

We only look at the top K most likely words and ignore everything else. This keeps the model from “veering off the rails.”
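
With toy numbers, the masking trick looks like this (the same trick appears in step 4 of the Part 4 code below; the logit values here are made up for illustration):

import torch

logits = torch.tensor([[4.0, 3.0, 2.5, 0.5, -1.0, -3.0]])  # toy scores for six words
k = 3

v, _ = torch.topk(logits, k)                  # values of the k highest logits
logits[logits < v[:, [-1]]] = -float('Inf')   # mask everything below the k-th best
probs = torch.softmax(logits, dim=-1)         # masked words now get probability 0
print(probs)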

Top-P (Nucleus) Sampling

Instead of a fixed number of words, we pick the smallest set of words whose cumulative probability adds up to P (e.g., 0.9). This is more dynamic; if the model is very sure, it might only look at 2 words. If it’s unsure, it might look at 20.

Concrete Example:

Suppose after applying temperature=1.0 to our logits and running softmax, we get these probabilities:

| Token  | Probability | Cumulative Probability |
|--------|-------------|------------------------|
| “the”  | 0.40        | 0.40                   |
| “a”    | 0.30        | 0.70                   |
| “this” | 0.20        | 0.90 ← Cutoff at p=0.9 |
| “that” | 0.05        | 0.95                   |
| “my”   | 0.03        | 0.98                   |
| “your” | 0.02        | 1.00                   |

With top_p = 0.9:

  1. Sort tokens by probability (already sorted above)

  2. Add probabilities until we reach 0.9: “the” (0.40) + “a” (0.30) + “this” (0.20) = 0.90

  3. Keep only these 3 tokens: ["the", "a", "this"]

  4. Renormalize: divide each by the sum (0.90) to get a proper probability distribution

  5. Sample from this smaller set

Result: The model can only choose from “the”, “a”, or “this”—cutting off the unlikely tokens “that”, “my”, and “your.”
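
Here is the same calculation in plain Python, using the probabilities from the table (the small tolerance in the comparison just guards against floating-point rounding):

import random

tokens = ["the", "a", "this", "that", "my", "your"]   # already sorted by probability
probs  = [0.40, 0.30, 0.20, 0.05, 0.03, 0.02]
top_p  = 0.9

kept_tokens, kept_probs, running_total = [], [], 0.0
for tok, p in zip(tokens, probs):
    kept_tokens.append(tok)
    kept_probs.append(p)
    running_total += p
    if running_total >= top_p - 1e-9:      # the nucleus now covers top_p
        break

total = sum(kept_probs)
kept_probs = [p / total for p in kept_probs]           # renormalize to sum to 1
print(kept_tokens, [round(p, 3) for p in kept_probs])  # ['the', 'a', 'this'] [0.444, 0.333, 0.222]
print(random.choices(kept_tokens, weights=kept_probs, k=1)[0])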

Why it’s better than top-k: top-k always keeps a fixed number of words, no matter how the probability is spread out. Top-p adapts to the distribution: when the model is confident the nucleus shrinks to a handful of words, and when it is uncertain the nucleus widens.

Visualizing Top-P Filtering: (figure not rendered here; it plots the token probabilities from the table above against the p = 0.9 cutoff)

Key insight: this adaptive filtering keeps quality high while allowing flexibility when the model is uncertain.


Part 4: The Inference Code

Here is how we implement the generation loop in PyTorch, including a simple temperature adjustment.

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    for _ in range(max_new_tokens):
        # 1. Crop idx to the last 'block_size' tokens (the model's context length from the earlier posts)
        idx_cond = idx[:, -block_size:]

        # 2. Forward pass to get logits
        logits = model(idx_cond)

        # 3. Focus only on the last time step and scale by temperature
        logits = logits[:, -1, :] / temperature

        # 4. Optional: Top-K filtering
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')

        # 5. Softmax to get probabilities
        probs = F.softmax(logits, dim=-1)

        # 6. Sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1)

        # 7. Append to the sequence
        idx = torch.cat((idx, idx_next), dim=1)

    return idx
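
A sketch of calling it, assuming the encode/decode helpers, the trained model, and block_size from the earlier posts in this series:

prompt = torch.tensor([encode("The cat")], dtype=torch.long)
out = generate(model, prompt, max_new_tokens=20, temperature=0.8, top_k=40)
print(decode(out[0].tolist()))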

Part 5: Advanced Sampling Techniques

max_new_tokens - Controlling Generation Length

The max_new_tokens parameter determines how many tokens the model will generate:

generated = generate(model, prompt, max_new_tokens=50)

What it does: it caps how many times the generation loop from Part 4 runs, so a single call appends at most max_new_tokens new tokens to the prompt.

How models know when to stop naturally: during training the model learns a special end-of-sequence token (the “End of Sentence” token from Part 1); once that token is sampled, generation can stop early even if the max_new_tokens budget is not used up, as sketched below.

Typical values: they depend on the task, from a few dozen tokens for short completions to several hundred for longer answers; either way, max_new_tokens is the safety cap that prevents runaway generation.
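
A minimal sketch of that early-stop check, assuming a hypothetical eos_token_id for the tokenizer; it would sit inside the Part 4 loop right after the sampling step:

        # ... after idx_next = torch.multinomial(probs, num_samples=1)
        if idx_next.item() == eos_token_id:   # hypothetical end-of-sequence id
            break                             # stop early instead of using the full budget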

Repetition Penalty - Preventing Loops

A common problem: models get stuck repeating the same phrase:

"The cat sat on the mat. The cat sat on the mat. The cat sat on..."

Solution: Repetition Penalty

def apply_repetition_penalty(logits, previous_tokens, penalty=1.2):
    for token in set(previous_tokens):
        # Push the token toward "less likely": divide positive logits by the
        # penalty, multiply negative ones (dividing a negative logit by a
        # number > 1 would actually make the token MORE likely).
        if logits[token] > 0:
            logits[token] /= penalty
        else:
            logits[token] *= penalty
    return logits

How it works: every token that has already appeared gets its logit pushed down before the softmax, so the model becomes less likely to pick it again; a penalty of 1.0 leaves the logits untouched.

Typical values: just above 1.0 (the 1.2 default above is a common choice); push it much higher and the model starts avoiding words it legitimately needs to repeat, such as a character’s name.
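
A toy demo of the function above (in the Part 4 loop it would be applied to the last-step logits, after the temperature scaling and before the softmax):

import torch

# Tokens 2 and 3 have already been generated, so both get penalized:
# token 2 has a positive logit (divided), token 3 a negative one (multiplied).
logits = torch.tensor([1.0, 0.5, 2.0, -0.5])
previous_tokens = [2, 2, 3]

penalized = apply_repetition_penalty(logits.clone(), previous_tokens, penalty=1.2)
print(penalized)   # token 2: 2.0 -> ~1.67, token 3: -0.5 -> -0.6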

Beam Search - Deterministic Exploration

All the sampling methods above are stochastic (random). Beam Search is a deterministic alternative:

How it works:

  1. Instead of sampling 1 token, keep the top beam_width candidates

  2. For each candidate, score its top beam_width possible next tokens

  3. Evaluate all beam_width² possibilities

  4. Keep only the top beam_width sequences by total probability

  5. Repeat until done

Example with beam_width=2:

Start: "The"
Step 1: Keep ["The cat" (prob=0.8), "The dog" (prob=0.7)]
Step 2: Expand both → ["The cat sat" (0.64), "The cat ran" (0.56),
                       "The dog sat" (0.49), "The dog ran" (0.42)]
        Keep top 2 → ["The cat sat", "The cat ran"]
... continue ...
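
A minimal sketch in PyTorch, reusing the same model and block_size globals as the Part 4 code (real implementations add length normalization and stop early on the end-of-sequence token):

@torch.no_grad()
def beam_search(model, idx, max_new_tokens, beam_width=2):
    # Each beam is a (sequence, cumulative log-probability) pair; batch size 1.
    beams = [(idx, 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            logits = model(seq[:, -block_size:])              # (1, T, vocab_size)
            log_probs = F.log_softmax(logits[:, -1, :], dim=-1)
            top_lp, top_ix = torch.topk(log_probs, beam_width)
            for lp, ix in zip(top_lp[0], top_ix[0]):
                new_seq = torch.cat((seq, ix.view(1, 1)), dim=1)
                candidates.append((new_seq, score + lp.item()))
        # Keep only the top beam_width sequences by total log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]   # the highest-scoring sequence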

Beam Search vs. Sampling:

| Aspect      | Beam Search                        | Sampling (Top-P/Top-K) |
|-------------|------------------------------------|------------------------|
| Determinism | Always same output                 | Different every time   |
| Quality     | Finds high-probability sequences   | More diverse, creative |
| Use cases   | Translation, summarization         | Creative writing, chat |
| Speed       | Slower (beam_width parallel paths) | Faster (single path)   |

When to use: reach for beam search when there is roughly one right answer and consistency matters (translation, summarization); reach for sampling when you want variety and a natural voice (creative writing, chat).


Summary

  1. Inference is a loop where the model’s output becomes its next input.

  2. Temperature controls how much the model deviates from its most likely guess (sharpness of distribution).

  3. Sampling Strategies (Top-K/Top-P) prune the “long tail” of unlikely words to maintain coherence.

  4. max_new_tokens controls generation length and prevents runaway generation.

  5. Repetition Penalty prevents the model from getting stuck in loops by penalizing already-used tokens.

  6. Beam Search offers a deterministic alternative to sampling, finding high-probability sequences for tasks requiring consistency.

Next Up: L10 – Fine-tuning (RLHF & Chat). We have a model that can complete sentences. But how do we turn it into a helpful assistant that answers questions? We’ll look at the final step: taking a “Base” model and turning it into a “Chat” model.