
IN01 - KV Cache Deep Dive: Memory Optimization for LLM Inference [DRAFT]

The critical optimization that makes LLM inference practical — and the memory tradeoff that comes with it.

In this post, we’ll cover:

- Why naive autoregressive decoding repeats the same work every step
- A toy walkthrough of tokens, Q/K/V, and per-layer KV caches
- What the KV cache actually saves, and what it does not fix
- The memory cost at scale: fragmentation and cache disaggregation

No PhD required — just curiosity about how production LLMs actually work.

The Problem: Autoregressive Generation Is Expensive

LLMs generate text one token at a time. Each new token depends on all previous tokens (prompt + already-generated tokens).

If you do the “naive” thing, every time you generate a token you recompute the K and V projections for every token already in the context, repeating the same work over and over.

A concrete toy example (visual): prompt tokens → keys/values (and what “all layers” means)

Let’s make this tangible with a tiny, made-up transformer. The exact token splits depend on the tokenizer, but the mechanics are the same.

Step 0 — From raw text to tokens (what “BPE” means)

Most LLMs don’t operate on words directly; they operate on tokens produced by a tokenizer. A common tokenizer family is BPE (Byte Pair Encoding): it starts from bytes/characters and repeatedly merges the most frequent adjacent pairs to form a vocabulary of “subword” units.

So you might see splits like:

- “I love Rust!” → [“I”, “ love”, “ Rust”, “!”] (4 tokens; this is the example we’ll use below)

Sometimes you’ll get surprises (also illustrative), like splitting uncommon words:

- “tokenization” → [“token”, “ization”]
- “unbelievably” → [“un”, “believ”, “ably”]

The key point: the model sees a sequence of token IDs, not words.
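If you want to see real splits on your machine, here’s a tiny script (assumes the tiktoken package is installed; the exact IDs and pieces depend entirely on which vocabulary you load, so treat the output as illustrative):

```python
# Illustrative only: exact token IDs and splits depend on the tokenizer you load.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # a classic BPE vocabulary

text = "I love Rust!"
token_ids = enc.encode(text)                       # a short list of integers
pieces = [enc.decode([tid]) for tid in token_ids]  # the subword each ID maps back to

print(token_ids)  # the model only ever sees these IDs
print(pieces)     # human-readable subword pieces, e.g. ['I', ' love', ...]
```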

Step 1 — A tiny transformer (2 layers, 2 heads)

We’ll use a toy setup:

- 2 transformer layers
- 2 attention heads per layer, each with d_head = 2
- A 4-token prompt: “I love Rust!” → t0, t1, t2, t3

At each layer ℓ, the model holds hidden states for every token:

- h0, h1, h2, h3: one vector per token, updated as the sequence flows through the layers

Then it computes projections:

- Q = H · W_Q (queries)
- K = H · W_K (keys)
- V = H · W_V (values)

where H stacks the hidden states at that layer and each W is a learned weight matrix.

With multi-head attention, think of K and V as stored per head:

- K[layer][head] has shape [tokens_so_far, d_head]
- V[layer][head] has the same shape

Here’s what “all layers” means visually: each layer has its own K/V cache.
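To make “each layer has its own K/V cache” concrete, here’s a minimal NumPy sketch of the layout, using the toy config above (2 layers, 2 heads, d_head = 2). The growing-list approach and the append_token helper are illustrative; real engines preallocate or page this memory:

```python
import numpy as np

# Toy config matching the example above (real models are far larger)
num_layers, num_heads, d_head = 2, 2, 2

# One K cache and one V cache per layer, each shaped
# [num_heads, tokens_so_far, d_head]; the middle axis grows as we decode.
kv_cache = [
    {"K": np.zeros((num_heads, 0, d_head)), "V": np.zeros((num_heads, 0, d_head))}
    for _ in range(num_layers)
]

def append_token(layer: int, k_new: np.ndarray, v_new: np.ndarray) -> None:
    """Append one token's per-head K/V (each shaped [num_heads, 1, d_head])."""
    kv_cache[layer]["K"] = np.concatenate([kv_cache[layer]["K"], k_new], axis=1)
    kv_cache[layer]["V"] = np.concatenate([kv_cache[layer]["V"], v_new], axis=1)

# After prefilling a 4-token prompt, every layer's cache holds 4 positions.
for layer in range(num_layers):
    for _ in range(4):
        append_token(layer,
                     np.random.randn(num_heads, 1, d_head),
                     np.random.randn(num_heads, 1, d_head))

print(kv_cache[0]["K"].shape)  # (2, 4, 2): heads x cached tokens x head dim
```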

Step 2 — What happens when you generate the next token?

When generating the next token (call it t4):

  1. Compute q, k, v for t4 only (one token’s worth of projections at each layer).

  2. Append t4’s k and v to each layer’s K/V cache.

  3. Score t4’s query against all cached keys (t0..t4) to get attention weights.

  4. Take the weighted sum of the cached values and pass the result to the next layer.

Step 3 — A tiny “numbers” example (one head, one layer)

To make attention feel less abstract, here’s a toy single-head example at one layer.

Assume d_head = 2 and we already cached keys/values for the 4 prompt tokens:

K cache (4 tokens × 2 dims)

| token | k = [k1, k2] |
| --- | --- |
| t0 (“I”) | [2, 0] |
| t1 (“ love”) | [1, 0] |
| t2 (“ Rust”) | [0, 0] |
| t3 (“!”) | [-1, 0] |

V cache (4 tokens × 2 dims)

| token | v = [v1, v2] |
| --- | --- |
| t0 | [10, 0] |
| t1 | [5, 0] |
| t2 | [1, 0] |
| t3 | [0, 0] |

Now we’re generating the next token t4. The model computes a new query for the current position; say it comes out to q4 = [1, 0] (the value is made up, like everything else in this toy example).

Attention scores are dot products (scaled by sqrt(d_head) in real models; we’ll ignore scaling here for simplicity):

- score(t0) = [1, 0] · [2, 0] = 2
- score(t1) = [1, 0] · [1, 0] = 1
- score(t2) = [1, 0] · [0, 0] = 0
- score(t3) = [1, 0] · [-1, 0] = -1

Softmax turns these into weights (rounded):

- w ≈ [0.64, 0.24, 0.09, 0.03] for t0..t3

Finally, the attention output is a weighted sum of values:

- output ≈ 0.64·[10, 0] + 0.24·[5, 0] + 0.09·[1, 0] + 0.03·[0, 0] ≈ [7.7, 0]
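If you want to check those numbers yourself, here’s a minimal NumPy sketch of this single decode step (the query q4 = [1, 0] is the made-up value from above, and scaling by sqrt(d_head) is omitted to match the text):

```python
import numpy as np

# Cached keys/values for the 4 prompt tokens (from the tables above)
K_cache = np.array([[2, 0], [1, 0], [0, 0], [-1, 0]], dtype=float)
V_cache = np.array([[10, 0], [5, 0], [1, 0], [0, 0]], dtype=float)

# Toy query for the new position t4
q4 = np.array([1, 0], dtype=float)

scores = K_cache @ q4                            # dot products: [2, 1, 0, -1]
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: ~[0.64, 0.24, 0.09, 0.03]
output = weights @ V_cache                       # weighted sum: ~[7.7, 0]

print(scores, weights.round(2), output.round(1))
```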

Two important takeaways:

  1. The K/V for t0..t3 were reused — we didn’t recompute them during decode.

  2. The new query still compares against all cached keys — long context still costs.

If you want an intuition anchor, compare what the decode loop has to do in the two cases:

Without KV Cache (naive decode loop): at every step, recompute K and V for every token in the context, run attention, and sample the next token.

With KV Cache (practical decode loop): at every step, compute q, k, v only for the newest token, append k and v to the cache, attend against the cached keys/values, and sample.

Prefill vs Decode: The Two-Phase Mental Model

It helps to split inference into two phases:

- Prefill: process the whole prompt in one pass, computing K/V for every prompt token and writing them into the cache.
- Decode: generate tokens one at a time, with each new token reusing the cached K/V and appending its own.

KV cache helps most during decode, because decode is where repeated work would otherwise explode.

Transformer Attention: Q, K, V in One Page

Attention is easiest to understand if you treat Q/K/V as roles:

- Q (query): what the current token is looking for
- K (key): the “label” each token offers to be matched against
- V (value): the content each token contributes if its key matches

The attention formula is typically written as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$

You don’t need to memorize it. The important bit is: the new token’s query is scored against every key, softmax turns those scores into weights, and the output is a weighted average of the values.

Here’s the flow: hidden state → project into Q, K, V → score Q against all K → softmax → weighted sum of V → attention output for this position.
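As a reference point, here’s a minimal single-head NumPy implementation of that formula (no masking, no batching, just the core math; the function names are ours):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # [n_queries, n_keys] similarity matrix
    return softmax(scores, axis=-1) @ V

# During decode, Q is a single row (the new token's query),
# while K and V hold one row per cached context token.
q_new = np.random.randn(1, 64)
K = np.random.randn(10, 64)
V = np.random.randn(10, 64)
print(attention(q_new, K, V).shape)  # (1, 64)
```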

The KV Cache Optimization

The Revelation

During decode, for previous tokens:

- Their hidden states at each layer never change once computed (causal attention means later tokens cannot affect earlier positions).
- Therefore their K and V projections never change either.

So recomputing K/V for them every decode step is pure waste.

Pseudocode: without vs with cache

Without KV cache (wasteful):

```python
# At each decode step: recompute Q, K, V for the entire context from scratch
Q, K, V = compute_QKV_for_all_tokens(context_tokens)  # recompute everything
y = attention(Q, K, V)
next_token = sample(y)
context_tokens.append(next_token)  # the context grows by one every step
```

With KV cache (efficient projections):

```python
# Prefill once:
K_cache, V_cache = compute_KV_for_prompt(prompt_tokens)
last_token = prompt_tokens[-1]

# Decode loop:
for t in range(num_new_tokens):
    q_new, k_new, v_new = compute_QKV_for_new_token(last_token)
    K_cache.append(k_new)
    V_cache.append(v_new)

    y = attention(q_new, K_cache, V_cache)
    next_token = sample(y)
    last_token = next_token  # feed the freshly sampled token into the next step
```

What you saved: recomputing K/V projections for all previous tokens at every step.

What KV Cache Does Not Fix

KV cache removes repeated K/V projection work, but it does not remove the need to do attention against the context.

During decode, the model still needs to compute something equivalent to attention(q_new, K_cache, V_cache): one dot product of the new query against every cached key, a softmax over n scores, and a weighted sum over n cached values, at every layer.

So even with KV cache, decode attention work still grows with context length.

Practical takeaway:
KV cache is necessary — but long context still costs you (compute + memory).

Speedup Analysis (Accurate Version)

Let:

- n = current context length in tokens (prompt + tokens generated so far)
- L = number of transformer layers
- d = model hidden size

A simplified breakdown per decode step:

| Component (per step) | Without KV Cache | With KV Cache |
| --- | --- | --- |
| K/V projections for previous tokens | $O(n \cdot L \cdot d)$ | 0 (reused) |
| K/V projections for the new token | $O(L \cdot d)$ | $O(L \cdot d)$ |
| Attention vs context (new token attends to n tokens) | $O(n \cdot L \cdot d)$ | $O(n \cdot L \cdot d)$ |
| Total per step | big constant × $O(nLd)$ + redundant work | smaller constant × $O(nLd)$ |

So why is KV cache such a big deal?

Because without KV cache, the projection work for old tokens repeats every step and your total work across a response can grow roughly quadratically with the number of generated tokens.

With KV cache, projection work is closer to linear in generated tokens (plus the attention term, which remains).
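A quick back-of-the-envelope script makes the difference concrete. It counts only token-level K/V projections (ignoring the shared attention term), and the prompt/output lengths are made up:

```python
# Count token-level K/V projections performed by each strategy,
# ignoring the attention term that both strategies still pay.
prompt_len = 1000
new_tokens = 500

# Without cache: every decode step re-projects the entire current context.
without_cache = sum(prompt_len + t for t in range(1, new_tokens + 1))

# With cache: the prompt is projected once (prefill), then one token per step.
with_cache = prompt_len + new_tokens

print(without_cache)               # 625250 token projections (roughly quadratic growth)
print(with_cache)                  # 1500 token projections (linear)
print(without_cache / with_cache)  # ~417x less projection work with the cache
```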

The Memory Tradeoff: Cache Size Explodes at Scale

KV cache stores K and V tensors for each token, for each layer.

A very common back-of-the-envelope estimate:

$$
\text{KV Memory} \approx 2 \times L \times n \times d \times p
$$

Where:

- 2 accounts for storing both K and V
- L = number of layers
- n = number of tokens held in the cache (context length)
- d = model hidden size (the per-token width of K or V)
- p = bytes per value (e.g., 2 for FP16/BF16)

Every one of those factors multiplies the total, which is why the number grows so quickly.

Real-World Example: Llama 2 70B (illustrative)

Assume:

- 80 layers (L = 80)
- 4096-token context (n = 4096)
- Hidden size 8192 (d = 8192)
- FP16 values (p = 2 bytes)
- K/V stored at full hidden width (illustrative; models with grouped-query attention store less)

$$
\text{Memory} = 2 \times 80 \times 4096 \times 8192 \times 2 \text{ bytes} \approx 10.7 \text{ GB (decimal)} \approx 10.0 \text{ GiB}
$$

That’s per request for KV cache.

Now multiply by concurrency.
Batch 32: ~$10.7 \times 32 \approx 342$ GB of KV cache memory.

And this is separate from model weights (e.g., ~140 GB for a 70B model in FP16).
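Here’s a small calculator for the same estimate (the kv_cache_bytes helper is ours, and it assumes the simple full-width formula above, so treat it as an upper-bound illustration for models that use grouped-query attention):

```python
def kv_cache_bytes(num_layers: int, context_len: int, d_model: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: 2 (K and V) x layers x tokens x hidden dim x precision."""
    return 2 * num_layers * context_len * d_model * bytes_per_value

# Illustrative numbers from the text: FP16, full-width K/V
per_request = kv_cache_bytes(num_layers=80, context_len=4096, d_model=8192)

print(per_request / 1e9)       # ~10.7 GB per request (decimal)
print(per_request / 2**30)     # ~10.0 GiB per request
print(32 * per_request / 1e9)  # ~344 GB for a batch of 32 (the ~342 above rounds per-request first)
```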

Why Memory Fragmentation Shows Up

Even if your GPU has “enough free memory” in total, allocations can fail if free space is chopped into non-contiguous holes.

This becomes more likely with:

- Variable-length requests (prompts and outputs of very different sizes)
- Requests finishing at different times, leaving differently sized holes behind
- Long-running servers where allocations and frees interleave for hours
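A toy illustration of the effect (the slot counts and allocation ranges are invented, but the arithmetic is the point):

```python
# Plenty of total free space, yet no single contiguous hole is big enough.
POOL = 100  # pretend the allocator manages 100 equal-sized slots

# Live allocations left behind after several variable-length requests
# came and went; each tuple is a [start, end) range of occupied slots.
live = [(0, 10), (25, 40), (55, 70), (85, 100)]

free_slots = POOL - sum(end - start for start, end in live)

# Measure the free "holes" between live allocations.
holes = []
prev_end = 0
for start, end in sorted(live):
    if start > prev_end:
        holes.append(start - prev_end)
    prev_end = end
if prev_end < POOL:
    holes.append(POOL - prev_end)

print(free_slots)  # 45 slots free in total...
print(max(holes))  # ...but the largest contiguous hole is only 15 slots,
                   # so a 20-slot allocation fails despite 45 slots being "free".
```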

Modern Solution: Cache Disaggregation

If KV cache dominates memory, one idea is to avoid keeping all KV cache in GPU memory.

Instead:

- Keep the model weights and the actively used (“hot”) KV blocks in GPU memory.
- Spill the rest of the KV cache to a larger, slower tier (CPU RAM, local NVMe, or a shared remote pool).
- Fetch KV blocks back onto the GPU when their sequence becomes active again.

Why this helps:

- GPU memory stops being the hard ceiling on how much context you can keep cached.
- Longer contexts and higher concurrency fit without buying a bigger GPU.
- Cached prefixes can potentially be kept around and reused across requests.

Tradeoffs:

- Moving KV between tiers costs bandwidth and adds latency.
- The serving stack gets more complex (placement, eviction, and prefetch policies).
- For low-latency single-user workloads, the extra indirection can hurt.

Note: Different vendors and open-source stacks approach disaggregated caching differently. The important idea is the architecture pattern (separating weights + active KV on GPU from a larger KV pool elsewhere), not any single implementation.
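To make the pattern concrete, here’s a toy, in-process sketch: a small “GPU” tier that holds only the most recently used sequences’ KV tensors, and a larger “host” tier that everything else spills into. The TwoTierKVCache class, its eviction policy, and all the sizes are invented for illustration; it is not modeled on any particular engine:

```python
import numpy as np
from collections import OrderedDict

class TwoTierKVCache:
    """Toy sketch: keep at most `gpu_slots` sequences 'on GPU',
    spill the least recently used ones to a larger 'host' tier."""

    def __init__(self, gpu_slots: int):
        self.gpu_slots = gpu_slots
        self.gpu = OrderedDict()  # seq_id -> (K, V), most recently used last
        self.host = {}            # seq_id -> (K, V), the overflow pool

    def put(self, seq_id: str, K: np.ndarray, V: np.ndarray) -> None:
        self.gpu[seq_id] = (K, V)
        self.gpu.move_to_end(seq_id)
        while len(self.gpu) > self.gpu_slots:      # spill the coldest sequence
            cold_id, kv = self.gpu.popitem(last=False)
            self.host[cold_id] = kv                # "copy to the host/remote tier"

    def get(self, seq_id: str):
        if seq_id in self.gpu:                     # hot path: already on GPU
            self.gpu.move_to_end(seq_id)
            return self.gpu[seq_id]
        K, V = self.host.pop(seq_id)               # slow path: fetch it back
        self.put(seq_id, K, V)                     # (this may spill another sequence)
        return K, V

# Three sequences, but only two fit "on GPU" at once.
cache = TwoTierKVCache(gpu_slots=2)
for seq in ("req-a", "req-b", "req-c"):
    cache.put(seq, np.zeros((80, 16, 128)), np.zeros((80, 16, 128)))

print(list(cache.gpu), list(cache.host))  # ['req-b', 'req-c'] ['req-a']
cache.get("req-a")                        # fetching req-a evicts req-b
print(list(cache.gpu), list(cache.host))  # ['req-c', 'req-a'] ['req-b']
```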

When Disaggregation Is Worth It

| Scenario | Traditional KV Cache | Disaggregated Cache |
| --- | --- | --- |
| Short contexts (<2K) | ✅ simplest + fast | ⚠ overhead not worth it |
| Long contexts (>8K) | ❌ GPU memory bottleneck | ✅ scales beyond GPU memory |
| High concurrency | ❌ hits memory ceiling fast | ✅ higher headroom |
| Shared prefixes (some workloads) | ⚠ limited reuse | ✅ potential reuse wins |
| Low-latency single-user | ✅ best latency | ⚠ extra indirection |

Practical Engineering Checklist

If you’re building or evaluating an LLM serving stack, ask:

  1. What is our max context and typical prompt length?

  2. What is our target concurrency (p95/p99), not just average?

  3. Are we memory-bound on KV cache or compute-bound on decode?

  4. Can we reduce KV footprint (precision/quantization choices) without harming quality?

  5. Do we need techniques like paging / working-set KV for long context?

  6. Are we measuring prefill vs decode separately (latency + throughput)?

  7. Do we have guardrails for fragmentation and variable-length spikes?

Summary

- Autoregressive decoding repeats K/V projection work for tokens whose results never change; the KV cache stores them once and reuses them.
- The cache does not remove attention against the context, so long contexts still cost compute and memory.
- KV memory scales roughly as 2 × layers × tokens × hidden size × bytes per value, and it multiplies with concurrency, which is where capacity and fragmentation problems appear.
- When the cache outgrows GPU memory, paging and disaggregation (a small hot tier on GPU, a larger pool elsewhere) are the main levers.

If you want a follow-up post, the natural sequel is:

IN02 — KV Cache Meets Serving: Prefill/Decode, Continuous Batching, and Where Latency Actually Goes