Training models too large for a single GPU
A 70B-parameter model needs 280 GB just to store its weights in FP32, but consumer GPUs top out at 24 GB. Model parallelism splits the model across multiple GPUs. This lesson covers the main strategies: data, pipeline, and tensor parallelism, plus the ZeRO optimizer.
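A quick check on that headline number (FP32 stores each parameter in 4 bytes):
num_params = 70e9
print(num_params * 4 / 1e9, "GB")   # 280.0 GB for the weights alone, before gradients and optimizer states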
By the end of this post, you’ll understand:
Data parallelism (simplest, for small models)
Pipeline parallelism (split layers across GPUs)
Tensor parallelism (split individual layers)
ZeRO optimizer (memory-efficient training)
Part 1: Data Parallelism¶
The Simplest Strategy¶
Idea: Replicate the entire model on each GPU, process different batches.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# GPT, config, and dataset are assumed to be defined elsewhere

# Initialize the process group (one process per GPU)
dist.init_process_group(backend='nccl')
rank = dist.get_rank()
world_size = dist.get_world_size()

# Create the model on this GPU and wrap it in DDP
model = GPT(config).to(rank)
model = DDP(model, device_ids=[rank])
optimizer = torch.optim.AdamW(model.parameters())

# Each GPU gets a different shard of the data
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, sampler=sampler)

for batch in loader:
    optimizer.zero_grad()
    # Forward pass (local)
    loss = model(batch)
    # Backward pass; DDP all-reduces (averages) gradients across GPUs here
    loss.backward()
    # Every GPU applies the same averaged gradient
    optimizer.step()

How it works:
Each GPU processes a different mini-batch
After the backward pass, gradients are averaged across all GPUs (a manual sketch of this all-reduce follows below)
All GPUs update with the same averaged gradient
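Conceptually, the averaging in step 2 is an all-reduce over every parameter's gradient. Here is a minimal manual sketch of what DDP automates (DDP actually overlaps this communication with the backward pass; the average_gradients helper is just for illustration):

import torch.distributed as dist

def average_gradients(model):
    # Sum each gradient across all processes, then divide to get the mean
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size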
Limitations:
Model must fit on ONE GPU
Doesn’t help for 70B models!
Part 2: Pipeline Parallelism¶
Split Layers Across GPUs¶
Idea: Put different layers on different GPUs, pass activations between them.
GPU 0: Layers 0-7 (Embedding + first 8 blocks)
GPU 1: Layers 8-15 (Middle 8 blocks)
GPU 2: Layers 16-23 (Last 8 blocks)
GPU 3: Final head (LM head)

Naive Pipeline (Sequential)¶
import torch.nn as nn

class PipelineParallelGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Split layers across 4 GPUs
        self.embed = nn.Embedding(config.vocab_size, config.d_model).to('cuda:0')
        self.blocks_0_7 = nn.ModuleList([
            TransformerBlock(config) for _ in range(8)
        ]).to('cuda:0')
        self.blocks_8_15 = nn.ModuleList([
            TransformerBlock(config) for _ in range(8)
        ]).to('cuda:1')
        self.blocks_16_23 = nn.ModuleList([
            TransformerBlock(config) for _ in range(8)
        ]).to('cuda:2')
        self.lm_head = nn.Linear(config.d_model, config.vocab_size).to('cuda:3')

    def forward(self, input_ids):
        # GPU 0: embedding + first 8 blocks
        x = self.embed(input_ids.to('cuda:0'))
        for block in self.blocks_0_7:
            x = block(x)
        # GPU 1: middle 8 blocks
        x = x.to('cuda:1')
        for block in self.blocks_8_15:
            x = block(x)
        # GPU 2: last 8 blocks
        x = x.to('cuda:2')
        for block in self.blocks_16_23:
            x = block(x)
        # GPU 3: LM head
        x = x.to('cuda:3')
        logits = self.lm_head(x)
        return logits

Problem: GPU utilization is terrible!
Time:    0      1      2      3
GPU 0: [████]
GPU 1:        [████]
GPU 2:               [████]
GPU 3:                      [████]

Utilization: 25%! (3 GPUs idle at any time)

GPipe: Pipeline with Micro-Batches¶
Solution: Split batch into micro-batches, keep GPUs busy.
# Split the batch into 4 micro-batches
batch_size = 32
micro_batch_size = 8
num_micro_batches = batch_size // micro_batch_size

for micro_batch_id in range(num_micro_batches):
    start = micro_batch_id * micro_batch_size
    end = start + micro_batch_size
    micro_batch = input_ids[start:end]
    # A pipeline runtime overlaps these calls: GPU 0 starts micro-batch i+1
    # while later stages are still working on micro-batch i
    logits = model(micro_batch)

Timeline with micro-batches:
Time:    0   1   2   3   4   5   6
GPU 0: [m0][m1][m2][m3]
GPU 1:     [m0][m1][m2][m3]
GPU 2:         [m0][m1][m2][m3]
GPU 3:             [m0][m1][m2][m3]

Utilization: ~57% with 4 micro-batches (16 busy slots out of 28), and it climbs toward 100% as you add more micro-batches; see the helper below.

Library: Use torch.distributed.pipeline.sync.Pipe
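To make the bubble math concrete, here is a tiny helper (not from any library, just the schedule arithmetic) for an ideal pipeline with p stages and m micro-batches:

def pipeline_utilization(num_stages: int, num_micro_batches: int) -> float:
    # Each stage does useful work for `num_micro_batches` slots out of
    # `num_micro_batches + num_stages - 1` total slots in the schedule.
    return num_micro_batches / (num_micro_batches + num_stages - 1)

print(pipeline_utilization(4, 1))    # 0.25  -> the naive pipeline above
print(pipeline_utilization(4, 4))    # ~0.57 -> the 4-micro-batch timeline
print(pipeline_utilization(4, 32))   # ~0.91 -> more micro-batches shrink the bubble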
Part 3: Tensor Parallelism¶
Split Individual Layers¶
Instead of splitting layers, split the weights within each layer.
Example: Split feed-forward layer across 2 GPUs
import torch.nn as nn
import torch.nn.functional as F

# Standard FF (4096 → 16384 → 4096)
class FeedForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4096, 16384)  # 67M params
        self.fc2 = nn.Linear(16384, 4096)  # 67M params

# Tensor Parallel FF across 2 GPUs
class TensorParallelFF(nn.Module):
    def __init__(self):
        super().__init__()
        # Split fc1 along the output dimension ("column parallel")
        self.fc1_gpu0 = nn.Linear(4096, 8192).to('cuda:0')  # 33M params
        self.fc1_gpu1 = nn.Linear(4096, 8192).to('cuda:1')  # 33M params
        # Split fc2 along the input dimension ("row parallel")
        self.fc2_gpu0 = nn.Linear(8192, 4096).to('cuda:0')  # 33M params
        self.fc2_gpu1 = nn.Linear(8192, 4096).to('cuda:1')  # 33M params

    def forward(self, x):
        # Each GPU computes half of fc1's outputs
        x0 = F.gelu(self.fc1_gpu0(x.to('cuda:0')))
        x1 = F.gelu(self.fc1_gpu1(x.to('cuda:1')))
        # Each GPU computes a partial fc2 output
        y0 = self.fc2_gpu0(x0)
        y1 = self.fc2_gpu1(x1)
        # Sum the partial results (the all-reduce in a multi-process setup)
        y = y0 + y1.to('cuda:0')
        return y

Key insight: Matrix multiplication is easy to split!
Each GPU works on its own partition; the partial fc2 outputs are then summed with an all-reduce to recover the full result (the quick check below verifies this).
Library: Use torch.distributed.tensor.parallel
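A quick way to convince yourself of the key insight, on a single device with small made-up dimensions (no GPUs or process groups needed): splitting the first weight matrix by columns and the second by rows, then summing the partial outputs, reproduces the unsplit computation.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x  = torch.randn(2, 256,    dtype=torch.float64)   # toy dimensions for the check
W1 = torch.randn(256, 1024, dtype=torch.float64)   # "fc1" weight
W2 = torch.randn(1024, 256, dtype=torch.float64)   # "fc2" weight

full = F.gelu(x @ W1) @ W2                          # unsplit computation

# Column-split W1 and row-split W2 into two shards (one per "GPU")
W1a, W1b = W1[:, :512], W1[:, 512:]
W2a, W2b = W2[:512, :], W2[512:, :]
partial = F.gelu(x @ W1a) @ W2a + F.gelu(x @ W1b) @ W2b

print(torch.allclose(full, partial))                # True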
Part 4: ZeRO Optimizer¶
The Memory Breakdown (7B model)¶
| Component | Memory |
|---|---|
| Model weights | 28 GB |
| Gradients | 28 GB |
| Optimizer states | 56 GB |
| Activations | 10 GB |
| Total | 122 GB |
Observation: Optimizer states (Adam’s momentum and variance) dominate!
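These numbers follow a simple rule of thumb for FP32 training with Adam: 4 bytes per weight, 4 per gradient, and 8 per parameter of optimizer state (momentum + variance), plus whatever the activations need. A rough estimator (the 10 GB activation figure is just the example value from the table):

def fp32_adam_training_memory_gb(num_params: float, activations_gb: float = 10.0) -> dict:
    weights   = num_params * 4 / 1e9   # FP32 weights: 4 bytes each
    gradients = num_params * 4 / 1e9   # FP32 gradients: 4 bytes each
    optimizer = num_params * 8 / 1e9   # Adam momentum + variance: 4 + 4 bytes each
    total = weights + gradients + optimizer + activations_gb
    return {"weights": weights, "gradients": gradients,
            "optimizer": optimizer, "activations": activations_gb, "total": total}

print(fp32_adam_training_memory_gb(7e9))
# {'weights': 28.0, 'gradients': 28.0, 'optimizer': 56.0, 'activations': 10.0, 'total': 122.0}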
ZeRO: Zero Redundancy Optimizer¶
Idea: Shard optimizer states across GPUs.
ZeRO Stage 1: Partition optimizer states
# DeepSpeed takes its ZeRO settings as a plain config dict
# (passed to deepspeed.initialize, as shown in Part 6)
config = {
    "zero_optimization": {
        "stage": 1,  # Shard optimizer states only
    }
}
# Each GPU stores 1/N of the optimizer states:
# 56 GB / 8 GPUs = 7 GB per GPU

ZeRO Stage 2: Also partition gradients
config = {
    "zero_optimization": {
        "stage": 2,  # Shard optimizer states + gradients
    }
}
# (56 GB optimizer + 28 GB gradients) / 8 GPUs = 10.5 GB per GPU

ZeRO Stage 3: Also partition model weights
config = {
    "zero_optimization": {
        "stage": 3,  # Shard optimizer states, gradients, and weights
    }
}
# (56 + 28 + 28) GB of model states / 8 GPUs = 14 GB per GPU (activations are extra)

Trade-off: More communication between GPUs, but enables training huge models.
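To see what each stage buys you on the 7B example, here is a rough per-GPU total under the same FP32 + Adam assumptions as the table above. The figures in the stage comments count only the sharded portion; this helper also adds the parts that stay replicated (activations remain per-GPU, since ZeRO shards only model states):

def zero_per_gpu_gb(num_params: float, num_gpus: int, stage: int,
                    activations_gb: float = 10.0) -> float:
    weights   = num_params * 4 / 1e9
    gradients = num_params * 4 / 1e9
    optimizer = num_params * 8 / 1e9
    if stage >= 1: optimizer /= num_gpus   # Stage 1: shard optimizer states
    if stage >= 2: gradients /= num_gpus   # Stage 2: also shard gradients
    if stage >= 3: weights   /= num_gpus   # Stage 3: also shard weights
    return weights + gradients + optimizer + activations_gb

for stage in (0, 1, 2, 3):
    print(f"Stage {stage}: {zero_per_gpu_gb(7e9, 8, stage):.1f} GB per GPU")
# Stage 0: 122.0, Stage 1: 73.0, Stage 2: 48.5, Stage 3: 24.0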
Part 5: Choosing the Right Strategy¶
| Model Size | Strategy | Why |
|---|---|---|
| < 1B | Data Parallelism | Fits on one GPU, just need speed |
| 1B - 10B | Pipeline Parallelism | Too big for one GPU, moderate communication |
| 10B - 100B | Tensor Parallelism + ZeRO | Need to split layers AND optimizer |
| 100B+ | All of the above (3D parallelism) | Maximum splitting |
Example (GPT-3 175B; the GPU count is sanity-checked right after this list):
64 GPUs total
8-way data parallelism (8 replicas)
4-way pipeline parallelism (4 stages)
2-way tensor parallelism (2 GPUs per layer)
ZeRO Stage 2
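The degrees of the three parallelism dimensions multiply to give the total GPU count:

data_parallel     = 8   # replicas
pipeline_parallel = 4   # pipeline stages
tensor_parallel   = 2   # GPUs per layer
assert data_parallel * pipeline_parallel * tensor_parallel == 64  # total GPUs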
Part 6: Practical Example with DeepSpeed¶
import deepspeed
# DeepSpeed config
ds_config = {
    "train_batch_size": 256,
    "gradient_accumulation_steps": 8,
    "optimizer": {
        "type": "AdamW",          # DeepSpeed builds (and shards) this optimizer
        "params": {"lr": 3e-4}    # example learning rate
    },
    "zero_optimization": {
        "stage": 2,               # ZeRO Stage 2
        "offload_optimizer": {
            "device": "cpu",      # Offload optimizer states to CPU for even more memory
            "pin_memory": True
        }
    },
    "fp16": {
        "enabled": True,
        "loss_scale": 0,          # 0 means dynamic loss scaling
        "initial_scale_power": 16
    }
}
# Initialize DeepSpeed (model_parameters tells it which parameters to optimize)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)

# Training loop (same shape as before!)
for batch in train_loader:
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()

What DeepSpeed handles:
Sharding parameters across GPUs
Communication between GPUs
CPU offloading
Mixed precision
Gradient checkpointing
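One detail to keep in mind with the config above: DeepSpeed expects train_batch_size = micro-batch per GPU × gradient_accumulation_steps × number of GPUs. Assuming an 8-GPU run (the GPU count is not fixed by the snippet above), that works out to:

train_batch_size = 256
gradient_accumulation_steps = 8
num_gpus = 8   # assumption for this example

micro_batch_per_gpu = train_batch_size // (gradient_accumulation_steps * num_gpus)
print(micro_batch_per_gpu)   # 4 sequences per GPU per forward/backward pass
assert micro_batch_per_gpu * gradient_accumulation_steps * num_gpus == train_batch_size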
Summary¶
Data Parallelism: Replicate model, process different batches (small models)
Pipeline Parallelism: Split layers across GPUs (medium models)
Tensor Parallelism: Split individual layer weights (large models)
ZeRO: Shard optimizer states to save memory (critical for huge models)
DeepSpeed: Production library that handles everything
Next Up: L18 – Long Context Handling. How models handle 100k+ token contexts!