# Principle: Volcengine verl FSDP Distributed Training
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, Training_Infrastructure, Deep_Learning |
| Last Updated | 2026-02-07 14:00 GMT |
## Overview
A distributed training strategy that shards model parameters, gradients, and optimizer states across multiple GPUs, enabling training of models that exceed single-GPU memory capacity.
## Description
Fully Sharded Data Parallel (FSDP) training distributes model parameters across GPUs and gathers them on-demand for computation. This is the primary training backend in verl for both SFT and RL training.
FSDP works by:
- Sharding: Model parameters are split across all GPUs in the training group
- All-gather: Before forward/backward pass, parameters are gathered to each GPU
- Reduce-scatter: After backward pass, gradients are reduced and re-sharded
- Optimizer step: Each GPU updates only its shard of parameters
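The shard / all-gather / reduce-scatter / local-update cycle above can be sketched in a single-process toy simulation (4 hypothetical "ranks", plain NumPy arrays standing in for the `torch.distributed` collectives that real FSDP uses):

```python
import numpy as np

world_size = 4
params = np.arange(8.0)  # full parameter vector of the model

# Sharding: each rank owns a contiguous slice of the parameters
shards = np.split(params, world_size)

# All-gather: every rank reconstructs the full parameters for forward/backward
gathered = np.concatenate(shards)
assert np.array_equal(gathered, params)

# Each rank computes a gradient on its own data (toy constant gradients here)
local_grads = [np.full_like(params, rank + 1.0) for rank in range(world_size)]

# Reduce-scatter: sum gradients across ranks (reduce), each rank keeps
# only the slice matching its parameter shard (scatter)
summed = np.sum(local_grads, axis=0)
grad_shards = np.split(summed, world_size)

# Optimizer step: each rank updates only its own shard (plain SGD here)
lr = 0.1
shards = [p - lr * g for p, g in zip(shards, grad_shards)]
```

This is only an illustration of the communication pattern; it ignores mixed precision, flattened parameter groups, and overlap of communication with compute.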
verl's SFT trainer (`FSDPSFTTrainer`) implements a complete training loop on top of FSDP, supporting:
- Mixed-precision training (bf16/fp16)
- Gradient accumulation across micro-batches
- Gradient clipping for training stability
- Cosine or WSD (warmup-stable-decay) learning-rate scheduling
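The cosine schedule mentioned above can be sketched as follows (a minimal sketch with linear warmup; `warmup_steps`, `total_steps`, and `base_lr` are illustrative values, not verl defaults):

```python
import math

def cosine_lr(step, base_lr=1e-5, warmup_steps=100, total_steps=1000):
    # Linear warmup from 0 to base_lr over the first warmup_steps
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

A WSD schedule differs by holding the rate constant after warmup and only decaying near the end of training.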
Usage
FSDP distributed training is used whenever a model requires multiple GPUs for memory capacity or throughput. In verl:
- SFT training always uses FSDP via `torchrun`
- RL training uses FSDP for actor/critic training steps (while rollout uses vLLM/SGLang)
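A multi-GPU SFT run is typically launched with `torchrun`; a sketch of such a launch command (the data path, model name, and override values below are illustrative assumptions, not verl defaults):

```shell
# Illustrative single-node, 8-GPU SFT launch; paths and overrides are examples
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  -m verl.trainer.fsdp_sft_trainer \
  data.train_files=$HOME/data/train.parquet \
  model.partial_pretrain=Qwen/Qwen2.5-7B-Instruct
```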
Theoretical Basis
FSDP implements a fully sharded data parallel strategy (equivalent to DeepSpeed ZeRO Stage 3):
```python
# Abstract FSDP training loop (pseudocode)
for batch in dataloader:
    # Gradient accumulation over micro-batches
    for micro_batch in split(batch, micro_batch_size):
        loss = model.forward(micro_batch) / accumulation_steps
        loss.backward()
    # Gradient clipping for training stability
    clip_grad_norm_(model.parameters(), max_norm=clip_grad)
    # Optimizer step: each rank updates its local parameter shard
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
```
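Dividing each micro-batch loss by `accumulation_steps`, as in the loop above, makes the accumulated gradient equal the gradient of the full-batch mean loss. A quick numeric check on a toy scalar model (hypothetical numbers, loss = mean squared error of `w * x` against `y`):

```python
import numpy as np

w = 2.0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 9.0])

def grad(w, xb, yb):
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    return np.mean(2.0 * (w * xb - yb) * xb)

# Full-batch gradient
full = grad(w, x, y)

# Accumulated gradient over equal-size micro-batches, each scaled by
# 1/accumulation_steps as in the training loop
accumulation_steps = 2
accum = 0.0
for xb, yb in zip(np.split(x, accumulation_steps),
                  np.split(y, accumulation_steps)):
    accum += grad(w, xb, yb) / accumulation_steps

assert np.isclose(accum, full)
```

This equivalence holds exactly when the micro-batches are equal-sized; with ragged micro-batches the scaling must weight by micro-batch size instead.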