Principle:Volcengine Verl FSDP Distributed Training

From Leeroopedia


Knowledge Sources
Domains Distributed_Systems, Training_Infrastructure, Deep_Learning
Last Updated 2026-02-07 14:00 GMT

Overview

A distributed training strategy that shards model parameters, gradients, and optimizer states across multiple GPUs, enabling training of models that exceed single-GPU memory capacity.

Description

Fully Sharded Data Parallel (FSDP) training distributes model parameters across GPUs and gathers them on-demand for computation. This is the primary training backend in verl for both SFT and RL training.

FSDP works by:

  1. Sharding: Model parameters are split across all GPUs in the training group
  2. All-gather: Before forward/backward pass, parameters are gathered to each GPU
  3. Reduce-scatter: After backward pass, gradients are reduced and re-sharded
  4. Optimizer step: Each GPU updates only its shard of parameters
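The communication pattern in steps 1–3 can be sketched as a toy simulation on plain Python lists. This is not real FSDP code: `shard`, `all_gather`, and `reduce_scatter` here are illustrative stand-ins for the actual NCCL collectives.

```python
# Toy simulation of FSDP's collectives on Python lists (not real FSDP code).

def shard(values, world_size):
    # Step 1: split a flat parameter list into one shard per GPU.
    n = len(values) // world_size
    return [values[i * n:(i + 1) * n] for i in range(world_size)]

def all_gather(shards):
    # Step 2: every rank reconstructs the full parameter list for compute.
    full = [v for s in shards for v in s]
    return [list(full) for _ in shards]

def reduce_scatter(grads_per_rank, world_size):
    # Step 3: sum gradients across ranks, then re-shard the result so each
    # rank keeps only the gradient slice matching its parameter shard.
    summed = [sum(g) for g in zip(*grads_per_rank)]
    return shard(summed, world_size)
```

After `reduce_scatter`, each rank holds exactly the gradient shard it needs for step 4, so the optimizer state never has to be replicated.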

verl's SFT trainer (FSDPSFTTrainer) implements a complete training loop with FSDP, supporting:

  • Mixed-precision training (bf16/fp16)
  • Gradient accumulation across micro-batches
  • Gradient clipping for training stability
  • Cosine or WSD learning rate scheduling
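The WSD (Warmup-Stable-Decay) schedule mentioned above can be sketched as a plain function of the step count. The warmup and decay fractions here are illustrative defaults, not verl's actual configuration values.

```python
def wsd_lr(step, total_steps, base_lr, warmup_frac=0.1, decay_frac=0.1):
    """Warmup-Stable-Decay learning rate sketch.

    The warmup/decay fractions are illustrative assumptions, not verl's
    defaults: linear warmup, a constant plateau, then linear decay to zero.
    """
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_steps = max(int(total_steps * decay_frac), 1)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # linear warmup from 0 to base_lr
        return base_lr * step / warmup_steps
    if step < decay_start:
        # stable phase: hold base_lr constant
        return base_lr
    # linear decay from base_lr down to 0 over the final decay_steps
    return base_lr * (total_steps - step) / decay_steps
```

Unlike a cosine schedule, WSD's flat plateau makes it easy to resume or extend training before the decay phase begins.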

Usage

FSDP distributed training is used whenever a model's memory footprint or throughput requirements exceed what a single GPU can provide. In verl:

  • SFT training always uses FSDP via torchrun
  • RL training uses FSDP for actor/critic training steps (while rollout uses vLLM/SGLang)
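A single-node SFT launch via torchrun looks roughly like the following. The module path and config keys follow verl's documentation, but the dataset paths and model name are placeholders; verify everything against your installed verl version.

```shell
# Launch verl SFT with FSDP across 8 GPUs on one node (illustrative sketch).
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
    -m verl.trainer.fsdp_sft_trainer \
    data.train_files=$HOME/data/train.parquet \
    data.val_files=$HOME/data/val.parquet \
    model.partial_pretrain=Qwen/Qwen2.5-7B-Instruct \
    trainer.total_epochs=1
```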

Theoretical Basis

FSDP implements a sharded data-parallel strategy; its full-sharding mode is equivalent to ZeRO stage 3:

# Abstract FSDP training loop (the parameter all-gather and gradient
# reduce-scatter happen inside the FSDP-wrapped model's forward/backward)
for batch in dataloader:
    # Gradient accumulation: each micro-batch contributes a scaled loss
    micro_batches = split(batch, micro_batch_size)
    for micro_batch in micro_batches:
        loss = model(micro_batch) / len(micro_batches)
        loss.backward()
    # Gradient clipping for stability (FSDP provides a sharding-aware
    # clip_grad_norm_ that computes the global norm across shards)
    clip_grad_norm_(model.parameters(), max_norm=clip_grad)
    # Optimizer step: each GPU updates only its own parameter shard
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
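The memory benefit of full sharding can be estimated with back-of-the-envelope arithmetic. The byte counts below are assumptions for a typical mixed-precision Adam setup, and activation memory is deliberately excluded.

```python
def model_state_gib(num_params, world_size, bytes_per_param=16):
    """Per-GPU GiB for params + grads + optimizer states under full sharding.

    bytes_per_param=16 assumes bf16 params (2) + bf16 grads (2) + fp32
    master params (4) + fp32 Adam moments (4 + 4). Under ZeRO-3-style
    sharding all of these are divided by world_size; activations are not
    included. Illustrative estimate only.
    """
    return num_params * bytes_per_param / world_size / 2**30
```

For example, a 7B-parameter model that would need over 100 GiB of model states unsharded drops to a per-GPU share on the order of a dozen GiB across 8 GPUs, which is what makes single-node training of such models feasible.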

Related Pages

Implemented By

Heuristics Used
