Principle:Hiyouga LLaMA Factory Distributed Training
| Knowledge Sources | |
|---|---|
| Domains | Distributed Systems, Deep Learning |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Distributed training in LLaMA-Factory encompasses multiple parallelism strategies -- including FSDP, FSDP2, DeepSpeed, and data parallelism -- orchestrated through a unified accelerator interface and launcher system that automatically handles multi-GPU and multi-node execution.
Description
Training large language models often exceeds the memory capacity of a single GPU, requiring the model's parameters, gradients, and optimizer states to be distributed across multiple devices. LLaMA-Factory supports several distributed training paradigms:
- Fully Sharded Data Parallelism (FSDP/FSDP2): Shards model parameters, gradients, and optimizer states across data-parallel workers. FSDP2 (PyTorch native) uses
fully_shardwith configurable mixed-precision policies, CPU offloading, and per-layer sharding granularity. - DeepSpeed ZeRO: A family of memory optimization stages (ZeRO-1 through ZeRO-3) that progressively shard optimizer states, gradients, and parameters across workers.
- Data Parallelism (DDP): Replicates the model on each device and synchronizes gradients after each backward pass.
- Elastic Training: Support for fault-tolerant distributed training with configurable rendezvous backends, max restarts, and elastic node counts.
The architecture is organized into several layers:
Launcher layer: Both the v0 (launcher.py) and v1 (v1/launcher.py) launchers detect the distributed environment, determine the number of available devices, and automatically invoke torchrun with appropriate arguments for multi-GPU training. They support multi-node configurations via NNODES, NODE_RANK, MASTER_ADDR, and MASTER_PORT environment variables.
Accelerator interface layer (v1): The DistributedInterface is a singleton that initializes process groups, constructs device meshes for both model parallelism and data parallelism, and provides collective communication primitives (all_gather, all_reduce, broadcast, barrier). It supports multiple device types (CUDA, NPU, XPU, MPS).
Helper layer (v1): The helper module provides low-level utilities for querying rank, world size, device type, and executing collective operations on tensor-like data (tensors, numpy arrays, scalars).
FSDP2 engine: The FSDP2Engine handles model preparation by identifying transformer layer classes via _no_split_modules, applying per-layer fully_shard with mixed-precision policies, enabling gradient checkpointing, and loading weights from either HuggingFace checkpoints or distributed checkpoints (DCP).
Usage
Distributed training is used when:
- The model does not fit in a single GPU's memory (use FSDP2 or DeepSpeed ZeRO-3).
- Training needs to be scaled across multiple GPUs for throughput (use DDP or FSDP).
- Multi-node clusters are available (configure
NNODES,MASTER_ADDR,MASTER_PORT). - Fault tolerance is required (set
RDZV_IDandMAX_RESTARTSfor elastic training). - NPU or XPU hardware is used (the accelerator interface auto-detects and uses appropriate backends: NCCL for CUDA, HCCL for NPU, Gloo for CPU).
Theoretical Basis
Distributed training strategies differ in how they partition model state across workers:
| Strategy | Parameters | Gradients | Optimizer States | Communication |
|---|---|---|---|---|
| DDP | Replicated | All-reduce | Replicated | per step |
| ZeRO-1 | Replicated | All-reduce | Sharded | per step |
| ZeRO-2 | Replicated | Sharded | Sharded | per step |
| ZeRO-3 / FSDP | Sharded | Sharded | Sharded | per step + all-gather |
where is the total parameter count.
FSDP2 wraps each transformer layer with fully_shard, which:
- Shards parameters across the data-parallel group using DTensor.
- All-gathers the full parameters before each forward pass.
- Optionally reshards parameters after the forward pass to free memory.
- Reduce-scatters gradients during the backward pass.
The memory per GPU is approximately:
where is parameters, is gradients, is optimizer states, is the number of GPUs, and is activation memory (which benefits from gradient checkpointing).
The mixed-precision policy in FSDP2 maintains parameters in reduced precision (bf16/fp16) while accumulating gradient reductions in fp32:
MixedPrecisionPolicy(
param_dtype=torch.bfloat16,
reduce_dtype=torch.float32,
cast_forward_inputs=True,
)
The device mesh abstraction organizes GPUs into multi-dimensional grids supporting both model parallelism and data parallelism:
# Model parallel mesh: (replicate, shard)
model_device_mesh = init_device_mesh(
device_type="cuda",
mesh_shape=(mp_replicate_size, mp_shard_size),
mesh_dim_names=("mp_replicate", "mp_shard"),
)
# Data parallel mesh: (dp, cp)
data_device_mesh = init_device_mesh(
device_type="cuda",
mesh_shape=(dp_size, cp_size),
mesh_dim_names=("dp", "cp"),
)
This enables flexible composition of parallelism dimensions (e.g., combining FSDP across 4 GPUs with 2-way context parallelism).
The launcher uses torchrun for process management and configures DDP optimizations:
env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
env["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"
These settings improve memory allocation efficiency and reduce NCCL memory overhead in multi-GPU training.