Principle:Hiyouga LLaMA Factory Distributed Training

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Distributed Systems, Deep Learning
Last Updated	2026-02-06 19:00 GMT

Overview

Distributed training in LLaMA-Factory encompasses multiple parallelism strategies -- including FSDP, FSDP2, DeepSpeed, and data parallelism -- orchestrated through a unified accelerator interface and launcher system that automatically handles multi-GPU and multi-node execution.

Description

Training large language models often exceeds the memory capacity of a single GPU, requiring the model's parameters, gradients, and optimizer states to be distributed across multiple devices. LLaMA-Factory supports several distributed training paradigms:

Fully Sharded Data Parallelism (FSDP/FSDP2): Shards model parameters, gradients, and optimizer states across data-parallel workers. FSDP2 (PyTorch native) uses fully_shard with configurable mixed-precision policies, CPU offloading, and per-layer sharding granularity.
DeepSpeed ZeRO: A family of memory optimization stages (ZeRO-1 through ZeRO-3) that progressively shard optimizer states, gradients, and parameters across workers.
Data Parallelism (DDP): Replicates the model on each device and synchronizes gradients after each backward pass.
Elastic Training: Support for fault-tolerant distributed training with configurable rendezvous backends, max restarts, and elastic node counts.

The architecture is organized into several layers:

Launcher layer: Both the v0 (launcher.py) and v1 (v1/launcher.py) launchers detect the distributed environment, determine the number of available devices, and automatically invoke torchrun with appropriate arguments for multi-GPU training. They support multi-node configurations via NNODES, NODE_RANK, MASTER_ADDR, and MASTER_PORT environment variables.

Accelerator interface layer (v1): The DistributedInterface is a singleton that initializes process groups, constructs device meshes for both model parallelism and data parallelism, and provides collective communication primitives (all_gather, all_reduce, broadcast, barrier). It supports multiple device types (CUDA, NPU, XPU, MPS).

Helper layer (v1): The helper module provides low-level utilities for querying rank, world size, device type, and executing collective operations on tensor-like data (tensors, numpy arrays, scalars).

FSDP2 engine: The FSDP2Engine handles model preparation by identifying transformer layer classes via _no_split_modules, applying per-layer fully_shard with mixed-precision policies, enabling gradient checkpointing, and loading weights from either HuggingFace checkpoints or distributed checkpoints (DCP).

Usage

Distributed training is used when:

The model does not fit in a single GPU's memory (use FSDP2 or DeepSpeed ZeRO-3).
Training needs to be scaled across multiple GPUs for throughput (use DDP or FSDP).
Multi-node clusters are available (configure NNODES, MASTER_ADDR, MASTER_PORT).
Fault tolerance is required (set RDZV_ID and MAX_RESTARTS for elastic training).
NPU or XPU hardware is used (the accelerator interface auto-detects and uses appropriate backends: NCCL for CUDA, HCCL for NPU, Gloo for CPU).

Theoretical Basis

Distributed training strategies differ in how they partition model state across $N$ workers:

Strategy	Parameters	Gradients	Optimizer States	Communication
DDP	Replicated	All-reduce	Replicated	$O (P)$ per step
ZeRO-1	Replicated	All-reduce	Sharded	$O (P)$ per step
ZeRO-2	Replicated	Sharded	Sharded	$O (P)$ per step
ZeRO-3 / FSDP	Sharded	Sharded	Sharded	$O (P)$ per step + all-gather

where $P$ is the total parameter count.

FSDP2 wraps each transformer layer with fully_shard, which:

Shards parameters across the data-parallel group using DTensor.
All-gathers the full parameters before each forward pass.
Optionally reshards parameters after the forward pass to free memory.
Reduce-scatters gradients during the backward pass.

The memory per GPU is approximately:

$Memory per GPU \approx \frac{P + G + O}{N} + A$

where $P$ is parameters, $G$ is gradients, $O$ is optimizer states, $N$ is the number of GPUs, and $A$ is activation memory (which benefits from gradient checkpointing).

The mixed-precision policy in FSDP2 maintains parameters in reduced precision (bf16/fp16) while accumulating gradient reductions in fp32:

MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    cast_forward_inputs=True,
)

The device mesh abstraction organizes GPUs into multi-dimensional grids supporting both model parallelism and data parallelism:

# Model parallel mesh: (replicate, shard)
model_device_mesh = init_device_mesh(
    device_type="cuda",
    mesh_shape=(mp_replicate_size, mp_shard_size),
    mesh_dim_names=("mp_replicate", "mp_shard"),
)
# Data parallel mesh: (dp, cp)
data_device_mesh = init_device_mesh(
    device_type="cuda",
    mesh_shape=(dp_size, cp_size),
    mesh_dim_names=("dp", "cp"),
)

This enables flexible composition of parallelism dimensions (e.g., combining FSDP across 4 GPUs with 2-way context parallelism).

The launcher uses torchrun for process management and configures DDP optimizations:

env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
env["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

These settings improve memory allocation efficiency and reduce NCCL memory overhead in multi-GPU training.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment