Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Hiyouga LLaMA Factory Distributed Training

From Leeroopedia


Knowledge Sources
Domains Distributed Systems, Deep Learning
Last Updated 2026-02-06 19:00 GMT

Overview

Distributed training in LLaMA-Factory encompasses multiple parallelism strategies -- including FSDP, FSDP2, DeepSpeed, and data parallelism -- orchestrated through a unified accelerator interface and launcher system that automatically handles multi-GPU and multi-node execution.

Description

Training large language models often exceeds the memory capacity of a single GPU, requiring the model's parameters, gradients, and optimizer states to be distributed across multiple devices. LLaMA-Factory supports several distributed training paradigms:

  • Fully Sharded Data Parallelism (FSDP/FSDP2): Shards model parameters, gradients, and optimizer states across data-parallel workers. FSDP2 (PyTorch native) uses fully_shard with configurable mixed-precision policies, CPU offloading, and per-layer sharding granularity.
  • DeepSpeed ZeRO: A family of memory optimization stages (ZeRO-1 through ZeRO-3) that progressively shard optimizer states, gradients, and parameters across workers.
  • Data Parallelism (DDP): Replicates the model on each device and synchronizes gradients after each backward pass.
  • Elastic Training: Support for fault-tolerant distributed training with configurable rendezvous backends, max restarts, and elastic node counts.

The architecture is organized into several layers:

Launcher layer: Both the v0 (launcher.py) and v1 (v1/launcher.py) launchers detect the distributed environment, determine the number of available devices, and automatically invoke torchrun with appropriate arguments for multi-GPU training. They support multi-node configurations via NNODES, NODE_RANK, MASTER_ADDR, and MASTER_PORT environment variables.

Accelerator interface layer (v1): The DistributedInterface is a singleton that initializes process groups, constructs device meshes for both model parallelism and data parallelism, and provides collective communication primitives (all_gather, all_reduce, broadcast, barrier). It supports multiple device types (CUDA, NPU, XPU, MPS).

Helper layer (v1): The helper module provides low-level utilities for querying rank, world size, device type, and executing collective operations on tensor-like data (tensors, numpy arrays, scalars).

FSDP2 engine: The FSDP2Engine handles model preparation by identifying transformer layer classes via _no_split_modules, applying per-layer fully_shard with mixed-precision policies, enabling gradient checkpointing, and loading weights from either HuggingFace checkpoints or distributed checkpoints (DCP).

Usage

Distributed training is used when:

  • The model does not fit in a single GPU's memory (use FSDP2 or DeepSpeed ZeRO-3).
  • Training needs to be scaled across multiple GPUs for throughput (use DDP or FSDP).
  • Multi-node clusters are available (configure NNODES, MASTER_ADDR, MASTER_PORT).
  • Fault tolerance is required (set RDZV_ID and MAX_RESTARTS for elastic training).
  • NPU or XPU hardware is used (the accelerator interface auto-detects and uses appropriate backends: NCCL for CUDA, HCCL for NPU, Gloo for CPU).

Theoretical Basis

Distributed training strategies differ in how they partition model state across N workers:

Strategy Parameters Gradients Optimizer States Communication
DDP Replicated All-reduce Replicated O(P) per step
ZeRO-1 Replicated All-reduce Sharded O(P) per step
ZeRO-2 Replicated Sharded Sharded O(P) per step
ZeRO-3 / FSDP Sharded Sharded Sharded O(P) per step + all-gather

where P is the total parameter count.

FSDP2 wraps each transformer layer with fully_shard, which:

  1. Shards parameters across the data-parallel group using DTensor.
  2. All-gathers the full parameters before each forward pass.
  3. Optionally reshards parameters after the forward pass to free memory.
  4. Reduce-scatters gradients during the backward pass.

The memory per GPU is approximately:

Memory per GPUP+G+ON+A

where P is parameters, G is gradients, O is optimizer states, N is the number of GPUs, and A is activation memory (which benefits from gradient checkpointing).

The mixed-precision policy in FSDP2 maintains parameters in reduced precision (bf16/fp16) while accumulating gradient reductions in fp32:

MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    cast_forward_inputs=True,
)

The device mesh abstraction organizes GPUs into multi-dimensional grids supporting both model parallelism and data parallelism:

# Model parallel mesh: (replicate, shard)
model_device_mesh = init_device_mesh(
    device_type="cuda",
    mesh_shape=(mp_replicate_size, mp_shard_size),
    mesh_dim_names=("mp_replicate", "mp_shard"),
)
# Data parallel mesh: (dp, cp)
data_device_mesh = init_device_mesh(
    device_type="cuda",
    mesh_shape=(dp_size, cp_size),
    mesh_dim_names=("dp", "cp"),
)

This enables flexible composition of parallelism dimensions (e.g., combining FSDP across 4 GPUs with 2-way context parallelism).

The launcher uses torchrun for process management and configures DDP optimizations:

env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
env["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

These settings improve memory allocation efficiency and reduce NCCL memory overhead in multi-GPU training.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment