
Environment:Hiyouga LLaMA Factory Distributed Training Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Distributed_Training
Last Updated: 2026-02-06 20:00 GMT

Overview

Distributed training environment supporting DeepSpeed ZeRO (stages 0-3), FSDP/FSDP2, and Ray for multi-GPU and multi-node LLM training.

Description

LLaMA Factory supports multiple distributed training strategies. DeepSpeed provides ZeRO optimizer stages 0-3 with optional CPU offloading. PyTorch FSDP and FSDP2 enable fully sharded data parallelism. Ray provides elastic distributed training with fault tolerance. The launcher automatically detects multi-GPU setups and invokes torchrun for distributed execution. Environment variables control node configuration, master address, and elastic launch parameters.

Usage

Use this environment when training on multiple GPUs or multiple nodes. The torchrun launch path is required for any configuration with more than one GPU device, unless KTransformers or Ray handles the launch instead. DeepSpeed is activated via the deepspeed training argument, while FSDP is activated via the ACCELERATE_USE_FSDP environment variable.
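As a sketch, typical launch commands for the two strategies might look like the following (the YAML file name is a placeholder for your own training config):

```shell
# DeepSpeed: selected via the `deepspeed` argument inside the training config;
# FORCE_TORCHRUN=1 ensures torchrun is used even on a single GPU.
FORCE_TORCHRUN=1 llamafactory-cli train train_config.yaml

# FSDP: enabled via the ACCELERATE_USE_FSDP environment variable
# (add FSDP_VERSION=2 for FSDP2).
ACCELERATE_USE_FSDP=true llamafactory-cli train train_config.yaml
```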

System Requirements

  • Hardware: 2+ GPUs, same type recommended (NVIDIA, Ascend NPU, or Intel XPU)
  • Network: high-bandwidth interconnect (NVLink/NVSwitch recommended for multi-GPU; InfiniBand for multi-node)
  • OS: Linux (required for the NCCL backend)

Dependencies

DeepSpeed

  • deepspeed >= 0.10.0, <= 0.18.4

FSDP

  • accelerate >= 1.3.0 (included in core dependencies)
  • torch >= 2.4.0 (included in core dependencies)

FSDP2

  • torch >= 2.4.0
  • accelerate >= 1.3.0

Ray

  • ray

Megatron-Core Adapter (MCA)

  • mcore-adapter

Environment Variables

The following environment variables configure distributed training:

  • FORCE_TORCHRUN: Set to 1 to force torchrun for single-GPU DeepSpeed or MCA.
  • NNODES: Number of nodes (default: 1).
  • NODE_RANK: Rank of current node (default: 0).
  • NPROC_PER_NODE: Processes per node (default: auto-detected GPU count).
  • MASTER_ADDR: Master node address (default: 127.0.0.1).
  • MASTER_PORT: Master node port (default: auto-detected).
  • ACCELERATE_USE_FSDP: Set to true to enable FSDP.
  • FSDP_VERSION: Set to 2 for FSDP2.
  • USE_RAY: Set to 1 to enable Ray distributed training.
  • USE_MCA: Set to 1 to enable Megatron-Core Adapter.
  • USE_KT: Set to 1 to enable KTransformers (CPU/GPU hybrid).
  • OPTIM_TORCH: Set to 1 (default) to enable DDP CUDA memory optimizations.
  • MAX_RESTARTS: Maximum restarts for elastic launch (default: 0).
  • RDZV_ID: Rendezvous ID for elastic launch.
  • MIN_NNODES: Minimum nodes for elastic scaling.
  • MAX_NNODES: Maximum nodes for elastic scaling.
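Putting the node-configuration variables together, a two-node launch might look like the following sketch (the addresses, process counts, and config file are placeholders):

```shell
# On the master node (rank 0):
NNODES=2 NODE_RANK=0 NPROC_PER_NODE=8 \
MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
llamafactory-cli train train_config.yaml

# On the second node, run the same command with NODE_RANK=1:
NNODES=2 NODE_RANK=1 NPROC_PER_NODE=8 \
MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
llamafactory-cli train train_config.yaml
```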

Quick Install

# DeepSpeed
pip install "deepspeed>=0.10.0,<=0.18.4"

# Ray distributed training
pip install ray

# Megatron-Core Adapter
pip install mcore-adapter

Code Evidence

Automatic distributed launch from src/llamafactory/launcher.py:60-68:

if command == "train" and (
    is_env_enabled("FORCE_TORCHRUN") or (get_device_count() > 1 and not use_ray() and not use_kt())
):
    nnodes = os.getenv("NNODES", "1")
    node_rank = os.getenv("NODE_RANK", "0")
    nproc_per_node = os.getenv("NPROC_PER_NODE", str(get_device_count()))
    master_addr = os.getenv("MASTER_ADDR", "127.0.0.1")
    master_port = os.getenv("MASTER_PORT", str(find_available_port()))
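With these values resolved, the launcher effectively assembles a torchrun invocation along these lines (a reconstruction for illustration, not a verbatim quote of the source; the script path is a placeholder):

```shell
torchrun --nnodes "$NNODES" --node_rank "$NODE_RANK" \
    --nproc_per_node "$NPROC_PER_NODE" \
    --master_addr "$MASTER_ADDR" --master_port "$MASTER_PORT" \
    path/to/train_script.py train_config.yaml
```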

DDP CUDA memory optimization from src/llamafactory/launcher.py:80-83:

if is_env_enabled("OPTIM_TORCH", "1"):
    # optimize DDP, see https://zhuanlan.zhihu.com/p/671834539
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    env["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

DeepSpeed validation from src/llamafactory/hparams/parser.py:287-288:

if training_args.deepspeed and training_args.parallel_mode != ParallelMode.DISTRIBUTED:
    raise ValueError("Please use `FORCE_TORCHRUN=1` to launch DeepSpeed training.")

Common Errors

  • "Please use FORCE_TORCHRUN=1 to launch DeepSpeed training": DeepSpeed was configured without distributed mode. Set FORCE_TORCHRUN=1 or launch via llamafactory-cli train.
  • "Distributed training does not support layer-wise GaLore": layer-wise GaLore was combined with multi-GPU training. Use non-layerwise GaLore or a single GPU.
  • "Layer-wise BAdam only supports DeepSpeed ZeRO-3 training": BAdam layer mode was used without ZeRO-3. Configure DeepSpeed ZeRO-3.
  • "GaLore and APOLLO are incompatible with DeepSpeed yet": GaLore/APOLLO was combined with DeepSpeed. Use FSDP or standard DDP instead.
  • "Unsloth is incompatible with DeepSpeed ZeRO-3": Unsloth was combined with ZeRO-3. Disable Unsloth or use a different parallelism strategy.
  • "predict_with_generate is incompatible with DeepSpeed ZeRO-3": generation during evaluation was requested under ZeRO-3. Use ZeRO-2 or run evaluation separately.

Compatibility Notes

  • DeepSpeed ZeRO-3: Incompatible with PTQ-quantized models (except MXFP4/FP8), Unsloth, KTransformers, PiSSA init, pure_bf16, and predict_with_generate.
  • FSDP: Incompatible with PTQ-quantized models (except MXFP4/FP8). Supports QLoRA with BitsAndBytes 4-bit only.
  • FSDP2: Automatically sets use_reentrant_gc=False for gradient checkpointing compatibility.
  • Ray: Requires explicit USE_RAY=1 environment variable. Auto-detects head node IP and available ports.
  • MCA: Forces FORCE_TORCHRUN=1. Patches training args to disable predict_with_generate.
