
Environment:Hiyouga LLaMA Factory Distributed Training Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Distributed_Training
Last Updated: 2026-02-06 20:00 GMT

Overview

Distributed training environment supporting DeepSpeed ZeRO (stages 0-3), FSDP/FSDP2, and Ray for multi-GPU and multi-node LLM training.

Description

LLaMA Factory supports multiple distributed training strategies. DeepSpeed provides ZeRO optimizer stages 0-3 with optional CPU offloading. PyTorch FSDP and FSDP2 enable fully sharded data parallelism. Ray provides elastic distributed training with fault tolerance. The launcher automatically detects multi-GPU setups and invokes torchrun for distributed execution. Environment variables control node configuration, master address, and elastic launch parameters.

Usage

Use this environment when training on multiple GPUs or multiple nodes. The torchrun launch path is required for any configuration with more than one GPU device, unless KTransformers or Ray handles the launch instead. DeepSpeed is activated via the deepspeed training argument, while FSDP is activated via the ACCELERATE_USE_FSDP environment variable.
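As a sketch, typical launch commands for the two strategies might look like the following (the YAML file name is a placeholder for your own training config):

```shell
# DeepSpeed: selected via the `deepspeed` argument inside the training config;
# FORCE_TORCHRUN=1 ensures torchrun is used even on a single GPU.
FORCE_TORCHRUN=1 llamafactory-cli train train_config.yaml

# FSDP: enabled via the ACCELERATE_USE_FSDP environment variable
# (add FSDP_VERSION=2 for FSDP2).
ACCELERATE_USE_FSDP=true llamafactory-cli train train_config.yaml
```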

System Requirements

  • Hardware: 2+ GPUs, same type recommended (NVIDIA, Ascend NPU, or Intel XPU)
  • Network: high-bandwidth interconnect (NVLink/NVSwitch recommended for multi-GPU; InfiniBand for multi-node)
  • OS: Linux (required for the NCCL backend)

Dependencies

DeepSpeed

  • deepspeed >= 0.10.0, <= 0.18.4

FSDP

  • accelerate >= 1.3.0 (included in core dependencies)
  • torch >= 2.4.0 (included in core dependencies)

FSDP2

  • torch >= 2.4.0
  • accelerate >= 1.3.0

Ray

  • ray

Megatron-Core Adapter (MCA)

  • mcore-adapter

Environment Variables

The following environment variables configure distributed training:

  • FORCE_TORCHRUN: Set to 1 to force torchrun for single-GPU DeepSpeed or MCA.
  • NNODES: Number of nodes (default: 1).
  • NODE_RANK: Rank of current node (default: 0).
  • NPROC_PER_NODE: Processes per node (default: auto-detected GPU count).
  • MASTER_ADDR: Master node address (default: 127.0.0.1).
  • MASTER_PORT: Master node port (default: auto-detected).
  • ACCELERATE_USE_FSDP: Set to true to enable FSDP.
  • FSDP_VERSION: Set to 2 for FSDP2.
  • USE_RAY: Set to 1 to enable Ray distributed training.
  • USE_MCA: Set to 1 to enable Megatron-Core Adapter.
  • USE_KT: Set to 1 to enable KTransformers (CPU/GPU hybrid).
  • OPTIM_TORCH: Set to 1 (default) to enable DDP CUDA memory optimizations.
  • MAX_RESTARTS: Maximum restarts for elastic launch (default: 0).
  • RDZV_ID: Rendezvous ID for elastic launch.
  • MIN_NNODES: Minimum nodes for elastic scaling.
  • MAX_NNODES: Maximum nodes for elastic scaling.
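Putting the node-configuration variables together, a two-node launch might look like the following sketch (the addresses, process counts, and config file are placeholders):

```shell
# On the master node (rank 0):
NNODES=2 NODE_RANK=0 NPROC_PER_NODE=8 \
MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
llamafactory-cli train train_config.yaml

# On the second node, run the same command with NODE_RANK=1:
NNODES=2 NODE_RANK=1 NPROC_PER_NODE=8 \
MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
llamafactory-cli train train_config.yaml
```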

Quick Install

# DeepSpeed
pip install "deepspeed>=0.10.0,<=0.18.4"

# Ray distributed training
pip install ray

# Megatron-Core Adapter
pip install mcore-adapter

Code Evidence

Automatic distributed launch from src/llamafactory/launcher.py:60-68:

if command == "train" and (
    is_env_enabled("FORCE_TORCHRUN") or (get_device_count() > 1 and not use_ray() and not use_kt())
):
    nnodes = os.getenv("NNODES", "1")
    node_rank = os.getenv("NODE_RANK", "0")
    nproc_per_node = os.getenv("NPROC_PER_NODE", str(get_device_count()))
    master_addr = os.getenv("MASTER_ADDR", "127.0.0.1")
    master_port = os.getenv("MASTER_PORT", str(find_available_port()))
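With these values resolved, the launcher effectively assembles a torchrun invocation along these lines (a reconstruction for illustration, not a verbatim quote of the source; the script path is a placeholder):

```shell
torchrun --nnodes "$NNODES" --node_rank "$NODE_RANK" \
    --nproc_per_node "$NPROC_PER_NODE" \
    --master_addr "$MASTER_ADDR" --master_port "$MASTER_PORT" \
    path/to/train_script.py train_config.yaml
```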

DDP CUDA memory optimization from src/llamafactory/launcher.py:80-83:

if is_env_enabled("OPTIM_TORCH", "1"):
    # optimize DDP, see https://zhuanlan.zhihu.com/p/671834539
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    env["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

DeepSpeed validation from src/llamafactory/hparams/parser.py:287-288:

if training_args.deepspeed and training_args.parallel_mode != ParallelMode.DISTRIBUTED:
    raise ValueError("Please use `FORCE_TORCHRUN=1` to launch DeepSpeed training.")

Common Errors

  • "Please use FORCE_TORCHRUN=1 to launch DeepSpeed training": DeepSpeed was configured without distributed mode. Set FORCE_TORCHRUN=1 or launch via llamafactory-cli train.
  • "Distributed training does not support layer-wise GaLore": layer-wise GaLore was combined with multi-GPU training. Use non-layerwise GaLore or a single GPU.
  • "Layer-wise BAdam only supports DeepSpeed ZeRO-3 training": BAdam layer mode was used without ZeRO-3. Configure DeepSpeed ZeRO-3.
  • "GaLore and APOLLO are incompatible with DeepSpeed yet": GaLore/APOLLO was combined with DeepSpeed. Use FSDP or standard DDP instead.
  • "Unsloth is incompatible with DeepSpeed ZeRO-3": Unsloth was combined with ZeRO-3. Disable Unsloth or use a different parallelism strategy.
  • "predict_with_generate is incompatible with DeepSpeed ZeRO-3": generation during evaluation was requested under ZeRO-3. Use ZeRO-2 or run evaluation separately.

Compatibility Notes

  • DeepSpeed ZeRO-3: Incompatible with PTQ-quantized models (except MXFP4/FP8), Unsloth, KTransformers, PiSSA init, pure_bf16, and predict_with_generate.
  • FSDP: Incompatible with PTQ-quantized models (except MXFP4/FP8). Supports QLoRA with BitsAndBytes 4-bit only.
  • FSDP2: Automatically sets use_reentrant_gc=False for gradient checkpointing compatibility.
  • Ray: Requires explicit USE_RAY=1 environment variable. Auto-detects head node IP and available ports.
  • MCA: Forces FORCE_TORCHRUN=1. Patches training args to disable predict_with_generate.
