Environment: Hiyouga LLaMA Factory Distributed Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-06 20:00 GMT |
Overview
Distributed training environment supporting DeepSpeed ZeRO (stages 0-3), FSDP/FSDP2, and Ray for multi-GPU and multi-node LLM training.
Description
LLaMA Factory supports multiple distributed training strategies. DeepSpeed provides ZeRO optimizer stages 0-3 with optional CPU offloading. PyTorch FSDP and FSDP2 enable fully sharded data parallelism. Ray provides elastic distributed training with fault tolerance. The launcher automatically detects multi-GPU setups and invokes torchrun for distributed execution. Environment variables control node configuration, master address, and elastic launch parameters.
Usage
Use this environment when training on multiple GPUs or multiple nodes. Required for any training configuration with more than one GPU device, unless using KTransformers or Ray. DeepSpeed is activated via the deepspeed training argument, while FSDP is activated via the ACCELERATE_USE_FSDP environment variable.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | 2+ GPUs (same type recommended) | NVIDIA, Ascend NPU, or Intel XPU |
| Network | High-bandwidth interconnect | NVLink/NVSwitch recommended for multi-GPU; InfiniBand for multi-node |
| OS | Linux | Required for NCCL backend |
Dependencies
DeepSpeed
deepspeed >= 0.10.0, <= 0.18.4
FSDP
accelerate >= 1.3.0 (included in core dependencies)
torch >= 2.4.0 (included in core dependencies)
FSDP2
torch >= 2.4.0
accelerate >= 1.3.0
Ray
ray
Megatron-Core Adapter (MCA)
mcore-adapter
Environment Variables
The following environment variables configure distributed training:
- `FORCE_TORCHRUN`: Set to `1` to force torchrun for single-GPU DeepSpeed or MCA.
- `NNODES`: Number of nodes (default: 1).
- `NODE_RANK`: Rank of the current node (default: 0).
- `NPROC_PER_NODE`: Processes per node (default: auto-detected GPU count).
- `MASTER_ADDR`: Master node address (default: 127.0.0.1).
- `MASTER_PORT`: Master node port (default: auto-detected).
- `ACCELERATE_USE_FSDP`: Set to `true` to enable FSDP.
- `FSDP_VERSION`: Set to `2` for FSDP2.
- `USE_RAY`: Set to `1` to enable Ray distributed training.
- `USE_MCA`: Set to `1` to enable Megatron-Core Adapter.
- `USE_KT`: Set to `1` to enable KTransformers (CPU/GPU hybrid).
- `OPTIM_TORCH`: Set to `1` (default) to enable DDP CUDA memory optimizations.
- `MAX_RESTARTS`: Maximum restarts for elastic launch (default: 0).
- `RDZV_ID`: Rendezvous ID for elastic launch.
- `MIN_NNODES`: Minimum nodes for elastic scaling.
- `MAX_NNODES`: Maximum nodes for elastic scaling.
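A minimal sketch of how the node-configuration variables map onto a torchrun invocation, mirroring the launcher defaults quoted under Code Evidence. The fixed fallback port `29500` here stands in for the launcher's auto-detected port, and the elastic-launch variables are omitted.

```python
import os

def build_torchrun_args(device_count: int) -> list[str]:
    """Assemble torchrun arguments from the variables above.
    Sketch only: the real launcher also handles elastic launch
    (MAX_RESTARTS, RDZV_ID, MIN_NNODES/MAX_NNODES)."""
    return [
        "torchrun",
        "--nnodes", os.getenv("NNODES", "1"),
        "--node_rank", os.getenv("NODE_RANK", "0"),
        "--nproc_per_node", os.getenv("NPROC_PER_NODE", str(device_count)),
        "--master_addr", os.getenv("MASTER_ADDR", "127.0.0.1"),
        "--master_port", os.getenv("MASTER_PORT", "29500"),
    ]
```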
Quick Install
# DeepSpeed
pip install "deepspeed>=0.10.0"
# Ray distributed training
pip install ray
# Megatron-Core Adapter
pip install mcore-adapter
Code Evidence
Automatic distributed launch from src/llamafactory/launcher.py:60-68:
if command == "train" and (
    is_env_enabled("FORCE_TORCHRUN") or (get_device_count() > 1 and not use_ray() and not use_kt())
):
    nnodes = os.getenv("NNODES", "1")
    node_rank = os.getenv("NODE_RANK", "0")
    nproc_per_node = os.getenv("NPROC_PER_NODE", str(get_device_count()))
    master_addr = os.getenv("MASTER_ADDR", "127.0.0.1")
    master_port = os.getenv("MASTER_PORT", str(find_available_port()))
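The `is_env_enabled` helper used in the condition above can be approximated as follows. The accepted truthy strings are an assumption; consult the LLaMA Factory source for the authoritative definition.

```python
import os

def is_env_enabled(env_var: str, default: str = "0") -> bool:
    """Approximation of the launcher's flag check: treat '1', 'true',
    and 'y' (case-insensitive) as enabled. Assumed semantics."""
    return os.getenv(env_var, default).lower() in ("1", "true", "y")
```

Note the second argument: a flag like `OPTIM_TORCH` passes `"1"` as its default, which is how it stays enabled when the variable is unset.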
DDP CUDA memory optimization from src/llamafactory/launcher.py:80-83:
if is_env_enabled("OPTIM_TORCH", "1"):
    # optimize DDP, see https://zhuanlan.zhihu.com/p/671834539
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    env["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"
DeepSpeed validation from src/llamafactory/hparams/parser.py:287-288:
if training_args.deepspeed and training_args.parallel_mode != ParallelMode.DISTRIBUTED:
    raise ValueError("Please use `FORCE_TORCHRUN=1` to launch DeepSpeed training.")
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Please use FORCE_TORCHRUN=1 to launch DeepSpeed training` | DeepSpeed without distributed mode | Set `FORCE_TORCHRUN=1` or use `llamafactory-cli train` |
| `Distributed training does not support layer-wise GaLore` | Layer-wise GaLore with multi-GPU | Use non-layerwise GaLore or a single GPU |
| `Layer-wise BAdam only supports DeepSpeed ZeRO-3 training` | BAdam layer mode without ZeRO-3 | Configure DeepSpeed ZeRO-3 |
| `GaLore and APOLLO are incompatible with DeepSpeed yet` | Using GaLore/APOLLO with DeepSpeed | Use FSDP or standard DDP instead |
| `Unsloth is incompatible with DeepSpeed ZeRO-3` | Unsloth with ZeRO-3 | Disable Unsloth or use a different parallelism strategy |
| `predict_with_generate is incompatible with DeepSpeed ZeRO-3` | Evaluation generation with ZeRO-3 | Use ZeRO-2 or run evaluation separately |
Compatibility Notes
- DeepSpeed ZeRO-3: Incompatible with PTQ-quantized models (except MXFP4/FP8), Unsloth, KTransformers, PiSSA init, `pure_bf16`, and `predict_with_generate`.
- FSDP: Incompatible with PTQ-quantized models (except MXFP4/FP8). Supports QLoRA with BitsAndBytes 4-bit only.
- FSDP2: Automatically sets `use_reentrant_gc=False` for gradient checkpointing compatibility.
- Ray: Requires the explicit `USE_RAY=1` environment variable. Auto-detects the head node IP and available ports.
- MCA: Forces `FORCE_TORCHRUN=1`. Patches training args to disable `predict_with_generate`.
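A few of the incompatibilities above can be condensed into a pre-flight check. This is an illustrative sketch, not code from the repository, and it covers only a subset of the combinations listed here.

```python
def check_compat(strategy: str, *, unsloth: bool = False,
                 predict_with_generate: bool = False,
                 galore: bool = False) -> None:
    """Raise on some incompatible combinations listed above (sketch only).
    `strategy` is a hypothetical label such as 'deepspeed_zero3' or 'fsdp'."""
    if strategy == "deepspeed_zero3":
        if unsloth:
            raise ValueError("Unsloth is incompatible with DeepSpeed ZeRO-3")
        if predict_with_generate:
            raise ValueError(
                "predict_with_generate is incompatible with DeepSpeed ZeRO-3"
            )
    if strategy.startswith("deepspeed") and galore:
        raise ValueError("GaLore and APOLLO are incompatible with DeepSpeed yet")
```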