
Environment:Axolotl ai cloud Axolotl Multi GPU

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Distributed_Training
Last Updated 2026-02-06 22:33 GMT

Overview

Multi-GPU distributed training environment supporting FSDP (versions 1 and 2) or DeepSpeed ZeRO sharding, the NCCL communication backend, and tensor/context parallelism.

Description

This environment extends the CUDA_GPU environment with distributed training capabilities across multiple GPUs. It supports FSDP (version 1 and 2), DeepSpeed ZeRO (stages 0-3), tensor parallelism, context parallelism, and hybrid sharding (HSDP). The runtime programmatically sets dozens of environment variables for Accelerate, FSDP, and DeepSpeed based on the user's YAML configuration. NCCL is used as the communication backend, with automatic P2P support detection and fallback.

Usage

Use this environment when training requires more than one GPU, either for data parallelism, model parallelism, or sharding strategies. Required for FSDP weight consolidation, multi-node training, and large models that do not fit on a single GPU (e.g., 70B+ parameter models).
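As a back-of-the-envelope illustration of the single-GPU limit mentioned above, the per-GPU memory for model weights alone under full parameter sharding (ZeRO-3 / FSDP full shard) can be estimated as follows. The constants here are generic arithmetic, not Axolotl-specific values:

```python
def per_gpu_param_gib(n_params: float, bytes_per_param: int, n_gpus: int) -> float:
    """Rough per-GPU memory for model weights alone when parameters are
    fully sharded across GPUs (ZeRO-3 / FSDP full shard). Ignores
    gradients, optimizer states, and activations, which add several x."""
    return n_params * bytes_per_param / n_gpus / 2**30

# A 70B-parameter model in bf16 (2 bytes/param):
# on 1 GPU:  ~130 GiB of weights alone, beyond any single card
# on 8 GPUs: ~16 GiB of weights per GPU, feasible with sharding
```

Gradients and optimizer states typically multiply this figure several-fold, which is why ZeRO-2/3 also shard those states.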

System Requirements

  • OS: Linux (Ubuntu 20.04+). Multi-GPU is not supported on macOS or Windows.
  • Hardware: 2+ NVIDIA GPUs connected via NVLink or PCIe; NVLink recommended for large models.
  • Network: InfiniBand or high-speed Ethernet, required for multi-node; single-node traffic uses NVLink/PCIe.
  • NCCL: libnccl2 (bundled with PyTorch). NCCL P2P is auto-detected and disabled if unsupported.
  • Disk: 100 GB+ SSD per node, for model shards, checkpoints, and DeepSpeed offload.

Dependencies

System Packages

  • `nvidia-driver` >= 525
  • `libnccl2` (typically bundled with PyTorch CUDA)

Python Packages

  • All packages from Environment:Axolotl_ai_cloud_Axolotl_CUDA_GPU
  • `deepspeed` >= 0.18.3 (optional, for DeepSpeed ZeRO strategies)
  • `deepspeed-kernels` (optional, with DeepSpeed)
  • `ray[train]` >= 2.52.1 (optional, for Ray-based distributed training)

Credentials

All credentials from Environment:Axolotl_ai_cloud_Axolotl_Python_Runtime apply, plus:

Distributed Training Variables (auto-set by Axolotl):

  • `WORLD_SIZE`: Total number of processes (default: "1").
  • `LOCAL_RANK`: Local rank of the process (default: "0").
  • `RANK`: Global rank of the process.
  • `MASTER_ADDR`: Address of the master node for distributed init.
  • `MASTER_PORT`: Port for distributed communication.
  • `NODE_WORLD_SIZE`: Processes per node (default: "8").
  • `AXOLOTL_NCCL_TIMEOUT`: NCCL timeout in seconds (default: 1800 = 30 minutes).
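A minimal sketch of how a training process consumes these rendezvous variables, following standard torch.distributed conventions. The "1"/"0" fallbacks match the defaults listed above; the 127.0.0.1/29500 fallbacks are common torchrun defaults and are an assumption here, not values taken from Axolotl:

```python
import os

def read_dist_env() -> dict:
    """Read the standard torch.distributed rendezvous variables, falling
    back to single-process defaults when launched without a launcher."""
    return {
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
    }
```

Launchers such as `torchrun` and `accelerate launch` set these variables for every spawned process, so training code rarely sets them by hand.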

DeepSpeed Variables (auto-set when deepspeed config used):

  • `ACCELERATE_USE_DEEPSPEED`: Set to "true" when DeepSpeed is active.
  • `ACCELERATE_DEEPSPEED_CONFIG_FILE`: Path to DeepSpeed JSON config.
  • `ACCELERATE_DEEPSPEED_ZERO_STAGE`: ZeRO stage (0, 1, 2, or 3).
  • `ACCELERATE_DEEPSPEED_ZERO3_INIT`: Set to "true" for ZeRO stage 3.
  • `ACCELERATE_GRADIENT_ACCUMULATION_STEPS`: Gradient accumulation steps.

FSDP Variables (auto-set when fsdp_config used):

  • `ACCELERATE_USE_FSDP`: Set to "true" when FSDP is active.
  • `FSDP_VERSION`: "1" or "2" (FSDP2 recommended).
  • `FSDP_ACTIVATION_CHECKPOINTING`: Enable activation checkpointing.
  • `FSDP_OFFLOAD_PARAMS`: Enable CPU parameter offloading.
  • `FSDP_SYNC_MODULE_STATES`: Sync module states across ranks.
  • `FSDP_CPU_RAM_EFFICIENT_LOADING`: Enable RAM-efficient loading.
  • `FSDP_USE_ORIG_PARAMS`: Use original parameters (required for torch.compile).
  • `FSDP_STATE_DICT_TYPE`: Checkpoint format (FULL_STATE_DICT, SHARDED_STATE_DICT).
  • `FSDP_AUTO_WRAP_POLICY`: Wrapping policy (TRANSFORMER_BASED_WRAP).
  • `FSDP_TRANSFORMER_CLS_TO_WRAP`: Transformer class name to wrap.
  • `FSDP_RESHARD_AFTER_FORWARD`: Resharding strategy (default "2" = Reshard).

Tensor/Context Parallelism Variables:

  • `PARALLELISM_CONFIG_TP_SIZE`: Tensor parallel size.
  • `PARALLELISM_CONFIG_DP_SHARD_SIZE`: Data parallel shard size.
  • `PARALLELISM_CONFIG_DP_REPLICATE_SIZE`: Data parallel replicate size.
  • `PARALLELISM_CONFIG_CP_SIZE`: Context parallel size.
  • `ACCELERATE_USE_PARALLELISM_CONFIG`: Enable N-D parallelism config.
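These four degrees must jointly factor the total process count, since every process belongs to exactly one (TP, CP, DP-shard, DP-replicate) coordinate. A small consistency check sketching that constraint; the helper name is illustrative, not an Axolotl API:

```python
def check_parallelism(world_size: int, tp: int = 1, cp: int = 1,
                      dp_shard: int = 1, dp_replicate: int = 1) -> bool:
    """The N-D parallelism degrees must factor the total process count:
    world_size == tp * cp * dp_shard * dp_replicate."""
    return tp * cp * dp_shard * dp_replicate == world_size

# 8 GPUs arranged as TP=2 x CP=2 x DP_SHARD=2 is a valid layout;
# TP=4 x CP=4 would need 16 processes and is rejected.
```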

NCCL Variables:

  • `NCCL_P2P_DISABLE`: Set to "1" when P2P not supported between GPUs.
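The fallback reduces to a pairwise check: if any GPU pair cannot access each other's memory directly, P2P must be disabled globally. A sketch of that decision as pure logic over a peer-access matrix; on a live system, `torch.cuda.can_device_access_peer(i, j)` would supply the entries:

```python
def should_disable_p2p(peer_matrix) -> bool:
    """peer_matrix[i][j] is True when GPU i can directly access GPU j's
    memory. NCCL P2P should be disabled (NCCL_P2P_DISABLE=1) if any
    distinct pair lacks direct access, forcing transfers through host
    memory instead."""
    n = len(peer_matrix)
    return any(not peer_matrix[i][j]
               for i in range(n) for j in range(n) if i != j)
```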

Quick Install

# Install Axolotl with DeepSpeed support
pip install "axolotl[deepspeed]"

# Launch multi-GPU training with Accelerate
accelerate launch -m axolotl.cli.train config.yaml

# Or use torchrun directly
torchrun --nproc_per_node=4 -m axolotl.cli.train config.yaml

Code Evidence

Distributed environment variable setup from `src/axolotl/utils/trainer.py:545-599`:

def prepare_optim_env(cfg, kwargs_handlers=None):
    # DeepSpeed setup
    if cfg.deepspeed:
        os.environ["ACCELERATE_USE_DEEPSPEED"] = "true"
        os.environ["ACCELERATE_DEEPSPEED_CONFIG_FILE"] = cfg.deepspeed
        if "zero3" in cfg.deepspeed:
            os.environ["ACCELERATE_DEEPSPEED_ZERO_STAGE"] = "3"
            os.environ["ACCELERATE_DEEPSPEED_ZERO3_INIT"] = "true"

FSDP configuration from `src/axolotl/utils/trainer.py:601-630`:

if cfg.fsdp_config:
    os.environ["ACCELERATE_USE_FSDP"] = "true"
    if cfg.fsdp_config.fsdp_version == 2:
        os.environ["FSDP_VERSION"] = "2"
    if cfg.fsdp_config.fsdp_activation_checkpointing:
        os.environ["FSDP_ACTIVATION_CHECKPOINTING"] = "true"

NCCL P2P detection from `src/axolotl/utils/trainer.py:657-659`:

if not torch.cuda.is_available() or not torch.cuda.get_device_properties(0).major:
    os.environ["NCCL_P2P_DISABLE"] = "1"

NCCL timeout configuration from `src/axolotl/utils/distributed.py:53`:

timeout = int(os.environ.get("AXOLOTL_NCCL_TIMEOUT", 1800))
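That integer is typically wrapped in a `timedelta` before being handed to `torch.distributed.init_process_group(..., timeout=...)`. A sketch of the resolution step; the helper name is illustrative:

```python
import os
from datetime import timedelta

def nccl_timeout() -> timedelta:
    """Resolve the NCCL timeout the way the snippet above does
    (seconds, default 1800) and wrap it in the timedelta that
    torch.distributed.init_process_group expects."""
    return timedelta(seconds=int(os.environ.get("AXOLOTL_NCCL_TIMEOUT", 1800)))
```

Raising this value before launch is the remedy for timeout errors during slow large-model loading (see Common Errors below).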

Common Errors

  • `NCCL error: unhandled cuda error`: NCCL communication failure. Check the GPU interconnect; set `NCCL_P2P_DISABLE=1` if P2P is not supported.
  • `torch.utils.checkpoint.CheckpointError: Recomputed values have different metadata`: known incompatibility between QLoRA, ZeRO3, and `use_reentrant: False`. Try `use_reentrant: True` or switch to FSDP.
  • `FSDP2 not compatible with adamw_8bit`: the 8-bit optimizer is incompatible with FSDP2. Use the `adamw_torch_8bit` optimizer instead.
  • `RuntimeError: Timeout in distributed init`: the NCCL timeout is too short for large-model loading. Increase `AXOLOTL_NCCL_TIMEOUT` (default 1800 seconds).
  • Checkpoint consolidation fails: sharded FSDP checkpoints need merging. Use the `axolotl merge-sharded-fsdp-weights` CLI command.

Compatibility Notes

  • FSDP1 vs FSDP2: FSDP1 is deprecated in Axolotl. FSDP2 is recommended for better performance and compatibility.
  • DeepSpeed ZeRO3: Requires `ACCELERATE_DEEPSPEED_ZERO3_INIT=true` for proper model initialization. QLoRA + ZeRO3 may cause CheckpointError with `use_reentrant: False`.
  • Tensor Parallelism: Supported via `PARALLELISM_CONFIG_TP_SIZE`. Requires all GPUs within a TP group to be on the same node with NVLink.
  • Context Parallelism: Supported via `PARALLELISM_CONFIG_CP_SIZE` for long sequence training.
  • Multi-node: Requires SLURM or manual torchrun setup. See `examples/slurm/README.md` for SLURM guide.
  • NCCL P2P: Auto-detected. Disabled automatically when not supported (e.g., mixed GPU generations).
