Environment: Hugging Face alignment-handbook DeepSpeed Multi-Node
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training, Deep_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Multi-node distributed training environment with DeepSpeed >= 0.17.2, Accelerate >= 1.9.0, and ZeRO Stage 3 for training large language models across multiple GPU nodes.
Description
This environment provides the distributed training infrastructure for full fine-tuning and large-scale training in the alignment-handbook. It uses HuggingFace Accelerate as the launcher with DeepSpeed ZeRO Stage 3 as the backend. ZeRO-3 shards model parameters, gradients, and optimizer states across all GPUs, enabling training of models that exceed single-GPU memory.
The alignment-handbook also supports FSDP (Fully Sharded Data Parallel) as an alternative distributed backend. Both configs use bfloat16 mixed precision and 8 GPU processes by default.
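The memory benefit of ZeRO-3 sharding can be sketched with back-of-the-envelope arithmetic (a simplified model following the ZeRO paper: 2 bytes/param for bf16 weights, 2 for bf16 gradients, and 12 for fp32 Adam state, all sharded evenly; activations and buffers are ignored):

```python
def zero3_state_bytes_per_gpu(n_params: float, n_gpus: int) -> float:
    """Approximate per-GPU memory for model states under ZeRO-3.

    Assumes bf16 params (2 B) and grads (2 B) plus fp32 Adam state
    (4 B master weights + 4 B momentum + 4 B variance = 12 B),
    sharded evenly across GPUs. Activation memory is not included.
    """
    bytes_per_param = 2 + 2 + 12  # params + grads + optimizer states
    return n_params * bytes_per_param / n_gpus

# A 7B model needs ~112 GB of model states on a single GPU...
single = zero3_state_bytes_per_gpu(7e9, 1) / 1e9
# ...but sharded across 8 GPUs it drops to ~14 GB per GPU.
sharded = zero3_state_bytes_per_gpu(7e9, 8) / 1e9
print(f"{single:.0f} GB -> {sharded:.0f} GB per GPU")  # -> 112 GB -> 14 GB per GPU
```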
Usage
Use this environment for full fine-tuning of large models (7B+) on multi-GPU nodes, or for the SmolLM3 multi-stage post-training pipeline which uses 2-8 node configurations. Required by the DPOTrainer_APO_Zero, SFTTrainer_Mid_Training, and SFTTrainer_Multi_Task implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | DeepSpeed requires Linux for distributed training |
| Hardware | Multiple NVIDIA GPUs (8x A100 80GB typical) | ZeRO-3 shards across all available GPUs |
| Hardware | High-speed interconnect (NVLink/InfiniBand) | Required for efficient multi-GPU/multi-node communication |
| Network | Inter-node connectivity | For multi-node training (rdzv_backend: static) |
Dependencies
Python Packages
- `deepspeed` >= 0.17.2
- `accelerate` >= 1.9.0
- `torch` >= 2.6.0 (peer dependency)
- `ninja` >= 1.11.1 (for DeepSpeed kernel compilation)
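A quick way to confirm the installed versions meet these floors, using only the standard library (a hypothetical `meets_floor` helper; it handles plain numeric versions only, so pre-release tags would need a real parser such as `packaging.version`):

```python
from importlib.metadata import version, PackageNotFoundError

def meets_floor(installed: str, floor: str) -> bool:
    """Compare dotted numeric versions, e.g. '0.17.2' >= '0.9.0'."""
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) >= parse(floor)

# Minimum versions from this environment's dependency list.
floors = {"deepspeed": "0.17.2", "accelerate": "1.9.0", "ninja": "1.11.1"}
for pkg, floor in floors.items():
    try:
        ok = meets_floor(version(pkg), floor)
    except PackageNotFoundError:
        ok = False
    print(f"{pkg}: {'OK' if ok else 'missing or too old'}")
```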
Credentials
No additional credentials required beyond those in the PyTorch_CUDA environment.
Quick Install
```shell
# Installed as part of alignment-handbook
uv pip install .

# Or install standalone (quote the specifiers so the shell
# does not treat ">" as output redirection)
pip install "deepspeed>=0.17.2" "accelerate>=1.9.0" "ninja>=1.11.1"
```
Code Evidence
DeepSpeed version requirement from `setup.py:48`:

```python
"deepspeed>=0.17.2",
```

Accelerate version requirement from `setup.py:44`:

```python
"accelerate>=1.9.0",
```
ZeRO-3 config from `recipes/accelerate_configs/zero3.yaml:3-9`:

```yaml
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
```
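A typical multi-node launch with this config might look like the following (a sketch, not a verified invocation: the node count, rendezvous host variables, and training script path are assumptions that depend on your cluster setup):

```shell
# Run once per node; rank 0 acts as the rendezvous host (static rdzv backend).
accelerate launch \
  --config_file recipes/accelerate_configs/zero3.yaml \
  --num_machines 2 \
  --num_processes 16 \
  --machine_rank $NODE_RANK \
  --main_process_ip $MASTER_ADDR \
  --main_process_port 29500 \
  scripts/run_sft.py recipes/smollm3/sft/sft.yaml
```

`--num_processes` is the total GPU count across all nodes (here 2 nodes x 8 GPUs), not the per-node count.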
FSDP config from `recipes/accelerate_configs/fsdp.yaml:3-15`:

```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_sharding_strategy: FULL_SHARD
```
NCCL timeout workaround from `recipes/smollm3/sft/sft.yaml:195`:

```yaml
ddp_timeout: 18000  # avoid nccl errors when tokenizing large datasets
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `NCCL timeout` during dataset tokenization | Large dataset preprocessing causes communication timeout | Increase `ddp_timeout` (e.g., 18000 seconds as in SmolLM3 configs) |
| `RuntimeError: NCCL error` | Inter-GPU communication failure | Check NCCL installation and network connectivity between nodes |
| `deepspeed.ops.op_builder: NOT_INSTALLED` | DeepSpeed ops were not pre-compiled | Install `ninja` (`pip install "ninja>=1.11.1"`) so DeepSpeed can JIT-compile its ops at first use |
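When diagnosing NCCL failures, enabling NCCL's own logging before launch is often the fastest path to the root cause. A sketch using standard NCCL environment variables (export them on every node; the interface name below is illustrative):

```shell
export NCCL_DEBUG=INFO             # per-rank log of topology setup and errors
export NCCL_DEBUG_SUBSYS=INIT,NET  # limit logs to init and network subsystems
# export NCCL_SOCKET_IFNAME=eth0   # pin the interface if autodetection picks the wrong NIC
```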
Compatibility Notes
- ZeRO-3 vs FSDP: The alignment-handbook supports both. ZeRO-3 is used for the primary training pipelines; FSDP is available as an alternative with its own accelerate config.
- Number of nodes: SmolLM3 mid-training uses 8 nodes (global batch size 128); APO-Zero uses 2 nodes (global batch size 32). Adjust `num_machines` in the accelerate config accordingly.
- ddp_timeout: SmolLM3 configs set this to 14400-18000 seconds to avoid NCCL timeouts during tokenization of large datasets.
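The node counts and global batch sizes above are related by the usual data-parallel identity, GBS = per-device batch x GPUs per node x nodes x gradient-accumulation steps. A tiny sketch (the per-device batch and accumulation values are illustrative, not taken from the recipes):

```python
def global_batch_size(per_device: int, gpus_per_node: int,
                      nodes: int, grad_accum: int) -> int:
    """Effective global batch size for data-parallel training."""
    return per_device * gpus_per_node * nodes * grad_accum

# 8 nodes of 8 GPUs with per-device batch 2 reach GBS 128 without accumulation.
print(global_batch_size(2, 8, 8, 1))  # -> 128
# 2 nodes of 8 GPUs with per-device batch 2 reach GBS 32.
print(global_batch_size(2, 8, 2, 1))  # -> 32
```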