Environment: Hugging Face alignment-handbook DeepSpeed Multi-Node
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training, Deep_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Multi-node distributed training environment with DeepSpeed >= 0.17.2, Accelerate >= 1.9.0, and ZeRO Stage 3 for training large language models across multiple GPU nodes.
Description
This environment provides the distributed training infrastructure for full fine-tuning and large-scale training in the alignment-handbook. It uses HuggingFace Accelerate as the launcher with DeepSpeed ZeRO Stage 3 as the backend. ZeRO-3 shards model parameters, gradients, and optimizer states across all GPUs, enabling training of models that exceed single-GPU memory.
The alignment-handbook also supports FSDP (Fully Sharded Data Parallel) as an alternative distributed backend. Both configs use bfloat16 mixed precision and 8 GPU processes by default.
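The memory benefit of ZeRO-3 sharding can be sketched with back-of-the-envelope arithmetic (a simplified model following the ZeRO paper: 2 bytes/param for bf16 weights, 2 for bf16 gradients, and 12 for fp32 Adam state, all sharded evenly; activations and buffers are ignored):

```python
def zero3_state_bytes_per_gpu(n_params: float, n_gpus: int) -> float:
    """Approximate per-GPU memory for model states under ZeRO-3.

    Assumes bf16 params (2 B) and grads (2 B) plus fp32 Adam state
    (4 B master weights + 4 B momentum + 4 B variance = 12 B),
    sharded evenly across GPUs. Activation memory is not included.
    """
    bytes_per_param = 2 + 2 + 12  # params + grads + optimizer states
    return n_params * bytes_per_param / n_gpus

# A 7B model needs ~112 GB of model states on a single GPU...
single = zero3_state_bytes_per_gpu(7e9, 1) / 1e9
# ...but sharded across 8 GPUs it drops to ~14 GB per GPU.
sharded = zero3_state_bytes_per_gpu(7e9, 8) / 1e9
print(f"{single:.0f} GB -> {sharded:.0f} GB per GPU")  # -> 112 GB -> 14 GB per GPU
```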
Usage
Use this environment for full fine-tuning of large models (7B+) on multi-GPU nodes, or for the SmolLM3 multi-stage post-training pipeline which uses 2-8 node configurations. Required by the DPOTrainer_APO_Zero, SFTTrainer_Mid_Training, and SFTTrainer_Multi_Task implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | DeepSpeed requires Linux for distributed training |
| Hardware | Multiple NVIDIA GPUs (8x A100 80GB typical) | ZeRO-3 shards across all available GPUs |
| Hardware | High-speed interconnect (NVLink/InfiniBand) | Required for efficient multi-GPU/multi-node communication |
| Network | Inter-node connectivity | For multi-node training (rdzv_backend: static) |
Dependencies
Python Packages
- `deepspeed` >= 0.17.2
- `accelerate` >= 1.9.0
- `torch` >= 2.6.0 (peer dependency)
- `ninja` >= 1.11.1 (for DeepSpeed kernel compilation)
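A quick way to confirm the installed versions meet these floors, using only the standard library (a hypothetical `meets_floor` helper; it handles plain numeric versions only, so pre-release tags would need a real parser such as `packaging.version`):

```python
from importlib.metadata import version, PackageNotFoundError

def meets_floor(installed: str, floor: str) -> bool:
    """Compare dotted numeric versions, e.g. '0.17.2' >= '0.9.0'."""
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) >= parse(floor)

# Minimum versions from this environment's dependency list.
floors = {"deepspeed": "0.17.2", "accelerate": "1.9.0", "ninja": "1.11.1"}
for pkg, floor in floors.items():
    try:
        ok = meets_floor(version(pkg), floor)
    except PackageNotFoundError:
        ok = False
    print(f"{pkg}: {'OK' if ok else 'missing or too old'}")
```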
Credentials
No additional credentials required beyond those in the PyTorch_CUDA environment.
Quick Install
```shell
# Installed as part of alignment-handbook
uv pip install .

# Or install standalone (quote the specifiers so the shell
# does not treat ">" as output redirection)
pip install "deepspeed>=0.17.2" "accelerate>=1.9.0" "ninja>=1.11.1"
```
Code Evidence
DeepSpeed version requirement from `setup.py:48`:

```python
"deepspeed>=0.17.2",
```

Accelerate version requirement from `setup.py:44`:

```python
"accelerate>=1.9.0",
```
ZeRO-3 config from `recipes/accelerate_configs/zero3.yaml:3-9`:

```yaml
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
```
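A typical multi-node launch with this config might look like the following (a sketch, not a verified invocation: the node count, rendezvous host variables, and training script path are assumptions that depend on your cluster setup):

```shell
# Run once per node; rank 0 acts as the rendezvous host (static rdzv backend).
accelerate launch \
  --config_file recipes/accelerate_configs/zero3.yaml \
  --num_machines 2 \
  --num_processes 16 \
  --machine_rank $NODE_RANK \
  --main_process_ip $MASTER_ADDR \
  --main_process_port 29500 \
  scripts/run_sft.py recipes/smollm3/sft/sft.yaml
```

`--num_processes` is the total GPU count across all nodes (here 2 nodes x 8 GPUs), not the per-node count.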
FSDP config from `recipes/accelerate_configs/fsdp.yaml:3-15`:

```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_sharding_strategy: FULL_SHARD
```
NCCL timeout workaround from `recipes/smollm3/sft/sft.yaml:195`:

```yaml
ddp_timeout: 18000  # avoid nccl errors when tokenizing large datasets
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `NCCL timeout` during dataset tokenization | Large dataset preprocessing causes communication timeout | Increase `ddp_timeout` (e.g., 18000 seconds as in SmolLM3 configs) |
| `RuntimeError: NCCL error` | Inter-GPU communication failure | Check NCCL installation and network connectivity between nodes |
| `deepspeed.ops.op_builder: NOT_INSTALLED` | DeepSpeed ops were not pre-compiled | Install `ninja` (`pip install "ninja>=1.11.1"`) so DeepSpeed can JIT-compile its ops at first use |
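When diagnosing NCCL failures, enabling NCCL's own logging before launch is often the fastest path to the root cause. A sketch using standard NCCL environment variables (export them on every node; the interface name below is illustrative):

```shell
export NCCL_DEBUG=INFO             # per-rank log of topology setup and errors
export NCCL_DEBUG_SUBSYS=INIT,NET  # limit logs to init and network subsystems
# export NCCL_SOCKET_IFNAME=eth0   # pin the interface if autodetection picks the wrong NIC
```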
Compatibility Notes
- ZeRO-3 vs FSDP: The alignment-handbook supports both. ZeRO-3 is used for the primary training pipelines; FSDP is available as an alternative with its own accelerate config.
- Number of nodes: SmolLM3 mid-training uses 8 nodes (global batch size 128); APO-Zero uses 2 nodes (global batch size 32). Adjust `num_machines` in the accelerate config accordingly.
- ddp_timeout: SmolLM3 configs set this to 14400-18000 seconds to avoid NCCL timeouts during tokenization of large datasets.
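The node counts and global batch sizes above are related by the usual data-parallel identity, GBS = per-device batch x GPUs per node x nodes x gradient-accumulation steps. A tiny sketch (the per-device batch and accumulation values are illustrative, not taken from the recipes):

```python
def global_batch_size(per_device: int, gpus_per_node: int,
                      nodes: int, grad_accum: int) -> int:
    """Effective global batch size for data-parallel training."""
    return per_device * gpus_per_node * nodes * grad_accum

# 8 nodes of 8 GPUs with per-device batch 2 reach GBS 128 without accumulation.
print(global_batch_size(2, 8, 8, 1))  # -> 128
# 2 nodes of 8 GPUs with per-device batch 2 reach GBS 32.
print(global_batch_size(2, 8, 2, 1))  # -> 32
```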