
Environment: Hugging Face Alignment Handbook DeepSpeed Multi-Node

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Distributed_Training, Deep_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

Multi-node distributed training environment with DeepSpeed >= 0.17.2, Accelerate >= 1.9.0, and ZeRO Stage 3 for training large language models across multiple GPU nodes.

Description

This environment provides the distributed training infrastructure for full fine-tuning and large-scale training in the alignment-handbook. It uses HuggingFace Accelerate as the launcher with DeepSpeed ZeRO Stage 3 as the backend. ZeRO-3 shards model parameters, gradients, and optimizer states across all GPUs, enabling training of models that exceed single-GPU memory.

The alignment-handbook also supports FSDP (Fully Sharded Data Parallel) as an alternative distributed backend. Both accelerate configs (ZeRO-3 and FSDP) default to bfloat16 mixed precision and 8 GPU processes.
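As a sketch of how the pieces fit together, a ZeRO-3 run is started through the accelerate launcher with the handbook's config file. The script and recipe paths below are illustrative assumptions; check your checkout for the exact entry points.

```shell
# Hedged sketch: launch an SFT run with the ZeRO-3 accelerate config.
# Runs only if accelerate is installed; paths are assumptions.
CONFIG=recipes/accelerate_configs/zero3.yaml
if command -v accelerate >/dev/null 2>&1; then
  accelerate launch --config_file "$CONFIG" \
    scripts/run_sft.py recipes/smollm3/sft/sft.yaml
fi
```

Swapping in `fsdp.yaml` for the `--config_file` argument selects the FSDP backend instead.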

Usage

Use this environment for full fine-tuning of large models (7B+) on multi-GPU nodes, or for the SmolLM3 multi-stage post-training pipeline which uses 2-8 node configurations. Required by the DPOTrainer_APO_Zero, SFTTrainer_Mid_Training, and SFTTrainer_Multi_Task implementations.
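For the multi-node configurations mentioned above, every node runs the same launch command and differs only in its machine rank. The addresses, recipe path, and node count below are placeholders, not values from the handbook.

```shell
# Hedged multi-node sketch: run this on each node, changing only MACHINE_RANK.
NUM_MACHINES=2
MACHINE_RANK=0            # 0 on the main node, 1..NUM_MACHINES-1 elsewhere
MAIN_IP=10.0.0.1          # reachable address of the main node (placeholder)
if command -v accelerate >/dev/null 2>&1; then
  accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    --num_machines "$NUM_MACHINES" --machine_rank "$MACHINE_RANK" \
    --main_process_ip "$MAIN_IP" --main_process_port 29500 \
    scripts/run_dpo.py recipes/smollm3/dpo/dpo.yaml
fi
```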

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | DeepSpeed requires Linux for distributed training |
| Hardware | Multiple NVIDIA GPUs (8x A100 80GB typical) | ZeRO-3 shards across all available GPUs |
| Hardware | High-speed interconnect (NVLink/InfiniBand) | Required for efficient multi-GPU/multi-node communication |
| Network | Inter-node connectivity | For multi-node training (`rdzv_backend: static`) |

Dependencies

Python Packages

  • `deepspeed` >= 0.17.2
  • `accelerate` >= 1.9.0
  • `torch` >= 2.6.0 (peer dependency)
  • `ninja` >= 1.11.1 (for DeepSpeed kernel compilation)

Credentials

No additional credentials required beyond those in the PyTorch_CUDA environment.

Quick Install

# Installed as part of alignment-handbook
uv pip install .

# Or install standalone
pip install "deepspeed>=0.17.2" "accelerate>=1.9.0" "ninja>=1.11.1"
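After installing, a quick sanity check confirms the versions and DeepSpeed's compiled-ops status; `ds_report` is the diagnostic DeepSpeed ships for this. Both commands are no-ops if the tools are absent.

```shell
# Minimum versions required by the alignment-handbook setup.py
MIN_DEEPSPEED=0.17.2
MIN_ACCELERATE=1.9.0
# Show what is actually installed (silent if pip is unavailable)
pip list 2>/dev/null | grep -E '^(deepspeed|accelerate|ninja) ' || true
# DeepSpeed's own environment report: CUDA version, compiled-op status
if command -v ds_report >/dev/null 2>&1; then ds_report; fi
```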

Code Evidence

DeepSpeed version requirement from `setup.py:48`:

    "deepspeed>=0.17.2",

Accelerate version requirement from `setup.py:44`:

    "accelerate>=1.9.0",

ZeRO-3 config from `recipes/accelerate_configs/zero3.yaml:3-9`:

deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
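Both offload devices are `none` above, keeping all sharded state on GPU. A hedged variant (not one of the handbook's shipped configs) that trades step time for GPU memory would move optimizer state and parameters to host RAM:

```yaml
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu   # optimizer state in host RAM
  offload_param_device: cpu       # sharded params in host RAM
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
```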

FSDP config from `recipes/accelerate_configs/fsdp.yaml:3-15`:

distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_sharding_strategy: FULL_SHARD

NCCL timeout workaround from `recipes/smollm3/sft/sft.yaml:195`:

ddp_timeout: 18000 # avoid nccl errors when tokenizing large datasets
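The value is in seconds, and the arithmetic explains the choice: while rank 0 tokenizes a large dataset, the other ranks wait at the first collective, so the timeout must cover the whole preprocessing pass.

```shell
# 18000 s of headroom before NCCL gives up on a waiting collective
DDP_TIMEOUT=18000
echo "timeout = $((DDP_TIMEOUT / 3600)) h"   # prints "timeout = 5 h"
```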

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `NCCL timeout` during dataset tokenization | Large dataset preprocessing causes a communication timeout | Increase `ddp_timeout` (e.g., 18000 seconds, as in the SmolLM3 configs) |
| `RuntimeError: NCCL error` | Inter-GPU communication failure | Check the NCCL installation and network connectivity between nodes |
| `deepspeed.ops.op_builder: NOT_INSTALLED` | DeepSpeed ops not compiled | Install ninja and reinstall: `pip install ninja "deepspeed>=0.17.2"` |
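When diagnosing NCCL failures, a common first step (a hedged suggestion, not a handbook-specific recipe) is to enable NCCL's verbose logging before relaunching and then inspect the per-rank logs:

```shell
export NCCL_DEBUG=INFO               # log NCCL init and transport decisions
export NCCL_DEBUG_SUBSYS=INIT,NET    # narrow the noise to init + networking
# export NCCL_SOCKET_IFNAME=eth0     # pin the interface if autodetection picks the wrong one
```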

Compatibility Notes

  • ZeRO-3 vs FSDP: The alignment-handbook supports both. ZeRO-3 is used for the primary training pipelines; FSDP is available as an alternative with its own accelerate config.
  • Number of nodes: SmolLM3 mid-training uses 8 nodes (GBS 128); APO-Zero uses 2 nodes (GBS 32). Adjust `num_machines` in the accelerate config.
  • ddp_timeout: SmolLM3 configs set this to 14400-18000 seconds to avoid NCCL timeouts during large dataset tokenization.
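The node counts above determine the global batch size (GBS). A back-of-envelope check, using illustrative per-device values that are assumptions rather than values from the recipes:

```shell
# GBS = per-device batch x gradient accumulation x world size.
# Per-device batch and accumulation below are illustrative assumptions.
PER_DEVICE=2; GRAD_ACCUM=1; GPUS_PER_NODE=8; NODES=8
WORLD_SIZE=$((GPUS_PER_NODE * NODES))          # 64 GPUs across 8 nodes
GBS=$((PER_DEVICE * GRAD_ACCUM * WORLD_SIZE))  # 128, consistent with mid-training
echo "world_size=$WORLD_SIZE gbs=$GBS"
```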
