

Heuristic:Huggingface Open r1 Distributed Training Configuration

From Leeroopedia



Knowledge Sources
Domains: Optimization, Deep_Learning, Infrastructure
Last Updated: 2026-02-08 00:00 GMT

Overview

Configuration guidelines for choosing between FSDP, DDP, and DeepSpeed ZeRO strategies in Open-R1 training, with bf16 precision and gradient checkpointing best practices.

Description

Open-R1 ships four pre-configured Accelerate configs for distributed training: DDP, FSDP, DeepSpeed ZeRO-2, and DeepSpeed ZeRO-3. All are configured for 8 processes (GPUs) on a single machine with bf16 mixed precision. The choice between them depends on model size and available VRAM. Additionally, several critical configuration pitfalls exist around dataclass replacement breaking accelerator state, KV cache interaction with gradient checkpointing, and the need to align generation config EOS tokens.

Usage

Use this heuristic when setting up distributed training for SFT or GRPO workflows, when encountering OOM errors, or when troubleshooting distributed training state corruption bugs.

The Insight (Rule of Thumb)

  • Action: Choose distributed strategy based on model size and available VRAM:
    • DDP (ddp.yaml): For models that fit entirely on one GPU. No sharding overhead.
    • FSDP (fsdp.yaml): For larger models that need parameter sharding. Uses FULL_SHARD with transformer-based auto-wrap.
    • ZeRO-2 (zero2.yaml): Shards optimizer states and gradients. Good balance for 7B models.
    • ZeRO-3 (zero3.yaml): Full sharding including parameters. For models that do not fit in GPU memory otherwise. Uses zero3_save_16bit_model: true.
  • Value:
    • All configs: mixed_precision: bf16, num_processes: 8, num_machines: 1.
    • FSDP: fsdp_cpu_ram_efficient_loading: true, fsdp_forward_prefetch: true.
    • FSDP activation checkpointing: Disabled (pending Transformers PR #36610 fix).
    • Reference SFT command: per_device_train_batch_size: 2, gradient_checkpointing: true, bf16: true, use_liger_kernel: true.
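  • Trade-off: More aggressive sharding reduces per-GPU memory at the cost of increased communication overhead and slower training. DDP shards nothing, ZeRO-2 shards optimizer states and gradients, and ZeRO-3 and FSDP with FULL_SHARD additionally shard the parameters, so the last two sit at the memory-saving, communication-heavy end of the range.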
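
The selection rule above can be sketched numerically. The following is a rough estimate and not part of Open-R1: it assumes mixed-precision Adam at about 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 optimizer state, following the accounting in the ZeRO paper), ignores activations and buffers, and uses a hypothetical `headroom` fraction to leave room for them. The function names are illustrative.

```python
# Rough per-GPU memory for model state only (weights, gradients, optimizer).
# Assumption: mixed-precision Adam, about 16 bytes per parameter in total.
BYTES_WEIGHTS = 2   # bf16 parameters
BYTES_GRADS = 2     # bf16 gradients
BYTES_OPTIM = 12    # fp32 master weights + Adam momentum + Adam variance


def per_gpu_state_gb(num_params: float, num_gpus: int = 8) -> dict:
    """Estimate model-state memory per GPU (in GB) for each strategy."""
    weights = BYTES_WEIGHTS * num_params
    grads = BYTES_GRADS * num_params
    optim = BYTES_OPTIM * num_params
    return {
        # DDP: every GPU holds a full replica of everything.
        "ddp.yaml": (weights + grads + optim) / 1e9,
        # ZeRO-2: gradients and optimizer state are sharded, weights are not.
        "zero2.yaml": (weights + (grads + optim) / num_gpus) / 1e9,
        # ZeRO-3: weights, gradients, and optimizer state are all sharded.
        "zero3.yaml": (weights + grads + optim) / num_gpus / 1e9,
    }


def pick_config(num_params, vram_gb, num_gpus=8, headroom=0.6):
    """Return the least-sharded config whose model state fits, else None.

    `headroom` is a hypothetical safety factor, not an Open-R1 setting.
    """
    budget = vram_gb * headroom
    for name, needed in per_gpu_state_gb(num_params, num_gpus).items():
        if needed <= budget:
            return name
    return None
```

Under these assumptions, `pick_config(7e9, 80)` selects `zero2.yaml`, which matches the guidance for 7B models. FSDP with FULL_SHARD has roughly the same model-state footprint as the `zero3.yaml` row, so the same estimate applies to `fsdp.yaml`.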

Reasoning

bf16 mandatory: All Accelerate configs use bf16 mixed precision, which requires Ampere-generation (e.g. A100) or newer NVIDIA GPUs. bf16 is the standard choice for training large language models: it keeps the same exponent range as fp32, which avoids the overflow problems of fp16, while each value takes half the bytes of fp32.
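
To illustrate the range-versus-precision trade-off of bf16, the sketch below emulates bf16 in pure Python by keeping only the top 16 bits of an fp32 value. The helper name `to_bf16` is hypothetical, and it truncates rather than rounding to nearest as hardware typically does, so treat it as an illustration and not as the conversion PyTorch performs.

```python
import struct

FP16_MAX = 65504.0  # largest finite fp16 value, for comparison


def to_bf16(x: float) -> float:
    """Emulate bf16 by truncating an fp32 value to its top 16 bits.

    bf16 keeps fp32's sign bit and 8 exponent bits but only 7 mantissa bits.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= 0xFFFF0000  # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# Values far beyond the fp16 range stay finite in bf16.
large = to_bf16(3.0e38)

# Small differences near 1.0 are lost: the bf16 step size there is 2**-7.
coarse = to_bf16(1.001)
```

Here `large` stays within about 1% of 3.0e38, far above `FP16_MAX`, while `coarse` collapses to exactly 1.0.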

KV cache and gradient checkpointing interaction: The get_model function automatically disables KV cache (use_cache=False) when gradient checkpointing is enabled, because they are incompatible during training. After training, the code explicitly re-enables KV cache for fast inference: trainer.model.config.use_cache = True.
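
The toggle described above can be sketched without loading a real model. In this minimal example, `TrainingArgsStub`, `build_model_kwargs`, and `restore_kv_cache` are hypothetical stand-ins for illustration; only the `use_cache` and `gradient_checkpointing` names come from the source.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class TrainingArgsStub:
    """Hypothetical stand-in for the trainer's argument object."""
    gradient_checkpointing: bool = True


def build_model_kwargs(training_args) -> dict:
    """Disable the KV cache whenever gradient checkpointing is on."""
    return {"use_cache": not training_args.gradient_checkpointing}


def restore_kv_cache(model) -> None:
    """Re-enable the KV cache after training, for fast inference."""
    model.config.use_cache = True


# Training time: checkpointing on, so the cache is off.
kwargs = build_model_kwargs(TrainingArgsStub(gradient_checkpointing=True))

# After training: a stand-in model object gets its cache switched back on.
model = SimpleNamespace(config=SimpleNamespace(use_cache=kwargs["use_cache"]))
restore_kv_cache(model)
```

Keeping the two steps in separate helpers makes it harder to forget the restore step before saving or serving the model.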

Accelerator state corruption: A comment in the callback code warns that calling dataclasses.replace(args, ...) or instantiating a new SFTConfig breaks the Accelerate distributed state. The workaround is a simple DummyConfig class that carries only the attributes the downstream code needs.
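
A minimal sketch of this workaround follows. The attribute names (`hub_model_id`, `hub_model_revision`), the `snapshot_config` helper, and the revision naming are assumptions chosen for illustration; the source only states that a plain `DummyConfig` class is used in place of a copied config.

```python
from types import SimpleNamespace


class DummyConfig:
    """Plain attribute bag: no dataclass machinery, no config construction."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


def snapshot_config(args, fields, **overrides):
    """Copy selected attributes from `args`, then apply overrides.

    Reads attributes only, so the original `args` object is left untouched.
    """
    values = {name: getattr(args, name) for name in fields}
    values.update(overrides)
    return DummyConfig(**values)


# Stand-in for the real training arguments (hypothetical values).
args = SimpleNamespace(hub_model_id="org/model", hub_model_revision="main")

# Build a per-checkpoint view instead of calling dataclasses.replace(args, ...).
step_config = snapshot_config(
    args,
    fields=["hub_model_id", "hub_model_revision"],
    hub_model_revision="main-step-000100",
)
```

Because `DummyConfig` is an ordinary object, creating it never goes through the real config class, which is the step the source comment identifies as breaking the distributed state.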

Liger kernel: The SFT recipe enables use_liger_kernel, which swaps in kernels from Liger Kernel, a library of Triton kernels that reduces memory usage and improves throughput for common transformer operations.

EOS token alignment: After training, trainer.model.generation_config.eos_token_id must be explicitly aligned with tokenizer.eos_token_id to avoid unbounded generation when using the transformers pipeline() function.
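
The alignment step can be illustrated with stand-in objects. In this sketch the token ids are made-up values and `align_eos` is a hypothetical helper; only the two attribute paths come from the source.

```python
from types import SimpleNamespace


def align_eos(model, tokenizer) -> None:
    """Make generation stop on the same EOS token the tokenizer emits."""
    model.generation_config.eos_token_id = tokenizer.eos_token_id


# Stand-ins with a deliberate mismatch (illustrative ids, not real ones).
tokenizer = SimpleNamespace(eos_token_id=7)
model = SimpleNamespace(generation_config=SimpleNamespace(eos_token_id=2))

align_eos(model, tokenizer)
```

Without this step, a generation loop that watches for the model's original EOS id would never see the id the fine-tuned model was trained to emit, and would run until it hits its length limit.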

Code Evidence

KV cache disabled during training from src/open_r1/utils/model_utils.py (via get_model):

model_kwargs["use_cache"] = False if training_args.gradient_checkpointing else True

Accelerator state corruption warning from src/open_r1/utils/callbacks.py:57-58:

# WARNING: if you use dataclasses.replace(args, ...) the accelerator dist state will be broken, so I do this workaround
# Also if you instantiate a new SFTConfig, the accelerator dist state will be broken

EOS token alignment from src/open_r1/grpo.py:145:

# Align the model's generation config with the tokenizer's eos token
# to avoid unbounded generation in the transformers `pipeline()` function
trainer.model.generation_config.eos_token_id = tokenizer.eos_token_id

KV cache re-enabled post-training from src/open_r1/grpo.py:157:

# Restore k,v cache for fast inference
trainer.model.config.use_cache = True

FSDP activation checkpointing disabled from recipes/accelerate_configs/fsdp.yaml:7:

fsdp_activation_checkpointing: false # Need fix from: https://github.com/huggingface/transformers/pull/36610
