# Heuristic: Hugging Face Alignment Handbook Global Batch Size Scaling
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Distributed_Training |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
When scaling up or down GPUs, maintain a constant global batch size by adjusting per-device batch size or gradient accumulation steps to preserve training dynamics.
## Description
The alignment-handbook documentation explicitly recommends keeping the global batch size (GBS) constant when changing the number of GPUs: `GBS = num_gpus x per_device_train_batch_size x gradient_accumulation_steps`. If you halve the number of GPUs, double either the per-device batch size or the gradient accumulation steps. This preserves the effective learning-rate scaling and ensures reproducibility.
The recipes demonstrate this pattern: QLoRA configs use a smaller `per_device_train_batch_size` (4) with higher `gradient_accumulation_steps` (2-4), compared to full fine-tuning configs (batch size 16, gradient accumulation 1).
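The relationship can be sketched as a small helper that solves for `gradient_accumulation_steps` when the GPU count changes (a hypothetical function for illustration, not part of the alignment-handbook codebase):

```python
def required_grad_accum(target_gbs: int, num_gpus: int, per_device_bs: int) -> int:
    """Gradient accumulation steps needed to keep the global batch size constant."""
    denom = num_gpus * per_device_bs
    if target_gbs % denom != 0:
        # No integer accumulation count works; adjust per_device_bs instead.
        raise ValueError(f"GBS {target_gbs} is not divisible by {denom}")
    return target_gbs // denom

# Zephyr full SFT recipe: 8 GPUs x 16 per-device x 1 accum = GBS 128.
# Halving to 4 GPUs keeps GBS = 128 by doubling accumulation:
print(required_grad_accum(128, num_gpus=4, per_device_bs=16))  # -> 2
```

If the target GBS is not divisible by `num_gpus * per_device_bs`, no integer accumulation count exists and the per-device batch size must change instead.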
## Usage
Apply this whenever you run a recipe with a different number of GPUs than it specifies. Also apply it when switching between full and QLoRA fine-tuning, where the per-device batch size must decrease due to memory constraints.
## The Insight (Rule of Thumb)
- Action: Keep `global_batch_size = num_gpus x per_device_train_batch_size x gradient_accumulation_steps` constant when scaling hardware.
- Value: Example: 8 GPUs x 16 batch x 1 accum = 128 GBS. If using 4 GPUs: 4 x 16 x 2 = 128.
- Trade-off: Higher gradient accumulation steps slow iteration but preserve training dynamics.
## Reasoning
The learning rate, warmup schedule, and optimizer momentum are calibrated for a specific global batch size. Changing GBS without adjusting the learning rate or schedule can lead to divergence or poor convergence. The scripts README makes this recommendation explicit.
From `scripts/README.md:47`:
> 💡 Tip: If you scale up/down the number of GPUs, we recommend also scaling
> up the per-device batch size or number of gradient accumulation steps to
> keep the global batch size constant (and thus replicate our results).
Full SFT config (8 GPUs, GBS=128) from `recipes/zephyr-7b-beta/sft/config_full.yaml:31,48`:
```yaml
gradient_accumulation_steps: 1
per_device_train_batch_size: 16
# GBS = 8 x 16 x 1 = 128
```
QLoRA SFT config (1 GPU, GBS=8) from `recipes/zephyr-7b-beta/sft/config_qlora.yaml:46-47`:
```yaml
gradient_accumulation_steps: 2
per_device_train_batch_size: 4
# GBS = 1 x 4 x 2 = 8
```
SmolLM3 config with comment (8 nodes, GBS=128) from `recipes/smollm3/sft/sft.yaml:1,198,218`:
```yaml
# Config for 8 nodes with GBS 128
gradient_accumulation_steps: 2
per_device_train_batch_size: 1
# GBS = 8 nodes x 8 GPUs x 1 x 2 = 128
```
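As a sanity check, the three recipe configs above can be recomputed with a one-line formula (device counts taken from the recipe comments; the 8-GPUs-per-node figure for SmolLM3 comes from the config's own GBS arithmetic):

```python
# Recompute the GBS of each recipe config quoted above.
def global_batch_size(num_devices: int, per_device_bs: int, grad_accum: int) -> int:
    return num_devices * per_device_bs * grad_accum

print(global_batch_size(8, 16, 1))     # Zephyr full SFT (8 GPUs): 128
print(global_batch_size(1, 4, 2))      # Zephyr QLoRA SFT (1 GPU): 8
print(global_batch_size(8 * 8, 1, 2))  # SmolLM3 (8 nodes x 8 GPUs): 128
```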