
Heuristic:Huggingface Alignment handbook Global Batch Size Scaling

From Leeroopedia




Knowledge Sources
Domains Optimization, Distributed_Training
Last Updated 2026-02-07 00:00 GMT

Overview

When scaling up or down GPUs, maintain a constant global batch size by adjusting per-device batch size or gradient accumulation steps to preserve training dynamics.

Description

The alignment-handbook documentation explicitly recommends keeping the global batch size (GBS) constant when changing the number of GPUs: `GBS = num_gpus x per_device_train_batch_size x gradient_accumulation_steps`. If you halve the number of GPUs, double either the per-device batch size or the gradient accumulation steps. This keeps the learning rate correctly calibrated for the batch size and makes results reproducible across hardware setups.

The recipes demonstrate this pattern: QLoRA configs use smaller per_device_train_batch_size (4) with higher gradient_accumulation_steps (2-4) compared to full fine-tuning configs (batch_size=16, grad_accum=1).
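As a sketch, the rebalancing arithmetic can be captured in a small helper that solves the GBS formula for the gradient accumulation steps. The function name and error handling here are illustrative, not part of the handbook:

```python
def grad_accum_for_target_gbs(target_gbs, num_gpus, per_device_batch_size):
    """Solve GBS = num_gpus * per_device_batch_size * grad_accum for grad_accum."""
    denom = num_gpus * per_device_batch_size
    if target_gbs % denom != 0:
        raise ValueError(
            f"GBS {target_gbs} is not divisible by "
            f"num_gpus * per_device_batch_size = {denom}"
        )
    return target_gbs // denom

# Recipe default: 8 GPUs x 16 batch x 1 accum = 128 GBS.
# Halving to 4 GPUs while keeping batch 16 requires doubling accumulation:
print(grad_accum_for_target_gbs(128, 4, 16))  # -> 2
```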

Usage

Apply this whenever changing the number of GPUs from what is specified in the recipe. Also apply when switching between full and QLoRA fine-tuning, where per-device batch size must decrease due to memory constraints.
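The full-to-QLoRA switch follows the same arithmetic. As a hypothetical illustration (the handbook's QLoRA recipe actually targets a single GPU with GBS 8, so the 8-GPU pairing below is an assumption for the sake of the example):

```python
# Full fine-tuning recipe: 8 GPUs x 16 per-device batch x 1 accum = 128 GBS.
full_gbs = 8 * 16 * 1

# QLoRA on the same 8 GPUs (assumed here) forces per-device batch down to 4
# for memory; gradient accumulation must rise to compensate.
qlora_per_device = 4
qlora_accum = full_gbs // (8 * qlora_per_device)
print(qlora_accum)  # -> 4
```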

The Insight (Rule of Thumb)

  • Action: Keep `global_batch_size = num_gpus x per_device_train_batch_size x gradient_accumulation_steps` constant when scaling hardware.
  • Value: Example: 8 GPUs x 16 batch x 1 accum = 128 GBS. If using 4 GPUs: 4 x 16 x 2 = 128.
  • Trade-off: Higher gradient accumulation steps slow down iteration speed but maintain training dynamics.
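The two settings in the example above can be checked directly:

```python
def gbs(num_gpus, per_device, accum):
    """Global batch size as the product of the three knobs."""
    return num_gpus * per_device * accum

# Both hardware configurations yield the same GBS of 128.
assert gbs(8, 16, 1) == 128
assert gbs(4, 16, 2) == 128
```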

Reasoning

The learning rate, warmup schedule, and optimizer momentum are calibrated for a specific global batch size. Changing GBS without adjusting the learning rate or schedule can lead to divergence or poor convergence. The scripts README makes this recommendation explicit.

From `scripts/README.md:47`:

💡 Tip: If you scale up/down the number of GPUs, we recommend also scaling
up the per-device batch size or number of gradient accumulation steps to
keep the global batch size constant (and thus replicate our results).

Full SFT config (8 GPUs, GBS=128) from `recipes/zephyr-7b-beta/sft/config_full.yaml:31,48`:

gradient_accumulation_steps: 1
per_device_train_batch_size: 16
# GBS = 8 x 16 x 1 = 128

QLoRA SFT config (1 GPU, GBS=8) from `recipes/zephyr-7b-beta/sft/config_qlora.yaml:46-47`:

gradient_accumulation_steps: 2
per_device_train_batch_size: 4
# GBS = 1 x 4 x 2 = 8

SmolLM3 config with comment (8 nodes, GBS=128) from `recipes/smollm3/sft/sft.yaml:1,198,218`:

# Config for 8 nodes with GBS 128
gradient_accumulation_steps: 2
per_device_train_batch_size: 1
# GBS = 8 nodes x 8 GPUs x 1 x 2 = 128
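The three recipe excerpts above can be sanity-checked in a few lines (the device counts are taken from the stated hardware, with the SmolLM3 setup flattened to 64 total GPUs):

```python
# (devices, per_device_batch, grad_accum, expected_gbs)
configs = {
    "zephyr sft full (8 GPUs)":    (8, 16, 1, 128),
    "zephyr sft qlora (1 GPU)":    (1, 4, 2, 8),
    "smollm3 sft (8 nodes x 8)":   (64, 1, 2, 128),
}

for name, (devices, batch, accum, expected) in configs.items():
    assert devices * batch * accum == expected, name
```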
