# Heuristic: ContextualAI HALOs Batch Size Divisibility
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Configuration, LLM_Alignment |
| Last Updated | 2026-02-08 03:00 GMT |
## Overview
Batch size must be divisible by `num_processes * gradient_accumulation_steps` to prevent training failures and ensure even data distribution across GPUs.
## Description
The HALOs framework enforces strict divisibility constraints on batch sizes for distributed training with FSDP. The global `batch_size` and `eval_batch_size` must each be evenly divisible by the product of the number of GPU processes and the gradient accumulation steps. This ensures each GPU receives exactly the same number of examples per step, preventing FSDP synchronization hangs from uneven batch sizes. Additionally, `eval_every` must be divisible by `batch_size` (auto-corrected if not). For KTO/GRPO with `UnpairedPreferenceDataLoader`, `microbatch_size * num_processes` must be greater than 1 to ensure a mix of chosen and rejected examples.
## Usage
Apply this heuristic before launching any training job: verify that your batch size configuration satisfies the divisibility constraints for your GPU count and gradient-accumulation settings. The training script raises a `ValueError` immediately if the constraints are violated, so catching a misconfiguration before launch avoids wasted queue time and GPU hours.
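The pre-flight check can be done without launching anything. Below is a minimal sketch of the same constraint as a standalone function; the name `check_batch_config` is illustrative and not part of HALOs:

```python
def check_batch_config(batch_size: int, num_processes: int,
                       gradient_accumulation_steps: int) -> int:
    """Validate the divisibility constraint and return the per-GPU microbatch size."""
    divisor = num_processes * gradient_accumulation_steps
    if batch_size % divisor != 0:
        raise ValueError(
            f"batch_size={batch_size} must be divisible by "
            f"num_processes * gradient_accumulation_steps = {divisor}"
        )
    # integer division: examples per GPU per forward/backward pass
    return batch_size // divisor

# Default config on 4 GPUs
print(check_batch_config(32, 4, 1))  # 8
```

Running this against your planned config before submitting the job reproduces the same failure mode the launcher would hit, just earlier.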
## The Insight (Rule of Thumb)
- Action: Ensure `batch_size % (num_processes * gradient_accumulation_steps) == 0`.
- Value: Default `batch_size=32`, `gradient_accumulation_steps=1`. With 4 GPUs: `32 / (4 * 1) = 8` examples per GPU per step.
- Trade-off: Smaller microbatch sizes (per GPU) reduce memory usage but may hurt training stability. Larger accumulation steps simulate larger batches without extra memory.
- Common configurations:
  - 4 GPUs, `batch_size=32`, `grad_accum=1` -> microbatch = 8
  - 4 GPUs, `batch_size=16`, `grad_accum=1` -> microbatch = 4
  - 8 GPUs, `batch_size=32`, `grad_accum=1` -> microbatch = 4
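The configurations above all follow from `microbatch = batch_size // (num_gpus * grad_accum)`. A quick sketch confirming the listed values, plus one assumed example showing how raising `grad_accum` shrinks the per-GPU microbatch without changing the effective global batch:

```python
# (num_gpus, batch_size, grad_accum) -> expected per-GPU microbatch
configs = [
    (4, 32, 1),  # -> 8
    (4, 16, 1),  # -> 4
    (8, 32, 1),  # -> 4
    (4, 32, 4),  # assumed extra case: grad_accum=4 cuts the microbatch to 2
]
microbatches = [batch // (gpus * accum) for gpus, batch, accum in configs]
print(microbatches)  # [8, 4, 4, 2]
```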
## Reasoning
FSDP requires all processes to participate in each gradient synchronization step. If batch sizes are uneven, some processes will finish their microbatch before others, causing the NCCL collective operations to hang. The framework also discards excess data that cannot fill a complete global batch (`usable_size = len(flat_data) // global_batch_size * global_batch_size`), ensuring clean epoch boundaries. The `eval_every` alignment with `batch_size` prevents attempting to evaluate mid-accumulation.
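The truncation formula above can be illustrated with a toy example; the dataset size of 100 is an assumption for demonstration:

```python
flat_data = list(range(100))   # toy dataset of 100 examples (assumed size)
global_batch_size = 32         # the global batch, spread across all processes
# keep only as many examples as fill complete global batches
usable_size = len(flat_data) // global_batch_size * global_batch_size
print(usable_size)             # 96: the trailing 4 examples are discarded
```

Every process therefore sees the same number of complete batches per epoch, which is what keeps the NCCL collectives in lockstep.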
## Code Evidence
Strict divisibility check in `launch.py:65-68` (note: `//` keeps `microbatch_size` an integer):

```python
if config.model.batch_size % (accelerator.num_processes * config.model.gradient_accumulation_steps) == 0:
    config.model.microbatch_size = config.model.batch_size // (accelerator.num_processes * config.model.gradient_accumulation_steps)
else:
    raise ValueError(f"{config.model.batch_size} needs to be divisible by the number of processes * gradient_accumulation_steps")
```
Auto-correction of `eval_every` in `launch.py:75-78`:

```python
if config.eval_every % config.model.batch_size != 0:
    accelerator.print('WARNING: eval_every must be divisible by batch_size')
    accelerator.print('Setting eval_every to', config.eval_every - config.eval_every % config.model.batch_size)
    config.eval_every = config.eval_every - config.eval_every % config.model.batch_size
```
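For instance, with the default `batch_size=32` and an assumed requested `eval_every=1000`, the auto-correction rounds down to the nearest multiple:

```python
eval_every, batch_size = 1000, 32   # assumed example values
corrected = eval_every - eval_every % batch_size
print(corrected)  # 992, the nearest lower multiple of 32
```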
`UnpairedPreferenceDataLoader` minimum batch constraint in `train/dataloader.py:465-466`:

```python
if self.microbatch_size * self.num_processes <= 1:
    raise ValueError("can't use batch size of 1 with UnpairedPreferenceDataLoader")
```
Data truncation for even distribution in `train/dataloader.py:266-269`:

```python
if self.num_processes == 1:
    usable_size = len(flat_data)
else:
    usable_size = len(flat_data) // self.global_batch_size * self.global_batch_size
```