# Heuristic: ContextualAI HALOs Batch Size Divisibility
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Configuration, LLM_Alignment |
| Last Updated | 2026-02-08 03:00 GMT |
## Overview
Batch size must be divisible by `num_processes * gradient_accumulation_steps` to prevent training failures and ensure even data distribution across GPUs.
## Description
The HALOs framework enforces strict divisibility constraints on batch sizes for distributed training with FSDP. The global `batch_size` and `eval_batch_size` must each be evenly divisible by the product of the number of GPU processes and the gradient accumulation steps. This ensures each GPU receives exactly the same number of examples per step, preventing FSDP synchronization hangs from uneven batch sizes. Additionally, `eval_every` must be divisible by `batch_size` (auto-corrected if not). For KTO/GRPO with `UnpairedPreferenceDataLoader`, `microbatch_size * num_processes` must be greater than 1 to ensure a mix of chosen and rejected examples.
## Usage
Apply this heuristic before launching any training job: verify that your batch size configuration satisfies the divisibility constraints for your GPU count and gradient-accumulation settings. The training script raises a `ValueError` immediately if the constraints are violated, so catching a misconfiguration before launch avoids wasted queue time and GPU hours.
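The pre-flight check can be done without launching anything. Below is a minimal sketch of the same constraint as a standalone function; the name `check_batch_config` is illustrative and not part of HALOs:

```python
def check_batch_config(batch_size: int, num_processes: int,
                       gradient_accumulation_steps: int) -> int:
    """Validate the divisibility constraint and return the per-GPU microbatch size."""
    divisor = num_processes * gradient_accumulation_steps
    if batch_size % divisor != 0:
        raise ValueError(
            f"batch_size={batch_size} must be divisible by "
            f"num_processes * gradient_accumulation_steps = {divisor}"
        )
    # integer division: examples per GPU per forward/backward pass
    return batch_size // divisor

# Default config on 4 GPUs
print(check_batch_config(32, 4, 1))  # 8
```

Running this against your planned config before submitting the job reproduces the same failure mode the launcher would hit, just earlier.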
## The Insight (Rule of Thumb)
- Action: Ensure `batch_size % (num_processes * gradient_accumulation_steps) == 0`.
- Value: Default `batch_size=32`, `gradient_accumulation_steps=1`. With 4 GPUs: `32 / (4 * 1) = 8` examples per GPU per step.
- Trade-off: Smaller microbatch sizes (per GPU) reduce memory usage but may hurt training stability. Larger accumulation steps simulate larger batches without extra memory.
- Common configurations:
  - 4 GPUs, `batch_size=32`, `grad_accum=1` -> microbatch = 8
  - 4 GPUs, `batch_size=16`, `grad_accum=1` -> microbatch = 4
  - 8 GPUs, `batch_size=32`, `grad_accum=1` -> microbatch = 4
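The configurations above all follow from `microbatch = batch_size // (num_gpus * grad_accum)`. A quick sketch confirming the listed values, plus one assumed example showing how raising `grad_accum` shrinks the per-GPU microbatch without changing the effective global batch:

```python
# (num_gpus, batch_size, grad_accum) -> expected per-GPU microbatch
configs = [
    (4, 32, 1),  # -> 8
    (4, 16, 1),  # -> 4
    (8, 32, 1),  # -> 4
    (4, 32, 4),  # assumed extra case: grad_accum=4 cuts the microbatch to 2
]
microbatches = [batch // (gpus * accum) for gpus, batch, accum in configs]
print(microbatches)  # [8, 4, 4, 2]
```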
## Reasoning
FSDP requires all processes to participate in each gradient synchronization step. If batch sizes are uneven, some processes will finish their microbatch before others, causing the NCCL collective operations to hang. The framework also discards excess data that cannot fill a complete global batch (`usable_size = len(flat_data) // global_batch_size * global_batch_size`), ensuring clean epoch boundaries. The `eval_every` alignment with `batch_size` prevents attempting to evaluate mid-accumulation.
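The truncation formula above can be illustrated with a toy example; the dataset size of 100 is an assumption for demonstration:

```python
flat_data = list(range(100))   # toy dataset of 100 examples (assumed size)
global_batch_size = 32         # the global batch, spread across all processes
# keep only as many examples as fill complete global batches
usable_size = len(flat_data) // global_batch_size * global_batch_size
print(usable_size)             # 96: the trailing 4 examples are discarded
```

Every process therefore sees the same number of complete batches per epoch, which is what keeps the NCCL collectives in lockstep.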
## Code Evidence
Strict divisibility check in `launch.py:65-68` (note: `//` keeps `microbatch_size` an integer):

```python
if config.model.batch_size % (accelerator.num_processes * config.model.gradient_accumulation_steps) == 0:
    config.model.microbatch_size = config.model.batch_size // (accelerator.num_processes * config.model.gradient_accumulation_steps)
else:
    raise ValueError(f"{config.model.batch_size} needs to be divisible by the number of processes * gradient_accumulation_steps")
```
Auto-correction of `eval_every` in `launch.py:75-78`:

```python
if config.eval_every % config.model.batch_size != 0:
    accelerator.print('WARNING: eval_every must be divisible by batch_size')
    accelerator.print('Setting eval_every to', config.eval_every - config.eval_every % config.model.batch_size)
    config.eval_every = config.eval_every - config.eval_every % config.model.batch_size
```
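For instance, with the default `batch_size=32` and an assumed requested `eval_every=1000`, the auto-correction rounds down to the nearest multiple:

```python
eval_every, batch_size = 1000, 32   # assumed example values
corrected = eval_every - eval_every % batch_size
print(corrected)  # 992, the nearest lower multiple of 32
```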
`UnpairedPreferenceDataLoader` minimum batch constraint in `train/dataloader.py:465-466`:

```python
if self.microbatch_size * self.num_processes <= 1:
    raise ValueError("can't use batch size of 1 with UnpairedPreferenceDataLoader")
```
Data truncation for even distribution in `train/dataloader.py:266-269`:

```python
if self.num_processes == 1:
    usable_size = len(flat_data)
else:
    usable_size = len(flat_data) // self.global_batch_size * self.global_batch_size
```