# Heuristic: Eric Mitchell's Direct Preference Optimization - FSDP Batch Size Per GPU
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Distributed_Training |
| Last Updated | 2026-02-08 02:00 GMT |
## Overview
Ensure batch size per GPU is at least 2 when using FSDPTrainer to achieve meaningful speedup over BasicTrainer.
## Description
FSDP incurs communication overhead for parameter gathering and gradient synchronization across GPUs. With a batch size per GPU of 1, this overhead can negate the parallelism benefit, making FSDP slower than the simpler BasicTrainer. The effective batch size per GPU is computed as `batch_size // (gradient_accumulation_steps * num_gpus)`. The README explicitly recommends keeping this value at 2 or higher.
## Usage
Apply this heuristic when configuring FSDP training: set the batch size, gradient accumulation steps, and GPU count so that each GPU processes at least 2 examples per microbatch. If you cannot reach 2 examples per GPU within memory limits, enable mixed precision or activation checkpointing first to free VRAM.
## The Insight (Rule of Thumb)
- Action: Verify that `batch_size // (gradient_accumulation_steps * num_gpus) >= 2`.
- Value: Minimum 2 examples per GPU per microbatch.
- Trade-off: Higher batch size per GPU increases VRAM usage but is necessary for FSDP to outperform BasicTrainer. Use mixed precision and activation checkpointing to free memory if needed.
- Compatibility: Only applies to FSDPTrainer. BasicTrainer and TensorParallelTrainer do not have this constraint.
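The rule above can be sketched as a small pre-flight check. The function names here are illustrative, not from the repo; only the formula `batch_size // (grad_accumulation_steps * n_gpus)` comes from the README.

```python
def per_gpu_microbatch(batch_size: int, grad_accumulation_steps: int, n_gpus: int) -> int:
    """Examples each GPU sees per microbatch under FSDP's data-parallel split."""
    return batch_size // (grad_accumulation_steps * n_gpus)

def check_fsdp_batch(batch_size: int, grad_accumulation_steps: int, n_gpus: int,
                     minimum: int = 2) -> int:
    """Warn if the per-GPU microbatch falls below the README's recommended minimum."""
    per_gpu = per_gpu_microbatch(batch_size, grad_accumulation_steps, n_gpus)
    if per_gpu < minimum:
        print(f'WARNING: only {per_gpu} example(s) per GPU per microbatch; '
              f'FSDPTrainer may be slower than BasicTrainer below {minimum}')
    return per_gpu
```

For example, `check_fsdp_batch(32, 2, 4)` returns 4 and stays silent, while `check_fsdp_batch(8, 2, 4)` returns 1 and warns.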
## Reasoning
FSDP shards model parameters, gradients, and optimizer states across GPUs. Each forward/backward step requires all-gather and reduce-scatter communication. With batch_size_per_gpu=1, the compute-to-communication ratio is too low, and the synchronization overhead dominates. With batch_size_per_gpu >= 2, GPUs have enough work per step to amortize the communication cost.
The README states: "In general, you should try to use a batch size of at least 2 on each GPU (i.e., `batch_size // (grad_accumulation_steps * N_GPUS)` is at least 2) to see a speedup from FSDP compared to the `BasicTrainer`."
Reference training configurations from the README:
- SFT (4x A100): `batch_size=64 gradient_accumulation_steps=2` → 64/(2*4) = 8 per GPU
- DPO (4x A100): `batch_size=32 gradient_accumulation_steps=2` → 32/(2*4) = 4 per GPU (note: DPO doubles effective batch through chosen+rejected concatenation)
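The arithmetic for these reference configurations can be verified directly (the config values are from the README; the dict layout is just for illustration):

```python
# Per-GPU microbatch = batch_size // (gradient_accumulation_steps * n_gpus)
configs = {
    'SFT (4x A100)': dict(batch_size=64, gradient_accumulation_steps=2, n_gpus=4),
    'DPO (4x A100)': dict(batch_size=32, gradient_accumulation_steps=2, n_gpus=4),
}
for name, c in configs.items():
    per_gpu = c['batch_size'] // (c['gradient_accumulation_steps'] * c['n_gpus'])
    assert per_gpu >= 2, f'{name} violates the FSDP batch-size heuristic'
    print(f'{name}: {per_gpu} examples per GPU per microbatch')
```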
The eval_every alignment check in `train.py:59-62` also ensures evaluation cadence is compatible with batch size:
```python
if config.eval_every % config.batch_size != 0:
    print('WARNING: eval_every must be divisible by batch_size')
    print('Setting eval_every to', config.eval_every - config.eval_every % config.batch_size)
    config.eval_every = config.eval_every - config.eval_every % config.batch_size
```
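The rounding in that check can be exercised standalone; the helper name and sample values below are illustrative, not from the repo:

```python
def align_eval_every(eval_every: int, batch_size: int) -> int:
    """Round eval_every down to the nearest multiple of batch_size,
    mirroring the adjustment train.py applies."""
    return eval_every - eval_every % batch_size

# e.g. with batch_size=64, an eval_every of 20000 is not a multiple of 64
# and gets rounded down; an already-aligned value is left unchanged.
print(align_eval_every(20000, 64))
print(align_eval_every(19968, 64))
```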