# Heuristic: Eric Mitchell's Direct Preference Optimization - FSDP Batch Size Per GPU
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Distributed_Training |
| Last Updated | 2026-02-08 02:00 GMT |
## Overview
Ensure batch size per GPU is at least 2 when using FSDPTrainer to achieve meaningful speedup over BasicTrainer.
## Description
FSDP incurs communication overhead for parameter gathering and gradient synchronization across GPUs. With a batch size per GPU of 1, this overhead can negate the parallelism benefit, making FSDP slower than the simpler BasicTrainer. The effective batch size per GPU is computed as `batch_size // (gradient_accumulation_steps * num_gpus)`. The README explicitly recommends keeping this value at 2 or higher.
## Usage
Apply this heuristic when configuring FSDP training: set the batch size, gradient accumulation steps, and GPU count so that each GPU processes at least 2 examples per microbatch. If you cannot reach 2 examples per GPU within memory limits, enable mixed precision or activation checkpointing first to free VRAM.
## The Insight (Rule of Thumb)
- Action: Verify that `batch_size // (gradient_accumulation_steps * num_gpus) >= 2`.
- Value: Minimum 2 examples per GPU per microbatch.
- Trade-off: Higher batch size per GPU increases VRAM usage but is necessary for FSDP to outperform BasicTrainer. Use mixed precision and activation checkpointing to free memory if needed.
- Compatibility: Only applies to FSDPTrainer. BasicTrainer and TensorParallelTrainer do not have this constraint.
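The rule above can be sketched as a small pre-flight check. The function names here are illustrative, not from the repo; only the formula `batch_size // (grad_accumulation_steps * n_gpus)` comes from the README.

```python
def per_gpu_microbatch(batch_size: int, grad_accumulation_steps: int, n_gpus: int) -> int:
    """Examples each GPU sees per microbatch under FSDP's data-parallel split."""
    return batch_size // (grad_accumulation_steps * n_gpus)

def check_fsdp_batch(batch_size: int, grad_accumulation_steps: int, n_gpus: int,
                     minimum: int = 2) -> int:
    """Warn if the per-GPU microbatch falls below the README's recommended minimum."""
    per_gpu = per_gpu_microbatch(batch_size, grad_accumulation_steps, n_gpus)
    if per_gpu < minimum:
        print(f'WARNING: only {per_gpu} example(s) per GPU per microbatch; '
              f'FSDPTrainer may be slower than BasicTrainer below {minimum}')
    return per_gpu
```

For example, `check_fsdp_batch(32, 2, 4)` returns 4 and stays silent, while `check_fsdp_batch(8, 2, 4)` returns 1 and warns.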
## Reasoning
FSDP shards model parameters, gradients, and optimizer states across GPUs. Each forward/backward step requires all-gather and reduce-scatter communication. With batch_size_per_gpu=1, the compute-to-communication ratio is too low, and the synchronization overhead dominates. With batch_size_per_gpu >= 2, GPUs have enough work per step to amortize the communication cost.
The README states: "In general, you should try to use a batch size of at least 2 on each GPU (i.e., `batch_size // (grad_accumulation_steps * N_GPUS)` is at least 2) to see a speedup from FSDP compared to the `BasicTrainer`."
Reference training configurations from the README:
- SFT (4x A100): `batch_size=64 gradient_accumulation_steps=2` → 64/(2*4) = 8 per GPU
- DPO (4x A100): `batch_size=32 gradient_accumulation_steps=2` → 32/(2*4) = 4 per GPU (note: DPO doubles effective batch through chosen+rejected concatenation)
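The arithmetic for these reference configurations can be verified directly (the config values are from the README; the dict layout is just for illustration):

```python
# Per-GPU microbatch = batch_size // (gradient_accumulation_steps * n_gpus)
configs = {
    'SFT (4x A100)': dict(batch_size=64, gradient_accumulation_steps=2, n_gpus=4),
    'DPO (4x A100)': dict(batch_size=32, gradient_accumulation_steps=2, n_gpus=4),
}
for name, c in configs.items():
    per_gpu = c['batch_size'] // (c['gradient_accumulation_steps'] * c['n_gpus'])
    assert per_gpu >= 2, f'{name} violates the FSDP batch-size heuristic'
    print(f'{name}: {per_gpu} examples per GPU per microbatch')
```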
The eval_every alignment check in `train.py:59-62` also ensures evaluation cadence is compatible with batch size:
```python
if config.eval_every % config.batch_size != 0:
    print('WARNING: eval_every must be divisible by batch_size')
    print('Setting eval_every to', config.eval_every - config.eval_every % config.batch_size)
    config.eval_every = config.eval_every - config.eval_every % config.batch_size
```
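The rounding in that check can be exercised standalone; the helper name and sample values below are illustrative, not from the repo:

```python
def align_eval_every(eval_every: int, batch_size: int) -> int:
    """Round eval_every down to the nearest multiple of batch_size,
    mirroring the adjustment train.py applies."""
    return eval_every - eval_every % batch_size

# e.g. with batch_size=64, an eval_every of 20000 is not a multiple of 64
# and gets rounded down; an already-aligned value is left unchanged.
print(align_eval_every(20000, 64))
print(align_eval_every(19968, 64))
```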