# Heuristic: Alibaba ROLL Dynamic Batching Token Limits
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Distributed_Training, Performance |
| Last Updated | 2026-02-07 19:00 GMT |
## Overview
Token-based micro-batch sizing that eliminates padding waste by grouping similar-length sequences, with a recommended limit formula of `sequence_length * 2 * micro_batch_size`.
## Description
Dynamic batching in ROLL replaces fixed-sequence-length batching with token-based batching that groups sequences of similar lengths to minimize padding tokens. Instead of padding all sequences to `max_seq_len`, the framework partitions rollout data across DP ranks and micro-batches based on actual token counts. This can reduce total token computation by approximately 30%. The key configuration parameter, `max_tokens_per_microbatch_in_train`, controls the maximum number of tokens per micro-batch. The sequence-length rounding value must be divisible by `TP * CP` (the product of the tensor-parallel and context-parallel sizes) to maintain parallelism alignment.
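The grouping idea can be illustrated with a minimal sketch (this is not ROLL's actual implementation): sort sequences by length, then greedily fill each micro-batch while its padded cost, the longest sequence in the group times the group size, stays within the token budget.

```python
# Sketch of token-budget micro-batching (illustrative, not ROLL's code).
# Sorting by length keeps similar-length sequences together, so the
# padded cost of each group (max length * group size) stays near the
# sum of the actual lengths, i.e. padding is close to zero.

def pack_microbatches(seq_lens, max_tokens):
    """Group sequence lengths into micro-batches whose padded token
    count (longest sequence * batch size) never exceeds max_tokens."""
    batches, current = [], []
    for n in sorted(seq_lens):
        # Padded cost if this (now longest) sequence joins the batch.
        if current and n * (len(current) + 1) > max_tokens:
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches
```

With an 8192-token budget, short responses pack together into one micro-batch while a single long response lands in its own, so neither group pads far beyond its actual content.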
## Usage
Enable dynamic batching when training on data with high sequence length variance (e.g., mixed math/code/reasoning tasks) to reduce wasted computation on padding. This is especially effective when responses range from tens to thousands of tokens. Requires the `max_tokens_per_microbatch_in_train` config to be explicitly set.
## The Insight (Rule of Thumb)
- Action: Set `max_tokens_per_microbatch_in_train = sequence_length * 2 * micro_batch_size`.
- Value: Typical values: 8192 for training, 16384 for inference. Rounding: 128 or 64 (must be divisible by TP * CP).
- Trade-off: Higher token limits allow more sequences per micro-batch (better GPU utilization) but increase peak memory. Lower limits reduce memory but may under-utilize compute.
- Prerequisite: Enable Flash Attention with `NVTE_FLASH_ATTN: '1'` for best performance with Context Parallel.
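The rule of thumb and the divisibility constraint above can be expressed directly; the function names below are illustrative, not part of ROLL's API.

```python
# Sketch: computing the recommended token limit and checking that a
# sequence-length rounding value is aligned to parallelism.
# (Helper names are hypothetical, not ROLL's actual API.)

def recommended_max_tokens(sequence_length, micro_batch_size):
    # Rule of thumb from this heuristic:
    # sequence_length * 2 * micro_batch_size.
    return sequence_length * 2 * micro_batch_size

def valid_rounding(rounding, tp, cp):
    # The sequence-length rounding must be divisible by TP * CP.
    return rounding % (tp * cp) == 0

# Example: 2048-token sequences with micro-batch size 2 give an
# 8192-token budget, matching the typical training value above.
```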
## Reasoning
In LLM RL training, response lengths vary enormously. A batch with one 50-token response and one 4000-token response would waste 3950 tokens of padding compute per sequence under fixed batching. Dynamic batching groups similar-length sequences together, reducing padding to near-zero. The recommended formula `sequence_length * 2 * micro_batch_size` ensures the token budget is large enough to accommodate sequence length variance while preventing out-of-memory errors.
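The arithmetic behind that example is simple enough to write out:

```python
# Worked arithmetic for the example above: one 50-token and one
# 4000-token response placed in the same fixed-length batch.
batch_lens = [50, 4000]
padded = max(batch_lens) * len(batch_lens)  # every sequence padded to 4000
actual = sum(batch_lens)                    # tokens actually needed
waste = padded - actual                     # padding tokens computed for nothing
```

Under fixed batching the pair costs 8000 token slots for 4050 real tokens; the 3950 wasted slots are exactly the padding on the short response. Dynamic batching avoids pairing such mismatched lengths in the first place.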
Configuration from `roll/configs/worker_config.py:142-148`:

```python
max_tokens_per_microbatch_in_train: int = field(
    default=0,
    metadata={
        "help": (
            "This config must be set when using dynamic batching. "
            "Recommended value: sequence_length * 2 * micro_batch_size."
        )
    },
)
```