
Heuristic:Alibaba ROLL Dynamic Batching Token Limits

From Leeroopedia



Knowledge Sources
Domains Optimization, Distributed_Training, Performance
Last Updated 2026-02-07 19:00 GMT

Overview

Token-based micro-batch sizing that eliminates padding waste by grouping similar-length sequences, with a recommended limit formula of `sequence_length * 2 * micro_batch_size`.

Description

Dynamic batching in ROLL replaces fixed-sequence-length batching with token-based batching that groups sequences of similar lengths to minimize padding tokens. Instead of padding all sequences to `max_seq_len`, the framework partitions rollout data across DP ranks and micro-batches based on actual token counts. This can reduce total token computation by approximately 30%. The key configuration parameter `max_tokens_per_microbatch_in_train` controls the maximum number of tokens per micro-batch. The sequence-length rounding granularity must be divisible by `TP * CP` to keep sequence shards aligned across tensor and context parallel ranks.
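
The grouping idea can be sketched as a greedy packer. This is an illustrative sketch only, not ROLL's actual implementation (which also partitions data across DP ranks); it sorts sequences by length so each micro-batch holds similar-length items, then packs until the padded token count would exceed the budget:

```python
def make_microbatches(seq_lens, max_tokens):
    """Group sequence lengths into micro-batches whose padded token count
    (batch size * longest sequence in the batch) stays under max_tokens.

    Illustrative sketch of token-based batching; not ROLL's actual code.
    """
    batches, current = [], []
    for length in sorted(seq_lens, reverse=True):
        # Sorted descending, so the first element of a batch is its longest.
        longest = current[0] if current else length
        if current and (len(current) + 1) * longest > max_tokens:
            batches.append(current)
            current = []
        current.append(length)
    if current:
        batches.append(current)
    return batches

# Mixed lengths: the two long responses share one micro-batch, the two
# short ones share another, so almost no padding tokens are computed.
print(make_microbatches([50, 4000, 3800, 60], max_tokens=8192))
```

Under fixed batching the same four sequences would all be padded to 4,000 tokens; here the only padding is the 200 tokens needed to bring the 3,800-token response up to 4,000.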

Usage

Enable dynamic batching when training on data with high sequence length variance (e.g., mixed math/code/reasoning tasks) to reduce wasted computation on padding. It is especially effective when responses range from tens to thousands of tokens. Dynamic batching requires `max_tokens_per_microbatch_in_train` to be set explicitly; it defaults to 0.

The Insight (Rule of Thumb)

  • Action: Set `max_tokens_per_microbatch_in_train = sequence_length * 2 * micro_batch_size`.
  • Value: Typical values: 8192 for training, 16384 for inference. Rounding: 128 or 64 (must be divisible by TP * CP).
  • Trade-off: Higher token limits allow more sequences per micro-batch (better GPU utilization) but increase peak memory. Lower limits reduce memory but may under-utilize compute.
  • Prerequisite: Enable Flash Attention with `NVTE_FLASH_ATTN: '1'` for best performance with Context Parallel.
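
The rule of thumb and the rounding constraint above can be expressed in a few lines. The helper names here are hypothetical, not part of ROLL's API:

```python
def recommended_token_limit(sequence_length, micro_batch_size):
    """Rule of thumb from this page: sequence_length * 2 * micro_batch_size."""
    return sequence_length * 2 * micro_batch_size

def round_seq_len(seq_len, tp, cp, rounding=128):
    """Round a sequence length up to a granularity divisible by TP * CP."""
    assert rounding % (tp * cp) == 0, "rounding must be divisible by TP * CP"
    return ((seq_len + rounding - 1) // rounding) * rounding

# With sequence_length=2048 and micro_batch_size=2, the rule of thumb
# yields 8192, matching the typical training value quoted above.
print(recommended_token_limit(2048, 2))
print(round_seq_len(1000, tp=2, cp=1, rounding=128))
```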

Reasoning

In LLM RL training, response lengths vary enormously. Under fixed batching, a batch containing one 50-token response and one 4000-token response pads both to 4,000 tokens, wasting 3,950 padding tokens of compute on the short sequence. Dynamic batching groups similar-length sequences together, reducing padding to near zero. The recommended formula `sequence_length * 2 * micro_batch_size` makes the token budget large enough to absorb sequence length variance while preventing out-of-memory errors.
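
A back-of-envelope check of the padding-waste figure in the paragraph above:

```python
# Fixed batching pads every sequence in the batch to the longest one.
short, long = 50, 4000
fixed_cost = 2 * max(short, long)  # both padded to 4000 -> 8000 tokens
useful = short + long              # 4050 tokens of real content
waste = fixed_cost - useful        # 3950 padding tokens, all on the short seq
print(waste)
```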

Configuration from `roll/configs/worker_config.py:142-148`:

max_tokens_per_microbatch_in_train: int = field(
    default=0,
    metadata={
        "help": (
            "This config must be set when using dynamic batching. "
            "Recommended value: sequence_length * 2 * micro_batch_size."
        )
    },
)
