Heuristic: Volcengine verl Sequence Length Balancing
Metadata:
- Sources: Repo|verl|https://github.com/volcengine/verl
- Domains: Optimization, Distributed_Training
- Last Updated: 2026-02-07 17:00 GMT
Overview
Use the Karmarkar-Karp algorithm to partition variable-length sequences into balanced groups across data-parallel ranks, minimizing GPU idle time caused by stragglers.
Description
In RL training with variable-length sequences, naive partitioning creates stragglers: some GPUs receive much longer sequences while the rest sit idle waiting for them. verl implements the Karmarkar-Karp Largest Differencing Method to partition sequences into balanced workload groups. The workload of a sequence is estimated as 24576 * seqlen + seqlen², calibrated for 7B-class models (hidden_size=4096).
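The Largest Differencing Method can be sketched as follows. This is an illustration of the general k-way algorithm, not verl's actual implementation, and the function name kk_partition is hypothetical: each state holds k buckets of (sum, indices); the two states with the largest max-min spread are repeatedly merged, pairing large buckets with small ones so their differences cancel.

```python
import heapq

def kk_partition(workloads, k):
    # Sketch of the Karmarkar-Karp Largest Differencing Method for k-way
    # partitioning (illustration only, not verl's implementation).
    # Heap key: negated spread (max bucket sum - min bucket sum), so the
    # states with the largest spreads are merged first.
    heap = []
    for i, w in enumerate(workloads):
        buckets = [(w, [i])] + [(0, []) for _ in range(k - 1)]
        heapq.heappush(heap, (-w, i, buckets))  # a singleton's spread is w
    tie = len(workloads)                        # unique tie-breaker
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)           # two states with the
        _, _, b = heapq.heappop(heap)           # largest spreads
        # Pair a's largest bucket with b's smallest so differences cancel.
        merged = sorted(
            ((sa + sb, ia + ib) for (sa, ia), (sb, ib) in zip(a, reversed(b))),
            key=lambda x: -x[0],
        )
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, tie, merged))
        tie += 1
    return [indices for _, indices in heap[0][2]]

# Example: six workloads split across 2 ranks end up with sums 23 and 22.
groups = kk_partition([10, 9, 8, 7, 6, 5], k=2)
```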
Usage
Enable seq_balance mode in the training configuration. Most beneficial when sequence lengths within a batch have high variance.
The Insight
- Action: Enable sequence balancing via configuration
- Value: Workload formula: 24576 * seqlen + seqlen² (calibrated for hidden_size=4096)
- Trade-off: Adds overhead for the partitioning calculation but significantly reduces GPU idle time
- Additional tip: Place smaller micro-batches at both ends of pipeline to reduce warm-up/cool-down bubbles
Reasoning
Transformer attention FLOPs scale as 12 * hidden_size² * seqlen + 2 * hidden_size * seqlen². The quadratic term means longer sequences are disproportionately expensive. Dividing by the constant factor 2 * hidden_size gives the relative workload 6 * hidden_size * seqlen + seqlen²; with hidden_size = 4096, the linear coefficient is 6 * 4096 = 24576, matching the workload formula above. The Karmarkar-Karp algorithm produces near-optimal balanced partitions. Additionally, placing smaller micro-batches at the pipeline's ends reduces bubble overhead.
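As a quick arithmetic check of that calibration:

```python
hidden_size = 4096  # 7B-class model

# FLOPs ∝ 12 * h² * s + 2 * h * s²; dividing by the constant 2 * h
# leaves the relative workload 6 * h * s + s².
linear_coeff = (12 * hidden_size**2) // (2 * hidden_size)
print(linear_coeff)  # 24576, the constant in verl's workload formula
```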
Code Evidence
From verl/utils/seqlen_balancing.py:27-46:
def calculate_workload(seqlen_list: torch.Tensor) -> torch.Tensor:
    """workload ∝ 24576 * seqlen + seqlen²"""
    return 24576 * seqlen_list + seqlen_list**2
And from verl/utils/seqlen_balancing.py:406-416 (micro-batch placement):
# Place smaller micro-batches at both ends to reduce the bubbles
# exposed during the warm-up and cool-down.
micro_bsz_idx = micro_bsz_idx[::2][::-1] + micro_bsz_idx[1::2]
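The slicing trick can be traced on a toy list. The descending-workload sort order of micro_bsz_idx is an assumption here, made so the example matches the stated goal of small batches at both ends:

```python
# Hypothetical micro-batch workloads, assumed sorted in descending order.
sizes = [60, 50, 40, 30, 20, 10]
micro_bsz_idx = list(range(len(sizes)))  # [0, 1, 2, 3, 4, 5]

# Even positions reversed, then odd positions: the largest batches land
# in the middle and the smallest at the two ends.
order = micro_bsz_idx[::2][::-1] + micro_bsz_idx[1::2]
print([sizes[i] for i in order])  # [20, 40, 60, 50, 30, 10]
```

The resulting schedule ramps up from a small batch, peaks in the middle, and ramps back down, which shrinks the warm-up and cool-down bubbles at the pipeline's ends.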