Heuristic: Alibaba ROLL Sequence Packing Alignment
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Performance, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
A sequence packing alignment constraint: packed sequence lengths must be multiples of `2 * CP_SIZE * TP_SIZE`, with the Karmarkar-Karp algorithm used for load-balanced bin packing.
Description
ROLL's sequence packing feature concatenates variable-length sequences to eliminate padding tokens, significantly improving compute efficiency. However, packed sequence lengths must satisfy a strict alignment constraint: they must be multiples of `2 * CP_SIZE * TP_SIZE`. The factor of 2 is essential for Context Parallelism (CP) load balancing under causal attention, where the sequence is split into `2 * CP` chunks. The Karmarkar-Karp algorithm (largest differencing method) performs load-balanced bin packing, distributing sequences across micro-batches to minimize the maximum packed length. Sequence packing is only supported with the Megatron strategy and requires Flash Attention.
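The alignment rule itself is simple to express. Below is a minimal sketch; the function names are illustrative, not ROLL's API, and the shortfall is assumed to be covered by pad tokens appended to the pack:

```python
def packed_length_alignment(cp_size: int, tp_size: int) -> int:
    """Required alignment for packed sequences: 2 * CP_SIZE * TP_SIZE."""
    return 2 * cp_size * tp_size

def round_up_packed_length(length: int, cp_size: int, tp_size: int) -> int:
    """Round a packed length up to the nearest valid multiple of the alignment."""
    align = packed_length_alignment(cp_size, tp_size)
    return ((length + align - 1) // align) * align

# With CP=2 and TP=4, the alignment is 16: a 5000-token pack rounds up to 5008.
```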
Usage
Enable sequence packing when training on data with highly variable sequence lengths and using the Megatron backend. Set `algorithm: load_balance` for optimal packing. Configure `max_packed_sequence_length_train` and `max_packed_sequence_length_forward` (typically 8192). Ensure `sequence_length_round_in_train` is divisible by `TP * CP`.
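A configuration fragment along these lines may help orient the reader; the keys come from the text above, but the flat layout and the `use_sequence_packing` toggle name are assumptions, so consult ROLL's config schema for the exact structure:

```yaml
# Illustrative fragment; nesting and toggle name are assumptions.
strategy: megatron_strategy            # sequence packing requires Megatron
use_sequence_packing: true             # hypothetical toggle name
algorithm: load_balance                # Karmarkar-Karp load-balanced packing
max_packed_sequence_length_train: 8192
max_packed_sequence_length_forward: 8192
sequence_length_round_in_train: 128    # must be divisible by TP * CP
```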
The Insight (Rule of Thumb)
- Action: Set alignment to `2 * CP_SIZE * TP_SIZE`. Use `algorithm: load_balance` (Karmarkar-Karp).
- Value: Typical `max_packed_sequence_length`: 8192. Typical rounding: 128 or 64.
- Trade-off: Sequence packing improves throughput by 20-40% but adds packing computation overhead and requires Megatron backend.
- Constraint: Only works with `megatron_strategy`. Not compatible with DeepSpeed or FSDP2.
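The `load_balance` algorithm named above can be sketched as follows. This is a minimal k-way Karmarkar-Karp (largest differencing) partitioner, not ROLL's implementation; it illustrates how sequence lengths are spread across micro-batches so that the maximum packed length is minimized:

```python
import heapq

def karmarkar_karp(sizes, num_bins):
    """Partition `sizes` into `num_bins` bins with near-equal total load.

    Largest differencing method: repeatedly merge the two partial
    partitions with the largest spread (max load - min load), pairing
    the lightest bins of one with the heaviest bins of the other.
    """
    heap = []
    for i, s in enumerate(sizes):
        # Each item starts as its own partition: one loaded bin, rest empty,
        # kept sorted ascending by load.
        bins = [(0, ())] * (num_bins - 1) + [(s, (s,))]
        heap.append((-s, i, bins))
    heapq.heapify(heap)
    tie = len(sizes)  # unique tiebreaker so bins are never compared
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        # Pair a's lightest bin with b's heaviest bin, and so on.
        merged = sorted(
            (la + lb, ia + ib) for (la, ia), (lb, ib) in zip(a, reversed(b))
        )
        spread = merged[-1][0] - merged[0][0]
        heapq.heappush(heap, (-spread, tie, merged))
        tie += 1
    return [list(items) for _, items in heap[0][2]]

# Example: packing five sequence lengths into 2 micro-batches.
bins = karmarkar_karp([8, 7, 6, 5, 4], 2)
```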
Reasoning
The alignment constraint `2 * CP * TP` ensures that when a packed sequence is distributed across Tensor Parallel and Context Parallel ranks, each rank receives an equal portion. Without this alignment, some ranks would receive more tokens than others, causing synchronization stalls and load imbalance. The factor of 2 specifically accounts for causal attention's asymmetric workload distribution in Context Parallelism, where even-odd chunk pairs must be balanced. The Karmarkar-Karp algorithm is chosen over first-fit-decreasing because it produces more balanced bins when items have diverse sizes.
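The even-odd pairing can be made concrete with the common "zigzag" chunk assignment used for causal CP load balancing (an illustrative sketch of the scheme, not ROLL's exact code): the sequence is split into `2 * CP` chunks, and rank `r` gets chunks `r` and `2*CP - 1 - r`, so a cheap early chunk and an expensive late chunk always land on the same rank:

```python
def zigzag_assignment(cp_size):
    """Rank r gets chunks r and 2*CP-1-r out of 2*CP causal-attention chunks."""
    return {r: (r, 2 * cp_size - 1 - r) for r in range(cp_size)}

def causal_cost(chunk_idx):
    # Under causal attention, chunk i attends to chunks 0..i, so its
    # workload grows roughly linearly with its position: cost ~ i + 1.
    return chunk_idx + 1

# Every rank carries the same relative cost, 2*CP + 1, so no rank stalls.
for cp in (2, 4):
    per_rank = [causal_cost(a) + causal_cost(b)
                for a, b in zigzag_assignment(cp).values()]
    assert all(c == 2 * cp + 1 for c in per_rank)
```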
Source References
- Documentation note on the factor of 2: "Factor of 2 is essential for CP load balancing under causal attention"; the alignment must satisfy `2 * CP_SIZE * TP_SIZE`.
- Ulysses attention head constraint (`roll/utils/context_parallel/ulysses_attention.py:279`): `assert (...), "Ulysses require num_key_value_heads to be dividable by ulysses_size."`
- Flash Attention deterministic mode (`roll/utils/context_parallel/ulysses_attention.py:114`): `deterministic = os.environ.get("FLASH_ATTENTION_DETERMINISTIC", "0") == "1"`