Heuristic: Alibaba ROLL Sequence Packing Alignment
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Performance, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
A sequence packing alignment constraint: packed sequence lengths must be multiples of `2 * CP_SIZE * TP_SIZE`, with the Karmarkar-Karp algorithm used for load-balanced bin packing.
Description
ROLL's sequence packing feature concatenates variable-length sequences to eliminate padding tokens, significantly improving compute efficiency. However, packed sequence lengths must satisfy a strict alignment constraint: they must be multiples of `2 * CP_SIZE * TP_SIZE`. The factor of 2 is essential for Context Parallelism (CP) load balancing under causal attention, where the sequence is split into `2 * CP` chunks. The Karmarkar-Karp algorithm (largest differencing method) performs load-balanced bin packing, distributing sequences across micro-batches to minimize the maximum packed length. Sequence packing is only supported with the Megatron strategy and requires Flash Attention.
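The alignment rule itself is simple to express. Below is a minimal sketch; the function names are illustrative, not ROLL's API, and the shortfall is assumed to be covered by pad tokens appended to the pack:

```python
def packed_length_alignment(cp_size: int, tp_size: int) -> int:
    """Required alignment for packed sequences: 2 * CP_SIZE * TP_SIZE."""
    return 2 * cp_size * tp_size

def round_up_packed_length(length: int, cp_size: int, tp_size: int) -> int:
    """Round a packed length up to the nearest valid multiple of the alignment."""
    align = packed_length_alignment(cp_size, tp_size)
    return ((length + align - 1) // align) * align

# With CP=2 and TP=4, the alignment is 16: a 5000-token pack rounds up to 5008.
```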
Usage
Enable sequence packing when training on data with highly variable sequence lengths and using the Megatron backend. Set `algorithm: load_balance` for optimal packing. Configure `max_packed_sequence_length_train` and `max_packed_sequence_length_forward` (typically 8192). Ensure `sequence_length_round_in_train` is divisible by `TP * CP`.
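A configuration fragment along these lines may help orient the reader; the keys come from the text above, but the flat layout and the `use_sequence_packing` toggle name are assumptions, so consult ROLL's config schema for the exact structure:

```yaml
# Illustrative fragment; nesting and toggle name are assumptions.
strategy: megatron_strategy            # sequence packing requires Megatron
use_sequence_packing: true             # hypothetical toggle name
algorithm: load_balance                # Karmarkar-Karp load-balanced packing
max_packed_sequence_length_train: 8192
max_packed_sequence_length_forward: 8192
sequence_length_round_in_train: 128    # must be divisible by TP * CP
```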
The Insight (Rule of Thumb)
- Action: Set alignment to `2 * CP_SIZE * TP_SIZE`. Use `algorithm: load_balance` (Karmarkar-Karp).
- Value: Typical `max_packed_sequence_length`: 8192. Typical rounding: 128 or 64.
- Trade-off: Sequence packing improves throughput by 20-40% but adds packing computation overhead and requires Megatron backend.
- Constraint: Only works with `megatron_strategy`. Not compatible with DeepSpeed or FSDP2.
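The `load_balance` algorithm named above can be sketched as follows. This is a minimal k-way Karmarkar-Karp (largest differencing) partitioner, not ROLL's implementation; it illustrates how sequence lengths are spread across micro-batches so that the maximum packed length is minimized:

```python
import heapq

def karmarkar_karp(sizes, num_bins):
    """Partition `sizes` into `num_bins` bins with near-equal total load.

    Largest differencing method: repeatedly merge the two partial
    partitions with the largest spread (max load - min load), pairing
    the lightest bins of one with the heaviest bins of the other.
    """
    heap = []
    for i, s in enumerate(sizes):
        # Each item starts as its own partition: one loaded bin, rest empty,
        # kept sorted ascending by load.
        bins = [(0, ())] * (num_bins - 1) + [(s, (s,))]
        heap.append((-s, i, bins))
    heapq.heapify(heap)
    tie = len(sizes)  # unique tiebreaker so bins are never compared
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        # Pair a's lightest bin with b's heaviest bin, and so on.
        merged = sorted(
            (la + lb, ia + ib) for (la, ia), (lb, ib) in zip(a, reversed(b))
        )
        spread = merged[-1][0] - merged[0][0]
        heapq.heappush(heap, (-spread, tie, merged))
        tie += 1
    return [list(items) for _, items in heap[0][2]]

# Example: packing five sequence lengths into 2 micro-batches.
bins = karmarkar_karp([8, 7, 6, 5, 4], 2)
```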
Reasoning
The alignment constraint `2 * CP * TP` ensures that when a packed sequence is distributed across Tensor Parallel and Context Parallel ranks, each rank receives an equal portion. Without this alignment, some ranks would receive more tokens than others, causing synchronization stalls and load imbalance. The factor of 2 specifically accounts for causal attention's asymmetric workload distribution in Context Parallelism, where even-odd chunk pairs must be balanced. The Karmarkar-Karp algorithm is chosen over first-fit-decreasing because it produces more balanced bins when items have diverse sizes.
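The even-odd pairing can be made concrete with the common "zigzag" chunk assignment used for causal CP load balancing (an illustrative sketch of the scheme, not ROLL's exact code): the sequence is split into `2 * CP` chunks, and rank `r` gets chunks `r` and `2*CP - 1 - r`, so a cheap early chunk and an expensive late chunk always land on the same rank:

```python
def zigzag_assignment(cp_size):
    """Rank r gets chunks r and 2*CP-1-r out of 2*CP causal-attention chunks."""
    return {r: (r, 2 * cp_size - 1 - r) for r in range(cp_size)}

def causal_cost(chunk_idx):
    # Under causal attention, chunk i attends to chunks 0..i, so its
    # workload grows roughly linearly with its position: cost ~ i + 1.
    return chunk_idx + 1

# Every rank carries the same relative cost, 2*CP + 1, so no rank stalls.
for cp in (2, 4):
    per_rank = [causal_cost(a) + causal_cost(b)
                for a, b in zigzag_assignment(cp).values()]
    assert all(c == 2 * cp + 1 for c in per_rank)
```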
Source References
- Documentation note on the factor of 2: "Factor of 2 is essential for CP load balancing under causal attention"; the alignment must satisfy `2 * CP_SIZE * TP_SIZE`.
- Ulysses attention head constraint (`roll/utils/context_parallel/ulysses_attention.py:279`): `assert (...), "Ulysses require num_key_value_heads to be dividable by ulysses_size."`
- Flash Attention deterministic mode (`roll/utils/context_parallel/ulysses_attention.py:114`): `deterministic = os.environ.get("FLASH_ATTENTION_DETERMINISTIC", "0") == "1"`