
Heuristic:Alibaba ROLL Sequence Packing Alignment

From Leeroopedia




Knowledge Sources
Domains Optimization, Performance, Distributed_Training
Last Updated 2026-02-07 19:00 GMT

Overview

Sequence packing alignment constraint requiring packed sequence lengths to be multiples of `2 * CP_SIZE * TP_SIZE`, with the Karmarkar-Karp algorithm providing load-balanced bin packing.

Description

ROLL's sequence packing feature concatenates variable-length sequences to eliminate padding tokens, significantly improving compute efficiency. However, packed sequence lengths must satisfy a strict alignment constraint: they must be multiples of `2 * CP_SIZE * TP_SIZE`. The factor of 2 is essential for Context Parallelism (CP) load balancing under causal attention, where the sequence is split into 2*CP chunks. The Karmarkar-Karp (largest differencing method) algorithm is used for load-balanced bin packing, distributing sequences across micro-batches to minimize the maximum packed length. Sequence packing is only supported with the Megatron strategy and requires Flash Attention.
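The alignment rule can be expressed as a simple round-up to the next valid multiple. A minimal sketch (the function name and signature are illustrative, not ROLL's API):

```python
def align_packed_length(length: int, cp_size: int, tp_size: int) -> int:
    """Round a packed sequence length up to the required multiple.

    Packed lengths must be multiples of 2 * CP_SIZE * TP_SIZE so that every
    tensor-parallel and context-parallel rank receives an equal token slice;
    the factor of 2 covers causal-attention load balancing across CP chunks.
    (Illustrative helper, not part of ROLL.)
    """
    multiple = 2 * cp_size * tp_size
    return ((length + multiple - 1) // multiple) * multiple

# With CP=2 and TP=4 the required multiple is 16, so a 1000-token
# packed sequence is padded up to 1008.
```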

Usage

Enable sequence packing when training on data with highly variable sequence lengths and using the Megatron backend. Set `algorithm: load_balance` for optimal packing. Configure `max_packed_sequence_length_train` and `max_packed_sequence_length_forward` (typically 8192). Ensure `sequence_length_round_in_train` is divisible by `TP * CP`.
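The two divisibility requirements above can be checked before training starts. A minimal sketch, assuming the option values named on this page; the validator itself is illustrative, not ROLL's API:

```python
def validate_packing_config(
    max_packed_sequence_length: int,
    sequence_length_round: int,
    cp_size: int,
    tp_size: int,
) -> None:
    """Check the packing alignment constraints described above.

    Illustrative helper, not part of ROLL's codebase.
    """
    alignment = 2 * cp_size * tp_size
    if max_packed_sequence_length % alignment != 0:
        raise ValueError(
            f"max_packed_sequence_length must be a multiple of "
            f"2 * CP * TP = {alignment}"
        )
    if sequence_length_round % (tp_size * cp_size) != 0:
        raise ValueError(
            f"sequence_length_round_in_train must be divisible by "
            f"TP * CP = {tp_size * cp_size}"
        )

# The typical values 8192 and 128 satisfy both constraints for CP=2, TP=4.
validate_packing_config(8192, 128, cp_size=2, tp_size=4)
```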

The Insight (Rule of Thumb)

  • Action: Set alignment to `2 * CP_SIZE * TP_SIZE`. Use `algorithm: load_balance` (Karmarkar-Karp).
  • Value: Typical `max_packed_sequence_length`: 8192. Typical rounding: 128 or 64.
  • Trade-off: Sequence packing improves throughput by 20-40% but adds packing computation overhead and requires Megatron backend.
  • Constraint: Only works with `megatron_strategy`. Not compatible with DeepSpeed or FSDP2.

Reasoning

The alignment constraint `2 * CP * TP` ensures that when a packed sequence is distributed across Tensor Parallel and Context Parallel ranks, each rank receives an equal portion. Without this alignment, some ranks would receive more tokens than others, causing synchronization stalls and load imbalance. The factor of 2 specifically accounts for causal attention's asymmetric workload distribution in Context Parallelism, where even-odd chunk pairs must be balanced. The Karmarkar-Karp algorithm is chosen over first-fit-decreasing because it produces more balanced bins when items have diverse sizes.
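The largest differencing idea can be sketched as a k-way partition of sequence lengths into micro-batches: repeatedly merge the two most imbalanced partial partitions, pairing the largest sums of one with the smallest of the other. This is a self-contained illustration of Karmarkar-Karp, not ROLL's actual implementation:

```python
import heapq

def karmarkar_karp(lengths: list[int], k: int) -> list[list[int]]:
    """Partition sequence lengths into k micro-batches so the maximum
    packed length is small (largest differencing method).

    Returns k groups of item indices. Illustrative sketch only.
    """
    # Heap entries: (-spread, sums_ascending, groups), spread = max - min.
    heap = []
    for i, n in enumerate(lengths):
        sums = [0] * (k - 1) + [n]
        groups = [[] for _ in range(k - 1)] + [[i]]
        heapq.heappush(heap, (-n, sums, groups))
    while len(heap) > 1:
        _, s1, g1 = heapq.heappop(heap)
        _, s2, g2 = heapq.heappop(heap)
        # Pair the largest sums of one partition with the smallest of the
        # other, which cancels out imbalance between the two.
        sums = [a + b for a, b in zip(s1, reversed(s2))]
        groups = [ga + gb for ga, gb in zip(g1, list(reversed(g2)))]
        order = sorted(range(k), key=lambda j: sums[j])
        sums = [sums[j] for j in order]
        groups = [groups[j] for j in order]
        heapq.heappush(heap, (-(sums[-1] - sums[0]), sums, groups))
    return heap[0][2]

# Example: lengths [8, 7, 6, 5, 4] split into 2 micro-batches
# yields packed totals of 14 and 16 (optimum would be 15).
```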

Documentation note about the factor of 2:

Factor of 2 is essential for CP load balancing under causal attention
Must satisfy alignment: 2 * CP_SIZE * TP_SIZE

Ulysses attention head constraint from `roll/utils/context_parallel/ulysses_attention.py:279`:

assert (...), "Ulysses require num_key_value_heads to be dividable by ulysses_size."

Flash Attention deterministic mode from `roll/utils/context_parallel/ulysses_attention.py:114`:

deterministic = os.environ.get("FLASH_ATTENTION_DETERMINISTIC", "0") == "1"
