
Implementation:Alibaba ROLL Trainer Utils

From Leeroopedia


Knowledge Sources
Domains Training, Utilities
Last Updated 2026-02-07 20:00 GMT

Overview

Training utility functions for building attention masks, validating sequence packing alignment, and creating Megatron-compatible learning rate schedulers.

Description

This module provides essential utilities for the mcore_adapter training infrastructure, grouped into three areas: attention-mask construction, sequence-packing handling, and learning-rate scheduling:

get_ltor_masks_and_position_ids (lines 13-35): Builds causal (left-to-right) attention masks and position IDs for autoregressive training. Creates ascending position IDs (0 to seq_length-1) expanded to batch size. When build_attention_mask is True, constructs a lower-triangular causal mask with shape [batch, 1, seq, seq]. If an optional 1D attention mask (attn_mask_1D) is provided, it zeros out attention from positions beyond each sequence's valid length, supporting variable-length sequences within a fixed-size batch. The final mask is converted to a boolean where True indicates positions to mask.
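The construction described above can be sketched as follows. This is a minimal illustration of the described behavior, not the module's actual implementation; the helper name is hypothetical.

```python
import torch

def build_ltor_masks_and_position_ids(input_ids, build_attention_mask=True, attn_mask_1D=None):
    """Sketch of the causal-mask and position-ID construction described above."""
    batch_size, seq_length = input_ids.shape

    # Ascending position IDs 0..seq_length-1, expanded to the batch dimension.
    position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
    position_ids = position_ids.unsqueeze(0).expand_as(input_ids)

    attention_mask = None
    if build_attention_mask:
        # Lower-triangular causal mask of shape [batch, 1, seq, seq].
        attention_mask = torch.tril(
            torch.ones((batch_size, seq_length, seq_length), device=input_ids.device)
        ).view(batch_size, 1, seq_length, seq_length)
        if attn_mask_1D is not None:
            # Zero out attention to positions beyond each sequence's valid length.
            attention_mask = attention_mask * attn_mask_1D.view(batch_size, 1, 1, seq_length)
        # Boolean form: True marks positions that must be masked out.
        attention_mask = attention_mask < 0.5
    return attention_mask, position_ids
```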

get_seqlens_in_batch (lines 38-68): Extracts cumulative sequence lengths from packed attention masks (modified from LLaMA-Factory). The attention mask uses integer labels to identify different sequences within a packed batch (e.g., 1 for first sequence, 2 for second, 0 for padding). The function counts tokens per sequence label, removes zero-count entries, computes cumulative sums, and prepends a zero for compatibility with flash attention's cu_seqlens format. Returns both the cumulative sequence lengths and the maximum sequence length.
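A sketch of that extraction, assuming per-row label counts are taken in order 1..max_label with the label-0 padding run counted last (an assumption consistent with the worked example under Usage Examples); the function name is hypothetical.

```python
import torch

def seqlens_in_batch_sketch(attention_mask):
    """Sketch of cumulative-seqlen extraction from a packed attention mask."""
    seqlens = []
    for row in attention_mask:
        max_label = int(row.max())
        # Token count per sequence label 1..max_label, then padding (label 0).
        counts = [int((row == label).sum()) for label in range(1, max_label + 1)]
        counts.append(int((row == 0).sum()))
        seqlens.extend(c for c in counts if c > 0)  # drop zero-count entries
    seqlens = torch.tensor(seqlens, dtype=torch.int32)
    # Prepend a zero to the cumulative sums -> flash attention's cu_seqlens layout.
    cu_seqlens = torch.cat(
        [torch.zeros(1, dtype=torch.int32), seqlens.cumsum(0).to(torch.int32)]
    )
    return cu_seqlens, seqlens.max()
```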

check_pack_seq_aligned (lines 71-95): Validates that all sub-sequences within packed data are aligned to a given align_size (e.g., for context parallelism where each chunk must be evenly divisible). Iterates through sequence labels and checks that each sequence's length is divisible by align_size.
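The divisibility check reduces to the following sketch (hypothetical name, same per-label counting assumption as above):

```python
import torch

def check_pack_seq_aligned_sketch(attention_mask, align_size):
    """Sketch: every packed sub-sequence length must divide evenly by align_size."""
    for row in attention_mask:
        for label in range(1, int(row.max()) + 1):
            if int((row == label).sum()) % align_size != 0:
                return False
    return True
```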

MegatronLRScheduler (lines 98-110): A thin wrapper around Megatron-Core's OptimizerParamScheduler that adds _last_lr tracking, implementing the get_last_lr() interface expected by training loops for logging the current learning rate.
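The tracking pattern can be illustrated against a stand-in base class; `_StubParamScheduler` only emulates an assumed slice of Megatron-Core's `OptimizerParamScheduler` surface and is not the real API.

```python
class _StubParamScheduler:
    """Stand-in for Megatron-Core's OptimizerParamScheduler (assumed surface)."""
    def __init__(self, lrs):
        self._lrs = list(lrs)
        self.num_steps = 0

    def get_lr(self, param_group):
        return self._lrs[min(self.num_steps, len(self._lrs) - 1)]

    def step(self, increment=1):
        self.num_steps += increment

class LastLRTrackingScheduler(_StubParamScheduler):
    """Adds the get_last_lr() interface expected by training-loop logging."""
    _last_lr = None

    def step(self, increment=1):
        super().step(increment)
        # Cache the LR produced by this step for later logging.
        self._last_lr = [self.get_lr(None)]

    def get_last_lr(self):
        if self._last_lr is None:
            self._last_lr = [self.get_lr(None)]
        return self._last_lr
```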

get_megatron_lr_scheduler (lines 113-153): Factory function that creates a MegatronLRScheduler from HuggingFace-style training arguments. Maps HF scheduler type names to Megatron equivalents (e.g., "constant_with_warmup" to "constant", "cosine_with_min_lr" to "cosine"). Supports constant, cosine, linear, inverse-square-root, and WSD (Warmup-Stable-Decay) schedules. Extracts scheduler parameters from lr_scheduler_kwargs with sensible defaults.
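The name mapping might look like the sketch below. Only the two renames named above come from the source; the remaining entries are assumptions based on HuggingFace `SchedulerType` names and Megatron decay styles, and `resolve_decay_style` is a hypothetical helper.

```python
# Hypothetical HF -> Megatron decay-style mapping (only the two renames
# named in the text are confirmed; the rest are assumptions).
HF_TO_MEGATRON_DECAY_STYLE = {
    "constant": "constant",
    "constant_with_warmup": "constant",
    "cosine": "cosine",
    "cosine_with_min_lr": "cosine",
    "linear": "linear",
    "inverse_sqrt": "inverse-square-root",
    "warmup_stable_decay": "WSD",
}

def resolve_decay_style(hf_scheduler_type: str) -> str:
    """Map a HuggingFace lr_scheduler_type name to a Megatron decay style."""
    try:
        return HF_TO_MEGATRON_DECAY_STYLE[hf_scheduler_type]
    except KeyError:
        raise ValueError(f"unsupported lr_scheduler_type: {hf_scheduler_type!r}")
```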

Usage

Use get_ltor_masks_and_position_ids in data preparation pipelines to create causal masks for autoregressive training. Use get_seqlens_in_batch when working with sequence packing to extract flash attention-compatible sequence length information. Use get_megatron_lr_scheduler during trainer initialization to create a learning rate scheduler from HuggingFace-style training arguments.

Code Reference

Source Location

Signature

def get_ltor_masks_and_position_ids(
    input_ids: torch.Tensor,
    build_attention_mask: bool = True,
    attn_mask_1D: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]: ...

def get_seqlens_in_batch(
    attention_mask: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]: ...

def check_pack_seq_aligned(
    attention_mask: torch.Tensor,
    align_size: int,
) -> bool: ...

class MegatronLRScheduler(OptimizerParamScheduler):
    _last_lr: list[float] | None = None
    def get_lr(self, param_group) -> float: ...
    def step(self, increment: int = 1) -> None: ...
    def get_last_lr(self) -> list[float]: ...

def get_megatron_lr_scheduler(
    args: TrainingArguments,
    num_training_steps: int,
    optimizer: MegatronOptimizer,
) -> MegatronLRScheduler: ...

Import

from mcore_adapter.trainer.utils import (
    get_ltor_masks_and_position_ids,
    get_seqlens_in_batch,
    check_pack_seq_aligned,
    MegatronLRScheduler,
    get_megatron_lr_scheduler,
)

I/O Contract

Inputs

Name Type Required Description
input_ids torch.Tensor Yes Token IDs of shape [batch_size, seq_length]
build_attention_mask bool No Whether to build the 2D causal attention mask (default True)
attn_mask_1D torch.Tensor or None No Optional 1D mask for variable-length sequences within a batch
attention_mask torch.Tensor Yes (for get_seqlens_in_batch, check_pack_seq_aligned) Packed attention mask with integer labels per sequence
align_size int Yes (for check_pack_seq_aligned) Required alignment size for sub-sequences
args TrainingArguments Yes (for get_megatron_lr_scheduler) HuggingFace-style training arguments with LR scheduler config
num_training_steps int Yes (for get_megatron_lr_scheduler) Total number of training steps for decay schedule
optimizer MegatronOptimizer Yes (for get_megatron_lr_scheduler) The Megatron optimizer instance

Outputs

Name Type Description
(get_ltor_masks_and_position_ids) attention_mask torch.Tensor Boolean causal attention mask of shape [batch, 1, seq, seq] where True means masked
(get_ltor_masks_and_position_ids) position_ids torch.Tensor Position IDs of shape [batch, seq] with values 0 to seq_length-1
(get_seqlens_in_batch) cu_seqlens torch.Tensor (int32) Cumulative sequence lengths with a leading zero, matching flash attention's cu_seqlens format
(get_seqlens_in_batch) max_seq_len torch.Tensor (int32) Maximum sequence length in the batch
(check_pack_seq_aligned) bool True if all sub-sequences are aligned to align_size
(get_megatron_lr_scheduler) MegatronLRScheduler Configured LR scheduler instance

Usage Examples

from mcore_adapter.trainer.utils import (
    get_ltor_masks_and_position_ids,
    get_seqlens_in_batch,
    check_pack_seq_aligned,
    get_megatron_lr_scheduler,
)
import torch

# Build causal masks and position IDs
input_ids = torch.randint(0, 32000, (4, 2048))
attention_mask, position_ids = get_ltor_masks_and_position_ids(input_ids)
# attention_mask: [4, 1, 2048, 2048], position_ids: [4, 2048]

# Extract sequence lengths from packed attention mask
packed_mask = torch.tensor([[1, 1, 2, 2, 2, 0], [1, 2, 2, 3, 3, 3]])
cu_seqlens, max_seqlen = get_seqlens_in_batch(packed_mask)
# cu_seqlens: [0, 2, 5, 6, 7, 9, 12], max_seqlen: 3

# Check packing alignment for context parallelism
is_aligned = check_pack_seq_aligned(packed_mask, align_size=2)
# is_aligned: False (a sub-sequence of length 3 is not divisible by 2)

# Create a Megatron LR scheduler
scheduler = get_megatron_lr_scheduler(
    args=training_args,
    num_training_steps=10000,
    optimizer=megatron_optimizer,
)
