Implementation: Alibaba ROLL Trainer Utils
| Knowledge Sources | |
|---|---|
| Domains | Training, Utilities |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Training utility functions for building attention masks, validating sequence packing alignment, and creating Megatron-compatible learning rate schedulers.
Description
This module provides essential utilities for the mcore_adapter training infrastructure, covering three areas:
get_ltor_masks_and_position_ids (lines 13-35): Builds causal (left-to-right) attention masks and position IDs for autoregressive training. Creates ascending position IDs (0 to seq_length-1) expanded to batch size. When build_attention_mask is True, constructs a lower-triangular causal mask with shape [batch, 1, seq, seq]. If an optional 1D attention mask (attn_mask_1D) is provided, it zeros out attention from positions beyond each sequence's valid length, supporting variable-length sequences within a fixed-size batch. The final mask is converted to a boolean where True indicates positions to mask.
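The behavior described above can be sketched as a minimal re-implementation. This is an illustration only, not the module's actual code; the function name `ltor_masks_and_position_ids_sketch` and the exact tensor operations are assumptions based on the description.

```python
import torch

def ltor_masks_and_position_ids_sketch(input_ids, attn_mask_1D=None):
    # Illustrative sketch of causal-mask construction (assumed, not the real code)
    batch, seq_len = input_ids.shape
    # Ascending position IDs 0..seq_length-1, expanded to the batch
    position_ids = torch.arange(seq_len).unsqueeze(0).expand(batch, -1)
    # Lower-triangular causal mask: 1 = position may be attended to
    mask = torch.tril(torch.ones(batch, 1, seq_len, seq_len))
    if attn_mask_1D is not None:
        # Zero out attention to positions beyond each sequence's valid length
        mask = mask * attn_mask_1D.view(batch, 1, 1, seq_len)
    # Convert to boolean where True marks positions to MASK (not attend)
    return (mask < 0.5), position_ids
```

For a batch of shape [4, 2048] this yields a [4, 1, 2048, 2048] boolean mask and [4, 2048] position IDs, matching the documented shapes.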
get_seqlens_in_batch (lines 38-68): Extracts cumulative sequence lengths from packed attention masks (modified from LLaMA-Factory). The attention mask uses integer labels to identify different sequences within a packed batch (e.g., 1 for first sequence, 2 for second, 0 for padding). The function counts tokens per sequence label, removes zero-count entries, computes cumulative sums, and prepends a zero for compatibility with flash attention's cu_seqlens format. Returns both the cumulative sequence lengths and the maximum sequence length.
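The label-counting scheme can be sketched as follows. This is a hedged approximation of the described algorithm (count tokens per label, drop zero counts, cumulate, prepend zero); the real implementation in utils.py may differ in detail.

```python
import torch

def seqlens_in_batch_sketch(attention_mask):
    # Sketch of cu_seqlens extraction from an integer-labeled packed mask
    # (label 0 = padding, labels 1..N = distinct packed sequences)
    max_label = int(attention_mask.max().item())
    counts = torch.stack(
        [(attention_mask == i + 1).sum(dim=-1) for i in range(max_label)],
        dim=-1,
    ).flatten()
    # Drop zero-count entries (labels absent from a given row, and padding)
    seqlens = counts[counts.nonzero().squeeze(dim=-1)]
    # Prepend a zero to match flash attention's cu_seqlens convention
    cu_seqlens = torch.cat(
        [torch.zeros(1, dtype=torch.int32),
         torch.cumsum(seqlens, 0, dtype=torch.int32)]
    )
    return cu_seqlens, seqlens.max().to(torch.int32)
```

For the packed mask `[[1, 1, 2, 2, 2, 0], [1, 2, 2, 3, 3, 3]]`, the per-sequence lengths are [2, 3, 1, 2, 3], so this sketch returns cu_seqlens `[0, 2, 5, 6, 8, 11]` and max length 3.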
check_pack_seq_aligned (lines 71-95): Validates that all sub-sequences within packed data are aligned to a given align_size (e.g., for context parallelism where each chunk must be evenly divisible). Iterates through sequence labels and checks that each sequence's length is divisible by align_size.
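The alignment check amounts to a divisibility test over per-label sequence lengths. A minimal sketch of that logic, under the same labeling convention as above (an illustration, not the module's code):

```python
import torch

def check_pack_seq_aligned_sketch(attention_mask, align_size):
    # Every packed sub-sequence length must divide evenly by align_size,
    # e.g. so context parallelism can split each sequence into equal chunks
    max_label = int(attention_mask.max().item())
    for i in range(max_label):
        lengths = (attention_mask == i + 1).sum(dim=-1)
        for n in lengths.tolist():
            if n > 0 and n % align_size != 0:
                return False
    return True
```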
MegatronLRScheduler (lines 98-110): A thin wrapper around Megatron-Core's OptimizerParamScheduler that adds _last_lr tracking, implementing the get_last_lr() interface expected by training loops for logging the current learning rate.
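The `_last_lr` tracking pattern can be illustrated without Megatron-Core by using a stand-in base class. This is purely a sketch of the wrapper idea; `_BaseScheduler` is hypothetical and the real class subclasses `OptimizerParamScheduler`.

```python
class _BaseScheduler:
    # Hypothetical stand-in for Megatron-Core's OptimizerParamScheduler
    def __init__(self, lrs):
        self._lrs = lrs

    def step(self, increment=1):
        pass  # the real base class updates each param group's LR here

class TrackingScheduler(_BaseScheduler):
    _last_lr = None

    def step(self, increment=1):
        super().step(increment)
        # Record the post-step LRs, mirroring the get_last_lr() contract
        # that torch.optim schedulers expose for logging
        self._last_lr = list(self._lrs)

    def get_last_lr(self):
        return self._last_lr
```

Training loops can then call `scheduler.get_last_lr()[0]` when logging, regardless of whether the underlying scheduler is a torch or Megatron one.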
get_megatron_lr_scheduler (lines 113-153): Factory function that creates a MegatronLRScheduler from HuggingFace-style training arguments. Maps HF scheduler type names to Megatron equivalents (e.g., "constant_with_warmup" to "constant", "cosine_with_min_lr" to "cosine"). Supports constant, cosine, linear, inverse-square-root, and WSD (Warmup-Stable-Decay) schedules. Extracts scheduler parameters from lr_scheduler_kwargs with sensible defaults.
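The name translation can be pictured as a small lookup table. The mapping below is an assumption reconstructed from the examples in the text (the two documented pairs plus plausible entries for the other supported schedules); the actual table in utils.py may contain different keys.

```python
# Hypothetical HF -> Megatron decay-style mapping (illustrative only)
HF_TO_MEGATRON_DECAY_STYLE = {
    "constant": "constant",
    "constant_with_warmup": "constant",   # documented in the text
    "cosine": "cosine",
    "cosine_with_min_lr": "cosine",       # documented in the text
    "linear": "linear",
    "inverse_sqrt": "inverse-square-root",
    "warmup_stable_decay": "WSD",
}

def resolve_decay_style(hf_name: str) -> str:
    # Translate a HuggingFace lr_scheduler_type into a Megatron decay style
    try:
        return HF_TO_MEGATRON_DECAY_STYLE[hf_name]
    except KeyError:
        raise ValueError(f"unsupported lr_scheduler_type: {hf_name}")
```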
Usage
Use get_ltor_masks_and_position_ids in data preparation pipelines to create causal masks for autoregressive training. Use get_seqlens_in_batch when working with sequence packing to extract flash attention-compatible sequence length information. Use get_megatron_lr_scheduler during trainer initialization to create a learning rate scheduler from HuggingFace-style training arguments.
Code Reference
Source Location
- Repository: Alibaba_ROLL
- File: mcore_adapter/src/mcore_adapter/trainer/utils.py
- Lines: 1-153
Signature
def get_ltor_masks_and_position_ids(
input_ids: torch.Tensor,
build_attention_mask: bool = True,
attn_mask_1D: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]: ...
def get_seqlens_in_batch(
attention_mask: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]: ...
def check_pack_seq_aligned(
attention_mask: torch.Tensor,
align_size: int,
) -> bool: ...
class MegatronLRScheduler(OptimizerParamScheduler):
_last_lr: list[float] | None = None
def get_lr(self, param_group) -> float: ...
def step(self, increment: int = 1) -> None: ...
def get_last_lr(self) -> list[float]: ...
def get_megatron_lr_scheduler(
args: TrainingArguments,
num_training_steps: int,
optimizer: MegatronOptimizer,
) -> MegatronLRScheduler: ...
Import
from mcore_adapter.trainer.utils import (
get_ltor_masks_and_position_ids,
get_seqlens_in_batch,
check_pack_seq_aligned,
MegatronLRScheduler,
get_megatron_lr_scheduler,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.Tensor | Yes | Token IDs of shape [batch_size, seq_length] |
| build_attention_mask | bool | No | Whether to build the 2D causal attention mask (default True) |
| attn_mask_1D | torch.Tensor or None | No | Optional 1D mask for variable-length sequences within a batch |
| attention_mask | torch.Tensor | Yes (for get_seqlens_in_batch, check_pack_seq_aligned) | Packed attention mask with integer labels per sequence |
| align_size | int | Yes (for check_pack_seq_aligned) | Required alignment size for sub-sequences |
| args | TrainingArguments | Yes (for get_megatron_lr_scheduler) | HuggingFace-style training arguments with LR scheduler config |
| num_training_steps | int | Yes (for get_megatron_lr_scheduler) | Total number of training steps for decay schedule |
| optimizer | MegatronOptimizer | Yes (for get_megatron_lr_scheduler) | The Megatron optimizer instance |
Outputs
| Name | Type | Description |
|---|---|---|
| (get_ltor_masks_and_position_ids) attention_mask | torch.Tensor | Boolean causal attention mask of shape [batch, 1, seq, seq] where True means masked |
| (get_ltor_masks_and_position_ids) position_ids | torch.Tensor | Position IDs of shape [batch, seq] with values 0 to seq_length-1 |
| (get_seqlens_in_batch) seqlens | torch.Tensor (int32) | Cumulative sequence lengths with leading zero, for flash attention cu_seqlens format |
| (get_seqlens_in_batch) max_seq_len | torch.Tensor (int32) | Maximum sequence length in the batch |
| (check_pack_seq_aligned) | bool | True if all sub-sequences are aligned to align_size |
| (get_megatron_lr_scheduler) | MegatronLRScheduler | Configured LR scheduler instance |
Usage Examples
from mcore_adapter.trainer.utils import (
get_ltor_masks_and_position_ids,
get_seqlens_in_batch,
check_pack_seq_aligned,
get_megatron_lr_scheduler,
)
import torch
# Build causal masks and position IDs
input_ids = torch.randint(0, 32000, (4, 2048))
attention_mask, position_ids = get_ltor_masks_and_position_ids(input_ids)
# attention_mask: [4, 1, 2048, 2048], position_ids: [4, 2048]
# Extract sequence lengths from packed attention mask
packed_mask = torch.tensor([[1, 1, 2, 2, 2, 0], [1, 2, 2, 3, 3, 3]])
cu_seqlens, max_seqlen = get_seqlens_in_batch(packed_mask)
# per-sequence lengths are [2, 3, 1, 2, 3], so
# cu_seqlens: [0, 2, 5, 6, 8, 11], max_seqlen: 3
# Check packing alignment for context parallelism
is_aligned = check_pack_seq_aligned(packed_mask, align_size=2)
# False: packed_mask contains sub-sequences of lengths 3 and 1
# Create a Megatron LR scheduler
scheduler = get_megatron_lr_scheduler(
args=training_args,
num_training_steps=10000,
optimizer=megatron_optimizer,
)