Implementation:Alibaba ROLL ModelUtils

From Leeroopedia


Knowledge Sources
Domains Model_Architecture, Utilities
Last Updated 2026-02-07 20:00 GMT

Overview

Utility mixins and functions for parameter counting, FLOPs estimation, attention backend configuration, RMS normalization, context parallel sequence sharding, and vocabulary resizing.

Description

model_utils.py provides foundational utility classes and functions used across the MCoreAdapter model infrastructure:

ModuleUtilsMixin (lines 22-91) is a HuggingFace-inspired mixin class that adds model introspection capabilities to Megatron modules:

  • num_parameters() counts total or trainable-only parameters, optionally excluding embeddings
  • estimate_tokens() estimates token count from the main input tensor
  • floating_point_ops() computes approximate FLOPs using the 6 * tokens * params formula

RMSNorm (lines 94-109) is a Root Mean Square Layer Normalization implementation compatible with Megatron-Core's TransformerConfig. It supports CPU initialization, configurable epsilon, and marks its weight with the sequence_parallel attribute for proper gradient handling.

The module also provides several standalone utility functions:

  • exists_hf_config() and exists_mca_config() check for the presence of HuggingFace or MCA config files in a checkpoint directory
  • check_and_get_attention_backend_by_env() determines the attention backend from NVTE environment variables (NVTE_FLASH_ATTN, NVTE_FUSED_ATTN, NVTE_UNFUSED_ATTN)
  • get_thd_data_on_this_cp_rank() shards THD-format packed sequence data for context parallelism using Transformer Engine's thd_get_partitioned_indices
  • configure_resized_vocab_size() computes a padded vocabulary size when the tokenizer is larger than the model's original vocabulary

Additionally, the internal helper _McaLoraLogitsHelper (lines 112-122) is a custom torch.autograd.Function that ensures gradient contiguity for LoRA logits under tensor parallelism, and mca_lora_logits_postprocess_hook (lines 137-140) is a forward hook that applies this fix.
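
The contiguity fix can be sketched as a small identity autograd function. The sketch below uses hypothetical names (ContiguousGradSketch, logits_postprocess_hook_sketch) and is an illustration of the technique, not the exact source implementation:

```python
import torch


class ContiguousGradSketch(torch.autograd.Function):
    """Identity in the forward pass; forces the incoming gradient
    contiguous in the backward pass (illustrative stand-in for
    _McaLoraLogitsHelper)."""

    @staticmethod
    def forward(ctx, logits):
        # view_as keeps this a proper autograd op while returning the same values
        return logits.view_as(logits)

    @staticmethod
    def backward(ctx, grad_output):
        # The actual fix: hand a contiguous gradient back upstream
        return grad_output.contiguous()


def logits_postprocess_hook_sketch(module, inputs, output):
    # Forward hook: wrap the module output so its gradient is made contiguous
    return ContiguousGradSketch.apply(output)
```

A hook like this would be registered with module.register_forward_hook so every forward pass routes its logits through the contiguity-enforcing function.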

Usage

Use ModuleUtilsMixin as a base class for any model that needs parameter counting and FLOPs estimation. Use RMSNorm as a drop-in replacement for Megatron-Core's default layer normalization when using the local transformer implementation. The utility functions are called internally by McaModelConfig and PretrainedModel during model loading and configuration.

Code Reference

Source Location

Key Classes

ModuleUtilsMixin

class ModuleUtilsMixin:
    main_input_name: str = "input_ids"

Key methods:

  • num_parameters(only_trainable, exclude_embeddings) (lines 29-44): Counts parameters. When exclude_embeddings=True, excludes nn.Embedding weights. When only_trainable=True, only counts parameters with requires_grad=True.
  • estimate_tokens(input_dict) (lines 46-65): Returns input_dict[main_input_name].numel(). Issues a one-time warning if the main input key is not found.
  • floating_point_ops(input_dict, exclude_embeddings) (lines 67-91): Returns 6 * estimated_tokens * num_parameters as an approximation valid when 12 * d_model << sequence_length.
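
As a concrete illustration, a minimal mixin with these three methods might look like the following. The method bodies here are assumptions modeled on the HuggingFace equivalents, not the ROLL source:

```python
import warnings

import torch
import torch.nn as nn


class ModuleUtilsMixinSketch:
    """Hypothetical sketch of the mixin's three methods."""

    main_input_name: str = "input_ids"

    def num_parameters(self, only_trainable: bool = False,
                       exclude_embeddings: bool = False) -> int:
        if exclude_embeddings:
            # Collect parameter identities belonging to nn.Embedding modules
            emb_ids = {id(p) for m in self.modules()
                       if isinstance(m, nn.Embedding) for p in m.parameters()}
            params = [p for p in self.parameters() if id(p) not in emb_ids]
        else:
            params = list(self.parameters())
        return sum(p.numel() for p in params
                   if p.requires_grad or not only_trainable)

    def estimate_tokens(self, input_dict) -> int:
        if self.main_input_name in input_dict:
            return input_dict[self.main_input_name].numel()
        warnings.warn(f"Could not find {self.main_input_name!r} in inputs; returning 0.")
        return 0

    def floating_point_ops(self, input_dict, exclude_embeddings: bool = True) -> int:
        # 6 * tokens * params: ~2 FLOPs/token-param forward + ~4 backward
        return 6 * self.estimate_tokens(input_dict) * self.num_parameters(
            exclude_embeddings=exclude_embeddings)


class TinyModel(nn.Module, ModuleUtilsMixinSketch):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)  # 40 parameters
        self.lin = nn.Linear(4, 2)      # 8 weights + 2 biases = 10 parameters
```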

RMSNorm

class RMSNorm(nn.Module):
    def __init__(self, config: "TransformerConfig", hidden_size, eps=1e-6, **kwargs)

Root Mean Square Layer Normalization. Creates a learnable weight parameter initialized to ones. The forward pass computes weight * (x / sqrt(mean(x^2) + eps)), converting to float32 for numerical stability.
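
The described forward pass can be sketched in a few lines. This is a standalone sketch without the TransformerConfig wiring, CPU-initialization option, or sequence_parallel attribute from the source:

```python
import torch
import torch.nn as nn


class RMSNormSketch(nn.Module):
    """Minimal RMSNorm matching the described computation:
    weight * (x / sqrt(mean(x^2) + eps)), done in float32."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dtype = x.dtype
        x = x.float()  # float32 for numerical stability
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return (self.weight * x).to(dtype)
```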

Key Functions

exists_hf_config

def exists_hf_config(model_name_or_path: str) -> bool  # line 143

Returns True if config.json exists in the given directory.

exists_mca_config

def exists_mca_config(model_name_or_path: str) -> bool  # line 147

Returns True if mca_config.json exists in the given directory.

check_and_get_attention_backend_by_env

def check_and_get_attention_backend_by_env(attention_backend: AttnBackend) -> AttnBackend  # lines 151-170

Resolves the attention backend based on NVTE environment variables when attention_backend is auto. Returns AttnBackend.flash, AttnBackend.fused, AttnBackend.unfused, AttnBackend.local, or AttnBackend.auto.

get_thd_data_on_this_cp_rank

def get_thd_data_on_this_cp_rank(
    batch: Dict[str, "torch.Tensor"],
    packed_seq_params: PackedSeqParams,
    dim3_keys: List[str] = ["attention_mask"],
) -> Dict[str, "torch.Tensor"]  # lines 173-196

Shards packed-sequence data in THD format (total tokens, heads, head dimension) for context parallelism. Uses Transformer Engine's thd_get_partitioned_indices to compute the sequence indices assigned to each CP rank, then index-selects along the sequence dimension.
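
Behaviorally, the function boils down to an index-select per CP rank. The sketch below substitutes a naive round-robin split for Transformer Engine's thd_get_partitioned_indices (which load-balances causal-attention work across ranks), so it illustrates the mechanics only:

```python
from typing import Dict, List

import torch


def shard_thd_sketch(batch: Dict[str, torch.Tensor],
                     cu_seqlens: List[int],
                     cp_rank: int,
                     cp_size: int,
                     seq_dim: int = 0) -> Dict[str, torch.Tensor]:
    """Toy CP shard: take every cp_size-th token within each packed sequence.
    The real function derives indices from thd_get_partitioned_indices."""
    indices = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        # Round-robin assignment of this packed sequence's tokens to ranks
        indices.extend(range(int(start) + cp_rank, int(end), cp_size))
    index = torch.tensor(indices, dtype=torch.long)
    return {k: v.index_select(seq_dim, index) for k, v in batch.items()}
```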

configure_resized_vocab_size

def configure_resized_vocab_size(
    original_vocab_size: int,
    tokenizer_len: int,
    pad_to_multiple_of: int = 64,
) -> Optional[int]  # lines 199-213

Returns a new vocabulary size padded to pad_to_multiple_of when the tokenizer is larger than the original vocab. Returns None if no resizing is needed.
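
The arithmetic can be sketched as a round-up to the next multiple. Note that the exact rounding rule in the source may differ (for example, to satisfy tensor-parallel divisibility constraints), so treat this as an approximation:

```python
from typing import Optional


def configure_resized_vocab_size_sketch(original_vocab_size: int,
                                        tokenizer_len: int,
                                        pad_to_multiple_of: int = 64) -> Optional[int]:
    """Hypothetical padding rule: round tokenizer_len up to a multiple."""
    if tokenizer_len <= original_vocab_size:
        return None  # model vocab already covers the tokenizer; no resize
    # Standard ceiling-division round-up to the next multiple
    return -(-tokenizer_len // pad_to_multiple_of) * pad_to_multiple_of
```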

Import

import torch
import torch.nn as nn
from megatron.core import mpu
from megatron.core.packed_seq_params import PackedSeqParams
from megatron.core.transformer.enums import AttnBackend

from mcore_adapter.models.model_utils import (
    ModuleUtilsMixin, RMSNorm,
    exists_hf_config, exists_mca_config,
    check_and_get_attention_backend_by_env,
    get_thd_data_on_this_cp_rank,
    configure_resized_vocab_size,
)

I/O Contract

Inputs

Name Type Required Description
only_trainable bool No If True, count only parameters with requires_grad (default: False)
exclude_embeddings bool No If True, exclude embedding weights from count (default: False)
input_dict Dict[str, Any] Yes Model input dictionary containing the main input tensor
model_name_or_path str Yes Path to model checkpoint directory
attention_backend AttnBackend Yes Current attention backend setting (auto triggers env var resolution)
original_vocab_size int Yes Original vocabulary size from model config
tokenizer_len int Yes Length of the tokenizer vocabulary

Outputs

Name Type Description
num_parameters int Total parameter count
estimated_tokens int Estimated number of tokens in the input batch
flops int Approximate floating-point operations
exists bool Whether a config file exists
backend AttnBackend Resolved attention backend
resized_vocab Optional[int] New vocab size or None if no resize needed

Usage Examples

import torch.nn as nn

from mcore_adapter.models.model_utils import (
    ModuleUtilsMixin, RMSNorm, configure_resized_vocab_size,
    check_and_get_attention_backend_by_env,
)
from megatron.core.transformer.enums import AttnBackend

# Parameter counting via mixin
class MyModel(nn.Module, ModuleUtilsMixin):
    pass
model = MyModel()
total_params = model.num_parameters()
trainable_params = model.num_parameters(only_trainable=True)

# Check attention backend
backend = check_and_get_attention_backend_by_env(AttnBackend.auto)

# Vocab resizing
new_size = configure_resized_vocab_size(
    original_vocab_size=151936,
    tokenizer_len=152000,
    pad_to_multiple_of=64,
)
# new_size = 152064
