# Implementation: Alibaba ROLL ModelUtils
| Knowledge Sources | |
|---|---|
| Domains | Model_Architecture, Utilities |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
Utility mixins and functions for parameter counting, FLOPs estimation, attention backend configuration, RMS normalization, context parallel sequence sharding, and vocabulary resizing.
## Description
model_utils.py provides foundational utility classes and functions used across the MCoreAdapter model infrastructure:
ModuleUtilsMixin (lines 22-91) is a HuggingFace-inspired mixin class that adds model introspection capabilities to Megatron modules:
- num_parameters() counts total or trainable-only parameters, optionally excluding embeddings
- estimate_tokens() estimates token count from the main input tensor
- floating_point_ops() computes approximate FLOPs using the 6 * tokens * params formula
RMSNorm (lines 94-109) is a Root Mean Square Layer Normalization implementation compatible with Megatron-Core's TransformerConfig. It supports CPU initialization, configurable epsilon, and marks its weight with the sequence_parallel attribute for proper gradient handling.
The module also provides several standalone utility functions:
- exists_hf_config() and exists_mca_config() check for the presence of HuggingFace or MCA config files in a checkpoint directory
- check_and_get_attention_backend_by_env() determines the attention backend from NVTE environment variables (NVTE_FLASH_ATTN, NVTE_FUSED_ATTN, NVTE_UNFUSED_ATTN)
- get_thd_data_on_this_cp_rank() shards THD-format packed sequence data for context parallelism using Transformer Engine's thd_get_partitioned_indices
- configure_resized_vocab_size() computes a padded vocabulary size when the tokenizer is larger than the model's original vocabulary
Additionally, internal helper _McaLoraLogitsHelper (lines 112-122) is a custom torch.autograd.Function that ensures gradient contiguity for LoRA logits in tensor-parallel settings, and mca_lora_logits_postprocess_hook (lines 137-140) is a forward hook that applies this fix.
## Usage
Use ModuleUtilsMixin as a base class for any model that needs parameter counting and FLOPs estimation. Use RMSNorm as a drop-in replacement for Megatron-Core's default layer normalization when using the local transformer implementation. The utility functions are called internally by McaModelConfig and PretrainedModel during model loading and configuration.
## Code Reference

### Source Location
- Repository: Alibaba_ROLL
- File: mcore_adapter/src/mcore_adapter/models/model_utils.py
- Lines: 1-213
### Key Classes

#### ModuleUtilsMixin

```python
class ModuleUtilsMixin:
    main_input_name: str = "input_ids"
```
Key methods:
- num_parameters(only_trainable, exclude_embeddings) (lines 29-44): Counts parameters. When exclude_embeddings=True, excludes nn.Embedding weights. When only_trainable=True, only counts parameters with requires_grad=True.
- estimate_tokens(input_dict) (lines 46-65): Returns input_dict[main_input_name].numel(). Issues a one-time warning if the main input key is not found.
- floating_point_ops(input_dict, exclude_embeddings) (lines 67-91): Returns 6 * estimated_tokens * num_parameters as an approximation valid when 12 * d_model << sequence_length.
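As a back-of-envelope illustration of the `6 * tokens * params` rule (the model and batch sizes here are hypothetical):

```python
def floating_point_ops(num_tokens: int, num_params: int) -> int:
    # 6 = 2 (multiply-accumulate) x 3 (forward pass + backward passes
    # for activation and weight gradients)
    return 6 * num_tokens * num_params


# Hypothetical 7B-parameter model on a batch of 4 sequences of 4096 tokens
flops = floating_point_ops(num_tokens=4 * 4096, num_params=7_000_000_000)
# ~6.9e14 FLOPs for one training step on this batch
```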
#### RMSNorm

```python
class RMSNorm(nn.Module):
    def __init__(self, config: "TransformerConfig", hidden_size, eps=1e-6, **kwargs): ...
```
Root Mean Square Layer Normalization. Creates a learnable weight parameter initialized to ones. The forward pass computes weight * (x / sqrt(mean(x^2) + eps)), converting to float32 for numerical stability.
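A minimal, standalone sketch of the forward computation described above, without the `TransformerConfig`, CPU-initialization, or `sequence_parallel` integration of the real class:

```python
import torch
import torch.nn as nn


class SimpleRMSNorm(nn.Module):
    """Sketch of RMSNorm: y = weight * x / sqrt(mean(x^2) + eps)."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # initialized to ones
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dtype = x.dtype
        x = x.float()                              # float32 for numerical stability
        var = x.pow(2).mean(-1, keepdim=True)      # mean of squares over hidden dim
        x = x * torch.rsqrt(var + self.eps)
        return (self.weight * x).to(dtype)         # cast back to the input dtype
```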
### Key Functions

#### exists_hf_config

```python
def exists_hf_config(model_name_or_path: str) -> bool  # line 143
```
Returns True if config.json exists in the given directory.
#### exists_mca_config

```python
def exists_mca_config(model_name_or_path: str) -> bool  # line 147
```
Returns True if mca_config.json exists in the given directory.
#### check_and_get_attention_backend_by_env

```python
def check_and_get_attention_backend_by_env(attention_backend: AttnBackend) -> AttnBackend  # lines 151-170
```
Resolves the attention backend based on NVTE environment variables when attention_backend is auto. Returns AttnBackend.flash, AttnBackend.fused, AttnBackend.unfused, AttnBackend.local, or AttnBackend.auto.
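An illustrative sketch of env-based resolution; the `AttnBackend` enum below is a stand-in for `megatron.core.transformer.enums.AttnBackend`, and the actual precedence among the NVTE flags in the source may differ:

```python
import os
from enum import Enum


class AttnBackend(Enum):  # stand-in for the Megatron-Core enum
    auto = "auto"
    flash = "flash"
    fused = "fused"
    unfused = "unfused"
    local = "local"


def resolve_backend_from_env(backend: AttnBackend) -> AttnBackend:
    """Only 'auto' consults the NVTE flags; explicit settings pass through."""
    if backend != AttnBackend.auto:
        return backend
    if os.environ.get("NVTE_FLASH_ATTN") == "1":
        return AttnBackend.flash
    if os.environ.get("NVTE_FUSED_ATTN") == "1":
        return AttnBackend.fused
    if os.environ.get("NVTE_UNFUSED_ATTN") == "1":
        return AttnBackend.unfused
    return AttnBackend.auto  # nothing set: leave resolution to Megatron
```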
#### get_thd_data_on_this_cp_rank

```python
def get_thd_data_on_this_cp_rank(
    batch: Dict[str, "torch.Tensor"],
    packed_seq_params: PackedSeqParams,
    dim3_keys: List[str] = ["attention_mask"],
) -> Dict[str, "torch.Tensor"]  # lines 173-196
```
Shards packed-sequence data in THD layout (total tokens, heads, head dim) for context parallelism. Uses Transformer Engine's thd_get_partitioned_indices to compute the sequence indices assigned to each CP rank, then index-selects along the sequence dimension.
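The load-balanced split that Transformer Engine performs for causal attention can be approximated in plain Python. This is a simplified sketch assuming the mirrored 2x-chunks scheme and a sequence length divisible by `2 * cp_size`; it is not the TE kernel:

```python
from typing import List


def cp_shard_indices(seq_len: int, cp_size: int, cp_rank: int) -> List[int]:
    """Split one sequence into 2*cp_size chunks and give rank r chunks
    r and (2*cp_size - 1 - r), so every rank holds an "early" and a "late"
    chunk and causal-attention work is balanced across ranks."""
    chunk = seq_len // (2 * cp_size)
    first = range(cp_rank * chunk, (cp_rank + 1) * chunk)
    mirror = 2 * cp_size - 1 - cp_rank
    second = range(mirror * chunk, (mirror + 1) * chunk)
    return list(first) + list(second)
```

With these indices in hand, each rank would index-select its slice along the token dimension of every tensor in the batch.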
#### configure_resized_vocab_size

```python
def configure_resized_vocab_size(
    original_vocab_size: int,
    tokenizer_len: int,
    pad_to_multiple_of: int = 64,
) -> Optional[int]  # lines 199-213
```
Returns a new vocabulary size padded to pad_to_multiple_of when the tokenizer is larger than the original vocab. Returns None if no resizing is needed.
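The padding arithmetic reduces to rounding up to the next multiple. A minimal re-derivation of that logic (not the source function):

```python
from typing import Optional


def resized_vocab_size(original_vocab_size: int, tokenizer_len: int,
                       pad_to_multiple_of: int = 64) -> Optional[int]:
    # No resize needed when the model's vocab already covers the tokenizer.
    if tokenizer_len <= original_vocab_size:
        return None
    # Round tokenizer_len up to the next multiple of pad_to_multiple_of.
    return ((tokenizer_len + pad_to_multiple_of - 1)
            // pad_to_multiple_of) * pad_to_multiple_of
```

Padding to a multiple (64 by default) keeps the embedding matrix divisible by the tensor-parallel degree and aligned for efficient GPU kernels.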
### Import

```python
import torch
import torch.nn as nn
from megatron.core import mpu
from megatron.core.packed_seq_params import PackedSeqParams
from megatron.core.transformer.enums import AttnBackend

from mcore_adapter.models.model_utils import (
    ModuleUtilsMixin,
    RMSNorm,
    exists_hf_config,
    exists_mca_config,
    check_and_get_attention_backend_by_env,
    get_thd_data_on_this_cp_rank,
    configure_resized_vocab_size,
)
```
## I/O Contract

### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| only_trainable | bool | No | If True, count only parameters with requires_grad (default: False) |
| exclude_embeddings | bool | No | If True, exclude embedding weights from count (default: False) |
| input_dict | Dict[str, Any] | Yes | Model input dictionary containing the main input tensor |
| model_name_or_path | str | Yes | Path to model checkpoint directory |
| attention_backend | AttnBackend | Yes | Current attention backend setting (auto triggers env var resolution) |
| original_vocab_size | int | Yes | Original vocabulary size from model config |
| tokenizer_len | int | Yes | Length of the tokenizer vocabulary |
### Outputs
| Name | Type | Description |
|---|---|---|
| num_parameters | int | Total parameter count |
| estimated_tokens | int | Estimated number of tokens in the input batch |
| flops | int | Approximate floating-point operations |
| exists | bool | Whether a config file exists |
| backend | AttnBackend | Resolved attention backend |
| resized_vocab | Optional[int] | New vocab size or None if no resize needed |
## Usage Examples

```python
import torch.nn as nn

from mcore_adapter.models.model_utils import (
    ModuleUtilsMixin, RMSNorm, configure_resized_vocab_size,
    check_and_get_attention_backend_by_env,
)
from megatron.core.transformer.enums import AttnBackend

# Parameter counting via mixin
class MyModel(nn.Module, ModuleUtilsMixin):
    pass

model = MyModel()
total_params = model.num_parameters()
trainable_params = model.num_parameters(only_trainable=True)

# Check attention backend
backend = check_and_get_attention_backend_by_env(AttnBackend.auto)

# Vocab resizing
new_size = configure_resized_vocab_size(
    original_vocab_size=151936,
    tokenizer_len=152001,
    pad_to_multiple_of=64,
)
# new_size = 152064 (next multiple of 64 >= 152001)
```