Implementation:Alibaba ROLL ModelUtils

From Leeroopedia


Knowledge Sources
Domains Model_Architecture, Utilities
Last Updated 2026-02-07 20:00 GMT

Overview

Utility mixins and functions for parameter counting, FLOPs estimation, attention backend configuration, RMS normalization, context parallel sequence sharding, and vocabulary resizing.

Description

model_utils.py provides foundational utility classes and functions used across the MCoreAdapter model infrastructure:

ModuleUtilsMixin (lines 22-91) is a HuggingFace-inspired mixin class that adds model introspection capabilities to Megatron modules:

  • num_parameters() counts total or trainable-only parameters, optionally excluding embeddings
  • estimate_tokens() estimates token count from the main input tensor
  • floating_point_ops() computes approximate FLOPs using the 6 * tokens * params formula

RMSNorm (lines 94-109) is a Root Mean Square Layer Normalization implementation compatible with Megatron-Core's TransformerConfig. It supports CPU initialization, configurable epsilon, and marks its weight with the sequence_parallel attribute for proper gradient handling.

The module also provides several standalone utility functions:

  • exists_hf_config() and exists_mca_config() check for the presence of HuggingFace or MCA config files in a checkpoint directory
  • check_and_get_attention_backend_by_env() determines the attention backend from NVTE environment variables (NVTE_FLASH_ATTN, NVTE_FUSED_ATTN, NVTE_UNFUSED_ATTN)
  • get_thd_data_on_this_cp_rank() shards THD-format packed sequence data for context parallelism using Transformer Engine's thd_get_partitioned_indices
  • configure_resized_vocab_size() computes a padded vocabulary size when the tokenizer is larger than the model's original vocabulary

Additionally, the internal helper _McaLoraLogitsHelper (lines 112-122) is a custom torch.autograd.Function that ensures gradient contiguity for LoRA logits under tensor parallelism, and mca_lora_logits_postprocess_hook (lines 137-140) is a forward hook that applies this fix.
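
The contiguity fix can be sketched as a small identity autograd function. The sketch below uses hypothetical names (ContiguousGradSketch, logits_postprocess_hook_sketch) and is an illustration of the technique, not the exact source implementation:

```python
import torch


class ContiguousGradSketch(torch.autograd.Function):
    """Identity in the forward pass; forces the incoming gradient
    contiguous in the backward pass (illustrative stand-in for
    _McaLoraLogitsHelper)."""

    @staticmethod
    def forward(ctx, logits):
        # view_as keeps this a proper autograd op while returning the same values
        return logits.view_as(logits)

    @staticmethod
    def backward(ctx, grad_output):
        # The actual fix: hand a contiguous gradient back upstream
        return grad_output.contiguous()


def logits_postprocess_hook_sketch(module, inputs, output):
    # Forward hook: wrap the module output so its gradient is made contiguous
    return ContiguousGradSketch.apply(output)
```

A hook like this would be registered with module.register_forward_hook so every forward pass routes its logits through the contiguity-enforcing function.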

Usage

Use ModuleUtilsMixin as a base class for any model that needs parameter counting and FLOPs estimation. Use RMSNorm as a drop-in replacement for Megatron-Core's default layer normalization when using the local transformer implementation. The utility functions are called internally by McaModelConfig and PretrainedModel during model loading and configuration.

Code Reference

Source Location

Key Classes

ModuleUtilsMixin

class ModuleUtilsMixin:
    main_input_name: str = "input_ids"

Key methods:

  • num_parameters(only_trainable, exclude_embeddings) (lines 29-44): Counts parameters. When exclude_embeddings=True, excludes nn.Embedding weights. When only_trainable=True, only counts parameters with requires_grad=True.
  • estimate_tokens(input_dict) (lines 46-65): Returns input_dict[main_input_name].numel(). Issues a one-time warning if the main input key is not found.
  • floating_point_ops(input_dict, exclude_embeddings) (lines 67-91): Returns 6 * estimated_tokens * num_parameters as an approximation valid when 12 * d_model << sequence_length.
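
As a concrete illustration, a minimal mixin with these three methods might look like the following. The method bodies here are assumptions modeled on the HuggingFace equivalents, not the ROLL source:

```python
import warnings

import torch
import torch.nn as nn


class ModuleUtilsMixinSketch:
    """Hypothetical sketch of the mixin's three methods."""

    main_input_name: str = "input_ids"

    def num_parameters(self, only_trainable: bool = False,
                       exclude_embeddings: bool = False) -> int:
        if exclude_embeddings:
            # Collect parameter identities belonging to nn.Embedding modules
            emb_ids = {id(p) for m in self.modules()
                       if isinstance(m, nn.Embedding) for p in m.parameters()}
            params = [p for p in self.parameters() if id(p) not in emb_ids]
        else:
            params = list(self.parameters())
        return sum(p.numel() for p in params
                   if p.requires_grad or not only_trainable)

    def estimate_tokens(self, input_dict) -> int:
        if self.main_input_name in input_dict:
            return input_dict[self.main_input_name].numel()
        warnings.warn(f"Could not find {self.main_input_name!r} in inputs; returning 0.")
        return 0

    def floating_point_ops(self, input_dict, exclude_embeddings: bool = True) -> int:
        # 6 * tokens * params: ~2 FLOPs/token-param forward + ~4 backward
        return 6 * self.estimate_tokens(input_dict) * self.num_parameters(
            exclude_embeddings=exclude_embeddings)


class TinyModel(nn.Module, ModuleUtilsMixinSketch):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)  # 40 parameters
        self.lin = nn.Linear(4, 2)      # 8 weights + 2 biases = 10 parameters
```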

RMSNorm

class RMSNorm(nn.Module):
    def __init__(self, config: "TransformerConfig", hidden_size, eps=1e-6, **kwargs)

Root Mean Square Layer Normalization. Creates a learnable weight parameter initialized to ones. The forward pass computes weight * (x / sqrt(mean(x^2) + eps)), converting to float32 for numerical stability.
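
The described forward pass can be sketched in a few lines. This is a standalone sketch without the TransformerConfig wiring, CPU-initialization option, or sequence_parallel attribute from the source:

```python
import torch
import torch.nn as nn


class RMSNormSketch(nn.Module):
    """Minimal RMSNorm matching the described computation:
    weight * (x / sqrt(mean(x^2) + eps)), done in float32."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dtype = x.dtype
        x = x.float()  # float32 for numerical stability
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return (self.weight * x).to(dtype)
```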

Key Functions

exists_hf_config

def exists_hf_config(model_name_or_path: str) -> bool  # line 143

Returns True if config.json exists in the given directory.

exists_mca_config

def exists_mca_config(model_name_or_path: str) -> bool  # line 147

Returns True if mca_config.json exists in the given directory.

check_and_get_attention_backend_by_env

def check_and_get_attention_backend_by_env(attention_backend: AttnBackend) -> AttnBackend  # lines 151-170

Resolves the attention backend based on NVTE environment variables when attention_backend is auto. Returns AttnBackend.flash, AttnBackend.fused, AttnBackend.unfused, AttnBackend.local, or AttnBackend.auto.

get_thd_data_on_this_cp_rank

def get_thd_data_on_this_cp_rank(
    batch: Dict[str, "torch.Tensor"],
    packed_seq_params: PackedSeqParams,
    dim3_keys: List[str] = ["attention_mask"],
) -> Dict[str, "torch.Tensor"]  # lines 173-196

Shards packed-sequence data in THD format (total tokens, heads, head dimension) for context parallelism. Uses Transformer Engine's thd_get_partitioned_indices to compute the sequence indices assigned to each CP rank, then index-selects along the sequence dimension.
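
Behaviorally, the function boils down to an index-select per CP rank. The sketch below substitutes a naive round-robin split for Transformer Engine's thd_get_partitioned_indices (which load-balances causal-attention work across ranks), so it illustrates the mechanics only:

```python
from typing import Dict, List

import torch


def shard_thd_sketch(batch: Dict[str, torch.Tensor],
                     cu_seqlens: List[int],
                     cp_rank: int,
                     cp_size: int,
                     seq_dim: int = 0) -> Dict[str, torch.Tensor]:
    """Toy CP shard: take every cp_size-th token within each packed sequence.
    The real function derives indices from thd_get_partitioned_indices."""
    indices = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        # Round-robin assignment of this packed sequence's tokens to ranks
        indices.extend(range(int(start) + cp_rank, int(end), cp_size))
    index = torch.tensor(indices, dtype=torch.long)
    return {k: v.index_select(seq_dim, index) for k, v in batch.items()}
```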

configure_resized_vocab_size

def configure_resized_vocab_size(
    original_vocab_size: int,
    tokenizer_len: int,
    pad_to_multiple_of: int = 64,
) -> Optional[int]  # lines 199-213

Returns a new vocabulary size padded to pad_to_multiple_of when the tokenizer is larger than the original vocab. Returns None if no resizing is needed.
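
The arithmetic can be sketched as a round-up to the next multiple. Note that the exact rounding rule in the source may differ (for example, to satisfy tensor-parallel divisibility constraints), so treat this as an approximation:

```python
from typing import Optional


def configure_resized_vocab_size_sketch(original_vocab_size: int,
                                        tokenizer_len: int,
                                        pad_to_multiple_of: int = 64) -> Optional[int]:
    """Hypothetical padding rule: round tokenizer_len up to a multiple."""
    if tokenizer_len <= original_vocab_size:
        return None  # model vocab already covers the tokenizer; no resize
    # Standard ceiling-division round-up to the next multiple
    return -(-tokenizer_len // pad_to_multiple_of) * pad_to_multiple_of
```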

Import

import torch
import torch.nn as nn
from megatron.core import mpu
from megatron.core.packed_seq_params import PackedSeqParams
from megatron.core.transformer.enums import AttnBackend

from mcore_adapter.models.model_utils import (
    ModuleUtilsMixin, RMSNorm,
    exists_hf_config, exists_mca_config,
    check_and_get_attention_backend_by_env,
    get_thd_data_on_this_cp_rank,
    configure_resized_vocab_size,
)

I/O Contract

Inputs

Name Type Required Description
only_trainable bool No If True, count only parameters with requires_grad (default: False)
exclude_embeddings bool No If True, exclude embedding weights from count (default: False)
input_dict Dict[str, Any] Yes Model input dictionary containing the main input tensor
model_name_or_path str Yes Path to model checkpoint directory
attention_backend AttnBackend Yes Current attention backend setting (auto triggers env var resolution)
original_vocab_size int Yes Original vocabulary size from model config
tokenizer_len int Yes Length of the tokenizer vocabulary

Outputs

Name Type Description
num_parameters int Total parameter count
estimated_tokens int Estimated number of tokens in the input batch
flops int Approximate floating-point operations
exists bool Whether a config file exists
backend AttnBackend Resolved attention backend
resized_vocab Optional[int] New vocab size or None if no resize needed

Usage Examples

import torch.nn as nn

from mcore_adapter.models.model_utils import (
    ModuleUtilsMixin, RMSNorm, configure_resized_vocab_size,
    check_and_get_attention_backend_by_env,
)
from megatron.core.transformer.enums import AttnBackend

# Parameter counting via mixin
class MyModel(nn.Module, ModuleUtilsMixin):
    pass
model = MyModel()
total_params = model.num_parameters()
trainable_params = model.num_parameters(only_trainable=True)

# Check attention backend
backend = check_and_get_attention_backend_by_env(AttnBackend.auto)

# Vocab resizing
new_size = configure_resized_vocab_size(
    original_vocab_size=151936,
    tokenizer_len=152000,
    pad_to_multiple_of=64,
)
# new_size = 152064
