
Implementation:Microsoft DeepSpeedExamples Get Model Config

From Leeroopedia


Overview

Concrete tool for loading model configuration and tokenizer for ZeRO-Inference.

Description

The get_model_config and get_tokenizer functions provide the configuration and tokenizer loading step of the ZeRO-Inference pipeline. Together, they extract architectural metadata and text encoding capabilities from HuggingFace model identifiers, handling special cases where model artifacts are not directly available (e.g., OPT-175B).

The get_model_config function:

  1. Checks if the model name contains "175b", indicating the OPT-175B special case.
  2. For OPT-175B: loads facebook/opt-66b configuration and overrides five architectural parameters to match the 175B variant.
  3. For all other models: loads configuration directly via AutoConfig.from_pretrained with trust_remote_code=True.
  4. Explicitly sets config.model_type = 'bloom' for BLOOM models (ensuring consistent type detection).
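The branching in steps 1–3 can be sketched as pure string logic; `config_source` below is a hypothetical helper name for illustration, not part of the source file:

```python
# Hypothetical sketch of the config-source dispatch described in the steps above.
def config_source(model_name):
    if "175b" in model_name:       # step 1: detect the OPT-175B special case
        return "facebook/opt-66b"  # step 2: proxy config, overridden afterwards
    return model_name              # step 3: load directly via AutoConfig

assert config_source("facebook/opt-175b") == "facebook/opt-66b"
assert config_source("bigscience/bloom") == "bigscience/bloom"
```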

The get_tokenizer function:

  1. For OPT models: substitutes "175b" with "66b" in the model name to load from a publicly available tokenizer.
  2. For all other models: loads the tokenizer directly from the model name.
  3. Sets tokenizer.pad_token = tokenizer.eos_token for consistent padding during batch encoding.
  4. Configures padding_side="left" for OPT models (required for correct causal LM generation with padding).
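Step 1 is a plain string rewrite that only affects names containing "175b"; the sketch below (with a hypothetical helper `tokenizer_source`, not part of the source) shows which checkpoint the tokenizer actually comes from:

```python
# Hypothetical sketch of step 1: choosing the tokenizer checkpoint.
def tokenizer_source(model_name, model_type):
    if model_type == "opt":
        # "175b" -> "66b" so the tokenizer loads from a public checkpoint;
        # other OPT names are left unchanged by replace().
        return model_name.replace("175b", "66b")
    return model_name

assert tokenizer_source("facebook/opt-175b", "opt") == "facebook/opt-66b"
assert tokenizer_source("facebook/opt-66b", "opt") == "facebook/opt-66b"
assert tokenizer_source("bigscience/bloom", "bloom") == "bigscience/bloom"
```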

Code Reference

Source

Repository: DeepSpeedExamples
File: inference/huggingface/zero_inference/run_model.py
Lines: 30-59

Signature: get_model_config

def get_model_config(model_name):
    """Load model configuration, with special-case handling for OPT-175B.

    For OPT-175B: loads opt-66b config then overrides
    hidden_size=12288, word_embed_proj_dim=12288, ffn_dim=49152,
    num_attention_heads=96, num_hidden_layers=96.

    For all other models: loads config directly via AutoConfig.
    """
    if "175b" in model_name:
        config = AutoConfig.from_pretrained("facebook/opt-66b")
        config.hidden_size = 12288
        config.word_embed_proj_dim = 12288
        config.ffn_dim = 12288 * 4
        config.num_attention_heads = 96
        config.num_hidden_layers = 96
    else:
        config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

    if 'bloom' in model_name:
        config.model_type = 'bloom'

    return config

Signature: get_tokenizer

def get_tokenizer(model_name, config):
    """Load tokenizer with model-specific handling.

    For OPT models: substitutes opt-175b -> opt-66b for tokenizer source,
    uses left padding for proper causal LM generation.

    For all other models: loads tokenizer directly.
    Sets pad_token = eos_token for all models.
    """
    if config.model_type == "opt":
        tokenizer = AutoTokenizer.from_pretrained(
            model_name.replace("175b", "66b"),
            padding_side="left"
        )
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_name)

    tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

Import

# These functions are defined directly in run_model.py
# They use the following imports:
from transformers import AutoConfig, AutoTokenizer

I/O Contract

get_model_config

Inputs:

  • model_name (str, required): HuggingFace model identifier (e.g., "facebook/opt-66b", "bigscience/bloom", "meta-llama/Llama-2-70b-hf")

Outputs:

  • config (PretrainedConfig): Model configuration object with hidden_size, num_hidden_layers, num_attention_heads, vocab_size, model_type, and torch_dtype attributes. (AutoConfig.from_pretrained returns a model-specific PretrainedConfig subclass, e.g. OPTConfig.)

get_tokenizer

Inputs:

  • model_name (str, required): HuggingFace model identifier
  • config (PretrainedConfig, required): Model configuration; its model_type attribute determines the OPT special-casing

Outputs:

  • tokenizer (PreTrainedTokenizer): Tokenizer with pad_token set to eos_token; OPT tokenizers are created with padding_side="left"

Usage Examples

Standard Model Configuration

# Load configuration for LLaMA-2-70B
config = get_model_config("meta-llama/Llama-2-70b-hf")
# config.hidden_size = 8192
# config.num_hidden_layers = 80
# config.num_attention_heads = 64
# config.model_type = "llama"

tokenizer = get_tokenizer("meta-llama/Llama-2-70b-hf", config)
# tokenizer.pad_token == tokenizer.eos_token

OPT-175B Special Case

# OPT-175B: loads from opt-66b and overrides dimensions
config = get_model_config("facebook/opt-175b")
# config.hidden_size = 12288  (overridden from opt-66b's 9216)
# config.num_hidden_layers = 96  (overridden from opt-66b's 64)
# config.num_attention_heads = 96  (overridden from opt-66b's 72)
# config.ffn_dim = 49152  (overridden from opt-66b's 36864)
# config.model_type = "opt"

tokenizer = get_tokenizer("facebook/opt-175b", config)
# Actually loads from "facebook/opt-66b" with padding_side="left"
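The override arithmetic can be sanity-checked without loading any model. This small check (not part of the source) confirms the 175B values follow OPT's published scaling pattern:

```python
# The five overrides applied on top of the opt-66b config.
overrides = {
    "hidden_size": 12288,
    "word_embed_proj_dim": 12288,
    "ffn_dim": 12288 * 4,  # 49152
    "num_attention_heads": 96,
    "num_hidden_layers": 96,
}

# OPT keeps ffn_dim = 4 * hidden_size across variants (66B: 36864 = 4 * 9216).
assert overrides["ffn_dim"] == 4 * overrides["hidden_size"]
# The per-head dimension stays 128, as in OPT-66B (9216 / 72 == 128).
assert overrides["hidden_size"] // overrides["num_attention_heads"] == 128
```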

BLOOM Model Configuration

# BLOOM: explicit model_type override
config = get_model_config("bigscience/bloom")
# config.model_type = 'bloom'  (explicitly set)

tokenizer = get_tokenizer("bigscience/bloom", config)

Design Decisions

  • OPT-175B proxy configuration: Since OPT-175B weights and configuration are not publicly available on HuggingFace, the code derives its architecture from OPT-66B. This approach avoids hardcoding the full configuration while leveraging the published architectural scaling pattern.
  • Left padding for OPT: OPT models require left-side padding because the causal language model generates tokens from left to right, and right-padded inputs would place padding tokens before the generation position.
  • trust_remote_code=True: Enabled for non-OPT-175B models to support architectures (like Mixtral) that require custom modeling code hosted on HuggingFace.
  • Explicit BLOOM model_type: The BLOOM model's configuration may not always set model_type correctly, so it is explicitly overridden when 'bloom' appears in the model name.
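The left-padding rationale can be illustrated with plain lists, no tokenizer required; `pad_batch` below is a hypothetical helper for illustration only:

```python
def pad_batch(seqs, pad_id, side):
    """Pad a batch of token-id lists to equal length on the given side."""
    width = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        pad = [pad_id] * (width - len(s))
        padded.append(pad + s if side == "left" else s + pad)
    return padded

batch = [[5, 6, 7], [8]]
left = pad_batch(batch, 0, "left")    # [[5, 6, 7], [0, 0, 8]]
right = pad_batch(batch, 0, "right")  # [[5, 6, 7], [8, 0, 0]]

# Left padding: every row ends in its last real token, so a causal LM
# appends new tokens directly after real content for the whole batch.
assert [row[-1] for row in left] == [7, 8]
# Right padding: the short row ends in pad tokens, so naive appending
# would place generated tokens after padding rather than after the prompt.
assert right[1][-1] == 0
```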
