Principle:Microsoft DeepSpeedExamples ZeRO Model Configuration

From Leeroopedia


Domains

  • NLP
  • Model_Architecture
  • Inference

Overview

A configuration pattern for initializing model metadata and tokenizer settings before distributed weight loading.

Description

Before initializing a model in ZeRO-Inference, the model configuration and tokenizer must be loaded independently of the model weights. This separation is essential because:

  1. Memory estimation: The configuration provides the architectural parameters (hidden size, number of layers, number of attention heads) needed to calculate memory requirements and configure ZeRO Stage 3 buffer sizes before any weights are loaded.
  2. Config overrides: Some models require configuration modifications. For example, OPT-175B's configuration is not publicly available on HuggingFace, so it must be constructed by loading OPT-66B's configuration and overriding the dimension parameters to match the 175B variant.
  3. Tokenizer compatibility: The tokenizer must be loaded from a compatible model when the exact model weights are unavailable. OPT-175B uses the OPT-66B tokenizer since the 175B variant is not hosted on HuggingFace.
  4. Model type detection: The configuration's model_type attribute determines which model class to instantiate (BloomForCausalLM, OPTForCausalLM, LlamaForCausalLM, or AutoModelForCausalLM for Mixtral).
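The dispatch described in point 4 can be sketched as a lookup table. This is an illustration of the selection logic, not the repository's exact code; the helper name is hypothetical.

```python
# Illustrative dispatch from a configuration's model_type to the class
# named above. Mixtral goes through AutoModelForCausalLM.
MODEL_CLASS_BY_TYPE = {
    "bloom": "BloomForCausalLM",
    "opt": "OPTForCausalLM",
    "llama": "LlamaForCausalLM",
    "mixtral": "AutoModelForCausalLM",
}

def select_model_class(model_type: str) -> str:
    """Return the name of the HuggingFace class to instantiate."""
    if model_type not in MODEL_CLASS_BY_TYPE:
        raise ValueError(f"unsupported model_type: {model_type!r}")
    return MODEL_CLASS_BY_TYPE[model_type]
```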

Configuration Flow

The configuration loading follows this sequence:

  1. Load AutoConfig from the HuggingFace model identifier (or override source for special cases).
  2. Apply any necessary dimension overrides (e.g., OPT-175B).
  3. Ensure model_type is correctly set (e.g., explicitly set to 'bloom' for BLOOM models).
  4. Load the tokenizer from the model identifier (or compatible substitute).
  5. Set pad_token = eos_token for consistent padding behavior.
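The five steps above can be sketched end to end. The loader callables stand in for transformers' `AutoConfig.from_pretrained` / `AutoTokenizer.from_pretrained`, and the name-based OPT-175B and BLOOM checks are illustrative assumptions, not the script's actual dispatch.

```python
# Minimal sketch of steps 1-5 of the configuration flow.
def prepare_config_and_tokenizer(model_name, load_config, load_tokenizer):
    # Step 1: pick the config source; OPT-175B is not hosted, so use OPT-66B.
    source = "facebook/opt-66b" if "175b" in model_name else model_name
    config = load_config(source)
    # Step 2: dimension overrides for OPT-175B (full set in the
    # "OPT-175B Configuration Derivation" section).
    if "175b" in model_name:
        config.hidden_size = 12288
        config.num_hidden_layers = 96
        config.num_attention_heads = 96
    # Step 3: make sure model_type is set for BLOOM checkpoints.
    if "bloom" in model_name:
        config.model_type = "bloom"
    # Step 4: tokenizer from the same (possibly substituted) source.
    tokenizer = load_tokenizer(source)
    # Step 5: consistent padding behavior.
    tokenizer.pad_token = tokenizer.eos_token
    return config, tokenizer
```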

Supported Model Types

Model Type      | Config Source                                    | Tokenizer Source        | Model Class
OPT (up to 66B) | Direct from HuggingFace                          | Direct from HuggingFace | OPTForCausalLM
OPT-175B        | facebook/opt-66b + overrides                     | facebook/opt-66b        | OPTForCausalLM
BLOOM           | Direct from HuggingFace                          | Direct from HuggingFace | BloomForCausalLM
LLaMA 2         | Direct from HuggingFace                          | Direct from HuggingFace | LlamaForCausalLM
Mixtral         | Direct from HuggingFace (trust_remote_code=True) | Direct from HuggingFace | AutoModelForCausalLM

Theoretical Basis

Model configuration precedes weight loading in distributed settings because memory must be pre-allocated across devices. In ZeRO Stage 3, the DeepSpeed configuration requires knowledge of the model's hidden size to set:

  • stage3_prefetch_bucket_size: 2 * hidden_size * hidden_size -- the size of prefetch buffers for parameter gathering.
  • stage3_param_persistence_threshold: hidden_size -- parameters smaller than this threshold are kept on all ranks.
  • stage3_max_live_parameters: 2 * hidden_size * hidden_size -- the maximum number of parameters materialized at any time.

These values directly depend on architectural dimensions extracted from the configuration, making configuration loading a prerequisite for DeepSpeed initialization.
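Given a loaded configuration, the three values above reduce to simple arithmetic on hidden_size. A sketch; the three keys mirror the DeepSpeed config fields listed above, while wrapping them in a helper like this is an illustrative choice:

```python
# Derive the ZeRO Stage 3 sizing knobs from the model's hidden size.
def zero3_buffer_config(hidden_size: int) -> dict:
    return {
        "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
        "stage3_param_persistence_threshold": hidden_size,
        "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
    }
```

For OPT-175B (hidden_size = 12288), both quadratic values come to 301,989,888 elements.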

OPT-175B Configuration Derivation

The OPT-175B model follows the OPT architecture scaling conventions. Its configuration is derived from OPT-66B with the following overrides:

Parameter           | OPT-66B Value    | OPT-175B Value
hidden_size         | 9216             | 12288
word_embed_proj_dim | 9216             | 12288
ffn_dim             | 36864 (9216 * 4) | 49152 (12288 * 4)
num_attention_heads | 72               | 96
num_hidden_layers   | 64               | 96
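Applied in code, the derivation amounts to setting attributes on the OPT-66B config object. The names `OPT_175B_OVERRIDES` and `apply_overrides` are illustrative; a plain object stands in for the AutoConfig instance:

```python
# The OPT-175B overrides from the table, applied to a config loaded
# from facebook/opt-66b.
OPT_175B_OVERRIDES = {
    "hidden_size": 12288,
    "word_embed_proj_dim": 12288,
    "ffn_dim": 4 * 12288,  # 49152, following OPT's ffn = 4 * hidden convention
    "num_attention_heads": 96,
    "num_hidden_layers": 96,
}

def apply_overrides(config, overrides=OPT_175B_OVERRIDES):
    for name, value in overrides.items():
        setattr(config, name, value)
    return config
```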

Key Configuration Attributes

Attribute           | Type        | Usage in ZeRO-Inference
hidden_size         | int         | Determines ZeRO Stage 3 buffer sizes; used for memory estimation
num_hidden_layers   | int         | Used for KV cache size calculation: cache = 2 * batch * seq_len * layers * hidden
num_attention_heads | int         | Determines attention computation shape
vocab_size          | int         | Used in total model size estimation (embedding layer)
model_type          | str         | Determines which model class and NVMe buffer configuration to use
torch_dtype         | torch.dtype | Determines FP16 vs BF16 inference; defaults to torch.float16 if not set
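The KV cache formula in the table can be made concrete. The leading factor 2 counts the key and value tensors per layer; the bytes-per-element multiplier (2 for FP16/BF16) is an assumption added here, not stated in the table:

```python
# Element count of the KV cache: key + value per layer, each of size
# batch * seq_len * hidden_size.
def kv_cache_elements(batch, seq_len, num_hidden_layers, hidden_size):
    return 2 * batch * seq_len * num_hidden_layers * hidden_size

# Byte estimate; bytes_per_elem=2 assumes FP16/BF16 storage.
def kv_cache_bytes(batch, seq_len, num_hidden_layers, hidden_size, bytes_per_elem=2):
    return kv_cache_elements(batch, seq_len, num_hidden_layers, hidden_size) * bytes_per_elem
```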
