Implementation: Microsoft DeepSpeedExamples `get_ds_model`
Overview
Concrete tool for initializing a DeepSpeed ZeRO Stage 3 wrapped model for inference.
Description
The `get_ds_model` function orchestrates the full initialization pipeline for creating a DeepSpeed-wrapped model ready for inference. It performs the following sequence:
1. Load the model configuration via `get_model_config()` to obtain architectural dimensions.
2. Initialize the distributed backend via `deepspeed.init_distributed("nccl")`.
3. Determine precision from `config.torch_dtype` (defaults to FP16 if unset).
4. Build the DeepSpeed configuration dictionary with ZeRO Stage 3 settings, including prefetch buffer sizes derived from the model's hidden size.
5. Apply the quantization config if 4-bit quantization is requested (via `get_quant_config()`).
6. Configure the offload target (CPU with optional pinned memory, or NVMe with async I/O and model-specific buffer sizes).
7. Register `HfDeepSpeedConfig` to signal HuggingFace to distribute weights during `from_pretrained`.
8. Clear the GPU cache to maximize available memory before model loading.
9. Load model weights using the appropriate HuggingFace model class (`BloomForCausalLM`, `OPTForCausalLM`, `LlamaForCausalLM`, or `AutoModelForCausalLM` for Mixtral).
10. Initialize the DeepSpeed engine via `deepspeed.initialize()` and set the model to eval mode.
Code Reference
Source
| Repository | File | Lines |
|---|---|---|
| DeepSpeedExamples | inference/huggingface/zero_inference/run_model.py | 61-170 |
Signature
def get_ds_model(
model_name, # str: HuggingFace model identifier
cpu_offload, # bool: whether to offload parameters to CPU
disk_offload, # bool: whether to offload parameters to NVMe
offload_dir, # str: directory path for NVMe offloading
dummy_weights, # str or None: path to dummy weights for benchmarking
bits, # int: quantization bit width (4, 8, or 16)
group_size, # int: quantization group size
):
"""Initialize a DeepSpeed ZeRO Stage 3 model for inference.
Returns:
model (nn.Module): DeepSpeed-wrapped model in eval mode
"""
Import
# get_ds_model is defined directly in run_model.py and uses:
import deepspeed
from deepspeed.accelerator import get_accelerator
from transformers import (AutoConfig, BloomForCausalLM, OPTForCausalLM,
LlamaForCausalLM, AutoModelForCausalLM)
from transformers.integrations.deepspeed import HfDeepSpeedConfig
from utils import GB, get_quant_config
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model_name` | `str` | Yes | HuggingFace model identifier (e.g., `"facebook/opt-66b"`) |
| `cpu_offload` | `bool` | Yes | Enable CPU offloading of model parameters |
| `disk_offload` | `bool` | Yes | Enable NVMe disk offloading of model parameters |
| `offload_dir` | `str` | Yes | Filesystem path for NVMe offload storage |
| `dummy_weights` | `str` or `None` | Yes | Path to dummy weights (for benchmarking) or `None` (use real weights) |
| `bits` | `int` | Yes | Weight quantization precision: 4 for INT4, 16 for FP16 |
| `group_size` | `int` | Yes | Number of weights per quantization group (e.g., 64) |
Global dependency: The function also reads `args.batch_size`, `args.pin_memory`, and `args.use_gds` from the global `args` namespace (parsed via argparse).
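Because these globals are not part of the signature, calling `get_ds_model` outside `run_model.py` requires a compatible `args` object in scope. A minimal sketch of the fields the function reads, using `argparse.Namespace` as a stand-in for the script's parsed arguments (the values shown are illustrative, not defaults from the source):

```python
from argparse import Namespace

# Stand-in for the module-level `args` that run_model.py builds with argparse.
# Only the three fields that get_ds_model reads are shown here.
args = Namespace(
    batch_size=8,    # becomes ds_config["train_batch_size"]
    pin_memory=1,    # cast to bool in the CPU offload_param section
    use_gds=False,   # forwarded to the "aio" section for NVMe offloading
)

print(args.batch_size, bool(args.pin_memory), args.use_gds)
```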
Outputs
| Name | Type | Description |
|---|---|---|
| `model` | `nn.Module` | DeepSpeed ZeRO Stage 3 wrapped model in eval mode, with parameters partitioned across GPUs and optionally offloaded to CPU/NVMe |
Internal Flow
The following illustrates the initialization sequence:
def get_ds_model(model_name, cpu_offload, disk_offload, offload_dir,
dummy_weights, bits, group_size):
# Step 1: Load configuration
config = get_model_config(model_name)
hidden_size = config.hidden_size
# Step 2: Initialize distributed
deepspeed.init_distributed("nccl")
# Step 3: Determine dtype
dtype = config.torch_dtype or torch.float16
# Step 4: Build ZeRO Stage 3 config
ds_config = {
"fp16": {"enabled": dtype == torch.float16},
"bf16": {"enabled": dtype == torch.bfloat16},
"zero_optimization": {
"stage": 3,
"stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
"stage3_param_persistence_threshold": hidden_size,
"stage3_max_live_parameters": 2 * hidden_size * hidden_size,
},
"train_batch_size": args.batch_size,
}
# Step 5: Apply quantization (if bits == 4)
if bits == 4:
ds_config.update(get_quant_config(config, bits, group_size))
# Step 6: Configure offload target
if cpu_offload:
ds_config["zero_optimization"]["offload_param"] = {
"device": "cpu", "pin_memory": bool(args.pin_memory)
}
if disk_offload:
ds_config["zero_optimization"]["offload_param"] = {
"device": "nvme", "nvme_path": offload_dir, ...
}
ds_config["aio"] = { "block_size": 16*1048576, ... }
# Step 7: Register with HuggingFace
dschf = HfDeepSpeedConfig(ds_config)
# Step 8: Clear GPU cache
get_accelerator().empty_cache()
# Step 9: Load model (dispatched by model_type)
model = ModelClass.from_pretrained(
dummy_weights or model_name, torch_dtype=dtype
)
model = model.eval()
# Step 10: Initialize DeepSpeed engine
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()
return ds_engine.module
DeepSpeed Configuration Structure
The complete ds_config dictionary built by this function has the following structure:
{
"fp16": {"enabled": True}, # or bf16 based on model dtype
"bf16": {"enabled": False},
"zero_optimization": {
"stage": 3,
"stage3_prefetch_bucket_size": 2 * H * H,
"stage3_param_persistence_threshold": H,
"stage3_max_live_parameters": 2 * H * H,
"offload_param": { # present only if offloading
"device": "cpu", # or "nvme"
"pin_memory": False,
# NVMe-only fields:
"nvme_path": "/path/to/offload",
"buffer_count": 5,
"buffer_size": 2147483648, # 2 GB
},
},
"steps_per_print": 2000,
"train_batch_size": 8,
"wall_clock_breakdown": False,
# Present only if bits == 4:
"weight_quantization": {
"quantized_initialization": {
"num_bits": 4,
"group_size": 64,
"group_dim": 1,
"symmetric": False,
}
},
# Present only for NVMe offloading:
"aio": {
"block_size": 16777216, # 16 MB
"queue_depth": 64,
"thread_count": 8,
"use_gds": False,
"single_submit": False,
"overlap_events": True,
},
}
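The size-dependent fields above can be reproduced with a small helper. This is a hypothetical sketch (the name `build_zero3_config` is ours, not part of `run_model.py`) that mirrors the arithmetic shown: the prefetch bucket and live-parameter limits scale with the square of the hidden size, and the CPU-offload and quantization sections are attached conditionally.

```python
# Hypothetical helper mirroring the size-dependent fields of ds_config above;
# not part of run_model.py itself.
def build_zero3_config(hidden_size, batch_size, cpu_offload=False,
                       pin_memory=False, bits=16, group_size=64):
    config = {
        "zero_optimization": {
            "stage": 3,
            "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
            "stage3_param_persistence_threshold": hidden_size,
            "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
        },
        "steps_per_print": 2000,
        "train_batch_size": batch_size,
        "wall_clock_breakdown": False,
    }
    if cpu_offload:
        config["zero_optimization"]["offload_param"] = {
            "device": "cpu",
            "pin_memory": bool(pin_memory),
        }
    if bits == 4:
        config["weight_quantization"] = {
            "quantized_initialization": {
                "num_bits": 4,
                "group_size": group_size,
                "group_dim": 1,
                "symmetric": False,
            }
        }
    return config

# OPT-175B has hidden_size 12288, so the prefetch bucket is 2 * 12288**2.
cfg = build_zero3_config(12288, batch_size=8, cpu_offload=True, bits=4)
print(cfg["zero_optimization"]["stage3_prefetch_bucket_size"])  # 301989888
```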
Model Class Dispatch
| `config.model_type` | Model Class | Examples |
|---|---|---|
| `"bloom"` or `"bloom-7b1"` | `BloomForCausalLM` | bigscience/bloom, bigscience/bloom-7b1 |
| `"opt"` | `OPTForCausalLM` | facebook/opt-66b, facebook/opt-175b |
| `"llama"` | `LlamaForCausalLM` | meta-llama/Llama-2-70b-hf |
| `"mixtral"` | `AutoModelForCausalLM` | mistralai/Mixtral-8x7B |
| Other | Raises `ValueError` | -- |
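The dispatch can be sketched as a lookup on `config.model_type`. In this hypothetical sketch (`select_model_class` is our name, not from the source), the class names are returned as strings so the example carries no `transformers` dependency:

```python
# Hypothetical sketch of the model-class dispatch table; returns class names
# as strings to avoid importing transformers here.
def select_model_class(model_type):
    dispatch = {
        "bloom": "BloomForCausalLM",
        "bloom-7b1": "BloomForCausalLM",
        "opt": "OPTForCausalLM",
        "llama": "LlamaForCausalLM",
        "mixtral": "AutoModelForCausalLM",
    }
    try:
        return dispatch[model_type]
    except KeyError:
        # Unsupported architectures are rejected, mirroring the table above.
        raise ValueError(f"Unsupported model type: {model_type!r}")

print(select_model_class("opt"))      # OPTForCausalLM
print(select_model_class("mixtral"))  # AutoModelForCausalLM
```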
Usage Example
import torch
# Initialize OPT-175B with CPU offload and 4-bit quantization
with torch.no_grad():
model = get_ds_model(
model_name="facebook/opt-175b",
cpu_offload=True,
disk_offload=False,
offload_dir="~/offload_dir",
dummy_weights=None, # use real weights
bits=4, # 4-bit quantization
group_size=64,
)
# model is now a DeepSpeed-wrapped OPTForCausalLM in eval mode
# Parameters are partitioned across GPUs with CPU offloading
# Weights are quantized to INT4 with group size 64