
Implementation:Microsoft DeepSpeedExamples Get DS Model

From Leeroopedia


Overview

A concrete utility for initializing a DeepSpeed ZeRO Stage 3-wrapped model for inference.

Description

The get_ds_model function orchestrates the full initialization pipeline for creating a DeepSpeed-wrapped model ready for inference. It performs the following sequence:

  1. Load model configuration via get_model_config() to obtain architectural dimensions.
  2. Initialize distributed backend via deepspeed.init_distributed("nccl").
  3. Determine precision from config.torch_dtype (defaults to FP16 if unset).
  4. Build DeepSpeed configuration dictionary with ZeRO Stage 3 settings, including prefetch buffer sizes derived from the model's hidden size.
  5. Apply quantization config if 4-bit quantization is requested (via get_quant_config()).
  6. Configure offload target (CPU with optional pinned memory, or NVMe with async I/O and model-specific buffer sizes).
  7. Register HfDeepSpeedConfig to signal HuggingFace to distribute weights during from_pretrained.
  8. Clear GPU cache to maximize available memory before model loading.
  9. Load model weights using the appropriate HuggingFace model class (BloomForCausalLM, OPTForCausalLM, LlamaForCausalLM, or AutoModelForCausalLM for Mixtral).
  10. Initialize DeepSpeed engine via deepspeed.initialize() and set model to eval mode.

Code Reference

Source

| Repository | File | Lines |
|---|---|---|
| DeepSpeedExamples | inference/huggingface/zero_inference/run_model.py | 61-170 |

Signature

def get_ds_model(
    model_name,       # str: HuggingFace model identifier
    cpu_offload,      # bool: whether to offload parameters to CPU
    disk_offload,     # bool: whether to offload parameters to NVMe
    offload_dir,      # str: directory path for NVMe offloading
    dummy_weights,    # str or None: path to dummy weights for benchmarking
    bits,             # int: quantization bit width (4, 8, or 16)
    group_size,       # int: quantization group size
):
    """Initialize a DeepSpeed ZeRO Stage 3 model for inference.

    Returns:
        model (nn.Module): DeepSpeed-wrapped model in eval mode
    """

Import

# get_ds_model is defined directly in run_model.py and uses:
import deepspeed
from deepspeed.accelerator import get_accelerator
from transformers import (AutoConfig, BloomForCausalLM, OPTForCausalLM,
                          LlamaForCausalLM, AutoModelForCausalLM)
from transformers.integrations.deepspeed import HfDeepSpeedConfig
from utils import GB, get_quant_config

I/O Contract

Inputs

| Parameter | Type | Required | Description |
|---|---|---|---|
| model_name | str | Yes | HuggingFace model identifier (e.g., "facebook/opt-66b") |
| cpu_offload | bool | Yes | Enable CPU offloading of model parameters |
| disk_offload | bool | Yes | Enable NVMe disk offloading of model parameters |
| offload_dir | str | Yes | Filesystem path for NVMe offload storage |
| dummy_weights | str or None | Yes | Path to dummy weights (for benchmarking) or None (use real weights) |
| bits | int | Yes | Weight quantization precision: 4 for INT4, 16 for FP16 |
| group_size | int | Yes | Number of weights per quantization group (e.g., 64) |

Global dependency: The function also reads args.batch_size, args.pin_memory, and args.use_gds from the global args namespace (parsed via argparse).
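Those global fields come from the script's argument parser. A minimal sketch of how such a namespace could be built (flag names mirror the attributes listed above; defaults are illustrative assumptions, not the script's actual defaults):

```python
import argparse

# Hypothetical reconstruction of the argparse flags that get_ds_model
# reads from the global `args` namespace. Defaults are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=1,
                    help="used as ds_config['train_batch_size']")
parser.add_argument("--pin_memory", type=int, default=0,
                    help="pin host memory for CPU offload (0 or 1)")
parser.add_argument("--use_gds", action="store_true",
                    help="enable GPUDirect Storage for NVMe offload")

args = parser.parse_args(["--batch_size", "8", "--pin_memory", "1"])
```

With `args` in module scope, `get_ds_model` can read `args.batch_size`, `args.pin_memory`, and `args.use_gds` without taking them as parameters.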

Outputs

| Name | Type | Description |
|---|---|---|
| model | nn.Module | DeepSpeed ZeRO Stage 3 wrapped model in eval mode, with parameters partitioned across GPUs and optionally offloaded to CPU/NVMe |

Internal Flow

The following condensed sketch illustrates the initialization sequence; ellipses mark fields omitted for brevity:

def get_ds_model(model_name, cpu_offload, disk_offload, offload_dir,
                 dummy_weights, bits, group_size):
    # Step 1: Load configuration
    config = get_model_config(model_name)
    hidden_size = config.hidden_size

    # Step 2: Initialize distributed
    deepspeed.init_distributed("nccl")

    # Step 3: Determine dtype
    dtype = config.torch_dtype or torch.float16

    # Step 4: Build ZeRO Stage 3 config
    ds_config = {
        "fp16": {"enabled": dtype == torch.float16},
        "bf16": {"enabled": dtype == torch.bfloat16},
        "zero_optimization": {
            "stage": 3,
            "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
            "stage3_param_persistence_threshold": hidden_size,
            "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
        },
        "train_batch_size": args.batch_size,
    }

    # Step 5: Apply quantization (if bits == 4)
    if bits == 4:
        ds_config.update(get_quant_config(config, bits, group_size))

    # Step 6: Configure offload target
    if cpu_offload:
        ds_config["zero_optimization"]["offload_param"] = {
            "device": "cpu", "pin_memory": bool(args.pin_memory)
        }
    if disk_offload:
        ds_config["zero_optimization"]["offload_param"] = {
            "device": "nvme", "nvme_path": offload_dir, ...
        }
        ds_config["aio"] = { "block_size": 16*1048576, ... }

    # Step 7: Register with HuggingFace
    dschf = HfDeepSpeedConfig(ds_config)

    # Step 8: Clear GPU cache
    get_accelerator().empty_cache()

    # Step 9: Load model (dispatched by model_type)
    model = ModelClass.from_pretrained(
        dummy_weights or model_name, torch_dtype=dtype
    )
    model = model.eval()

    # Step 10: Initialize DeepSpeed engine
    ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
    ds_engine.module.eval()
    return ds_engine.module
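Step 6 can be isolated into a small pure-Python helper for clarity. This is a sketch, not the repository's code: the function name is hypothetical, and the NVMe buffer values are taken from the "DeepSpeed Configuration Structure" section. Note that disk offload wins when both flags are set, matching the order of the `if` blocks above:

```python
GB = 1 << 30  # byte constant, mirroring utils.GB

def build_offload_param(cpu_offload, disk_offload, offload_dir, pin_memory):
    """Hypothetical helper mirroring Step 6: choose the ZeRO-3 offload target.

    Returns the dict placed under ds_config["zero_optimization"]["offload_param"],
    or None when no offloading is requested.
    """
    offload = None
    if cpu_offload:
        offload = {"device": "cpu", "pin_memory": bool(pin_memory)}
    if disk_offload:  # evaluated second, so NVMe takes precedence
        offload = {
            "device": "nvme",
            "nvme_path": offload_dir,
            "pin_memory": bool(pin_memory),
            "buffer_count": 5,
            "buffer_size": 2 * GB,
        }
    return offload

cfg = build_offload_param(False, True, "/mnt/nvme/offload", True)
```

Here `cfg["device"]` is `"nvme"` and `cfg["buffer_size"]` is 2 GiB (2147483648 bytes), matching the structure shown below.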

DeepSpeed Configuration Structure

The complete ds_config dictionary built by this function has the following structure:

{
    "fp16": {"enabled": True},        # or bf16 based on model dtype
    "bf16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 2 * H * H,
        "stage3_param_persistence_threshold": H,
        "stage3_max_live_parameters": 2 * H * H,
        "offload_param": {             # present only if offloading
            "device": "cpu",           # or "nvme"
            "pin_memory": False,
            # NVMe-only fields:
            "nvme_path": "/path/to/offload",
            "buffer_count": 5,
            "buffer_size": 2147483648, # 2 GB
        },
    },
    "steps_per_print": 2000,
    "train_batch_size": 8,
    "wall_clock_breakdown": False,
    # Present only if bits == 4:
    "weight_quantization": {
        "quantized_initialization": {
            "num_bits": 4,
            "group_size": 64,
            "group_dim": 1,
            "symmetric": False,
        }
    },
    # Present only for NVMe offloading:
    "aio": {
        "block_size": 16777216,        # 16 MB
        "queue_depth": 64,
        "thread_count": 8,
        "use_gds": False,
        "single_submit": False,
        "overlap_events": True,
    },
}
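The ZeRO-3 sizing fields all derive from the model's hidden size H. A small helper (name is hypothetical) makes the arithmetic explicit; for an illustrative H = 4096, the prefetch bucket and max-live-parameters limits both come to 2·H² = 33,554,432 elements:

```python
def zero3_sizes(hidden_size):
    # Derive the ZeRO Stage 3 sizing fields from hidden_size, exactly as
    # in the configuration structure above (H = hidden_size).
    return {
        "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
        "stage3_param_persistence_threshold": hidden_size,
        "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
    }

sizes = zero3_sizes(4096)  # hypothetical hidden size for illustration
```

Parameters smaller than the persistence threshold (here, fewer than H elements) are kept resident on each GPU rather than repeatedly gathered and released.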

Model Class Dispatch

| config.model_type | Model Class | Examples |
|---|---|---|
| "bloom" or "bloom-7b1" | BloomForCausalLM | bigscience/bloom, bigscience/bloom-7b1 |
| "opt" | OPTForCausalLM | facebook/opt-66b, facebook/opt-175b |
| "llama" | LlamaForCausalLM | meta-llama/Llama-2-70b-hf |
| "mixtral" | AutoModelForCausalLM | mistralai/Mixtral-8x7B |
| Other | Raises ValueError | -- |
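The dispatch in the table above can be sketched as a plain lookup on `config.model_type`. Class names are kept as strings so the sketch stays self-contained; the real code imports the classes from transformers and calls `from_pretrained` on the selected one:

```python
# Mapping from config.model_type to the HuggingFace model class name.
_MODEL_CLASS_BY_TYPE = {
    "bloom": "BloomForCausalLM",
    "bloom-7b1": "BloomForCausalLM",
    "opt": "OPTForCausalLM",
    "llama": "LlamaForCausalLM",
    "mixtral": "AutoModelForCausalLM",
}

def resolve_model_class(model_type):
    # Mirror the dispatch table: unsupported model types raise ValueError.
    try:
        return _MODEL_CLASS_BY_TYPE[model_type]
    except KeyError:
        raise ValueError(f"Unsupported model_type: {model_type!r}")
```

For example, `resolve_model_class("opt")` yields `"OPTForCausalLM"`, while an unrecognized type such as `"gpt2"` raises ValueError.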

Usage Example

import torch

# Initialize OPT-175B with CPU offload and 4-bit quantization
with torch.no_grad():
    model = get_ds_model(
        model_name="facebook/opt-175b",
        cpu_offload=True,
        disk_offload=False,
        offload_dir="~/offload_dir",
        dummy_weights=None,       # use real weights
        bits=4,                   # 4-bit quantization
        group_size=64,
    )

# model is now a DeepSpeed-wrapped OPTForCausalLM in eval mode
# Parameters are partitioned across GPUs with CPU offloading
# Weights are quantized to INT4 with group size 64
