
Implementation:Microsoft DeepSpeedExamples Get DS Model

From Leeroopedia


Overview

A concrete utility for initializing a DeepSpeed ZeRO Stage 3-wrapped model for inference.

Description

The get_ds_model function orchestrates the full initialization pipeline for creating a DeepSpeed-wrapped model ready for inference. It performs the following sequence:

  1. Load model configuration via get_model_config() to obtain architectural dimensions.
  2. Initialize distributed backend via deepspeed.init_distributed("nccl").
  3. Determine precision from config.torch_dtype (defaults to FP16 if unset).
  4. Build DeepSpeed configuration dictionary with ZeRO Stage 3 settings, including prefetch buffer sizes derived from the model's hidden size.
  5. Apply quantization config if 4-bit quantization is requested (via get_quant_config()).
  6. Configure offload target (CPU with optional pinned memory, or NVMe with async I/O and model-specific buffer sizes).
  7. Register HfDeepSpeedConfig to signal HuggingFace to distribute weights during from_pretrained.
  8. Clear GPU cache to maximize available memory before model loading.
  9. Load model weights using the appropriate HuggingFace model class (BloomForCausalLM, OPTForCausalLM, LlamaForCausalLM, or AutoModelForCausalLM for Mixtral).
  10. Initialize DeepSpeed engine via deepspeed.initialize() and set model to eval mode.

Code Reference

Source

| Repository | File | Lines |
|---|---|---|
| DeepSpeedExamples | inference/huggingface/zero_inference/run_model.py | 61-170 |

Signature

def get_ds_model(
    model_name,       # str: HuggingFace model identifier
    cpu_offload,      # bool: whether to offload parameters to CPU
    disk_offload,     # bool: whether to offload parameters to NVMe
    offload_dir,      # str: directory path for NVMe offloading
    dummy_weights,    # str or None: path to dummy weights for benchmarking
    bits,             # int: quantization bit width (4, 8, or 16)
    group_size,       # int: quantization group size
):
    """Initialize a DeepSpeed ZeRO Stage 3 model for inference.

    Returns:
        model (nn.Module): DeepSpeed-wrapped model in eval mode
    """

Import

# get_ds_model is defined directly in run_model.py and uses:
import deepspeed
from deepspeed.accelerator import get_accelerator
from transformers import (AutoConfig, BloomForCausalLM, OPTForCausalLM,
                          LlamaForCausalLM, AutoModelForCausalLM)
from transformers.integrations.deepspeed import HfDeepSpeedConfig
from utils import GB, get_quant_config

I/O Contract

Inputs

| Parameter | Type | Required | Description |
|---|---|---|---|
| model_name | str | Yes | HuggingFace model identifier (e.g., "facebook/opt-66b") |
| cpu_offload | bool | Yes | Enable CPU offloading of model parameters |
| disk_offload | bool | Yes | Enable NVMe disk offloading of model parameters |
| offload_dir | str | Yes | Filesystem path for NVMe offload storage |
| dummy_weights | str or None | Yes | Path to dummy weights (for benchmarking) or None (use real weights) |
| bits | int | Yes | Weight quantization precision: 4 for INT4, 16 for FP16 |
| group_size | int | Yes | Number of weights per quantization group (e.g., 64) |

Global dependency: The function also reads args.batch_size, args.pin_memory, and args.use_gds from the global args namespace (parsed via argparse).
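Those global fields come from the script's argument parser. A minimal sketch of how such a namespace could be built (flag names mirror the attributes listed above; defaults are illustrative assumptions, not the script's actual defaults):

```python
import argparse

# Hypothetical reconstruction of the argparse flags that get_ds_model
# reads from the global `args` namespace. Defaults are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=1,
                    help="used as ds_config['train_batch_size']")
parser.add_argument("--pin_memory", type=int, default=0,
                    help="pin host memory for CPU offload (0 or 1)")
parser.add_argument("--use_gds", action="store_true",
                    help="enable GPUDirect Storage for NVMe offload")

args = parser.parse_args(["--batch_size", "8", "--pin_memory", "1"])
```

With `args` in module scope, `get_ds_model` can read `args.batch_size`, `args.pin_memory`, and `args.use_gds` without taking them as parameters.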

Outputs

| Name | Type | Description |
|---|---|---|
| model | nn.Module | DeepSpeed ZeRO Stage 3 wrapped model in eval mode, with parameters partitioned across GPUs and optionally offloaded to CPU/NVMe |

Internal Flow

The following condensed sketch illustrates the initialization sequence; ellipses mark fields omitted for brevity:

def get_ds_model(model_name, cpu_offload, disk_offload, offload_dir,
                 dummy_weights, bits, group_size):
    # Step 1: Load configuration
    config = get_model_config(model_name)
    hidden_size = config.hidden_size

    # Step 2: Initialize distributed
    deepspeed.init_distributed("nccl")

    # Step 3: Determine dtype
    dtype = config.torch_dtype or torch.float16

    # Step 4: Build ZeRO Stage 3 config
    ds_config = {
        "fp16": {"enabled": dtype == torch.float16},
        "bf16": {"enabled": dtype == torch.bfloat16},
        "zero_optimization": {
            "stage": 3,
            "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
            "stage3_param_persistence_threshold": hidden_size,
            "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
        },
        "train_batch_size": args.batch_size,
    }

    # Step 5: Apply quantization (if bits == 4)
    if bits == 4:
        ds_config.update(get_quant_config(config, bits, group_size))

    # Step 6: Configure offload target
    if cpu_offload:
        ds_config["zero_optimization"]["offload_param"] = {
            "device": "cpu", "pin_memory": bool(args.pin_memory)
        }
    if disk_offload:
        ds_config["zero_optimization"]["offload_param"] = {
            "device": "nvme", "nvme_path": offload_dir, ...
        }
        ds_config["aio"] = { "block_size": 16*1048576, ... }

    # Step 7: Register with HuggingFace
    dschf = HfDeepSpeedConfig(ds_config)

    # Step 8: Clear GPU cache
    get_accelerator().empty_cache()

    # Step 9: Load model (dispatched by model_type)
    model = ModelClass.from_pretrained(
        dummy_weights or model_name, torch_dtype=dtype
    )
    model = model.eval()

    # Step 10: Initialize DeepSpeed engine
    ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
    ds_engine.module.eval()
    return ds_engine.module
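Step 6 can be isolated into a small pure-Python helper for clarity. This is a sketch, not the repository's code: the function name is hypothetical, and the NVMe buffer values are taken from the "DeepSpeed Configuration Structure" section. Note that disk offload wins when both flags are set, matching the order of the `if` blocks above:

```python
GB = 1 << 30  # byte constant, mirroring utils.GB

def build_offload_param(cpu_offload, disk_offload, offload_dir, pin_memory):
    """Hypothetical helper mirroring Step 6: choose the ZeRO-3 offload target.

    Returns the dict placed under ds_config["zero_optimization"]["offload_param"],
    or None when no offloading is requested.
    """
    offload = None
    if cpu_offload:
        offload = {"device": "cpu", "pin_memory": bool(pin_memory)}
    if disk_offload:  # evaluated second, so NVMe takes precedence
        offload = {
            "device": "nvme",
            "nvme_path": offload_dir,
            "pin_memory": bool(pin_memory),
            "buffer_count": 5,
            "buffer_size": 2 * GB,
        }
    return offload

cfg = build_offload_param(False, True, "/mnt/nvme/offload", True)
```

Here `cfg["device"]` is `"nvme"` and `cfg["buffer_size"]` is 2 GiB (2147483648 bytes), matching the structure shown below.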

DeepSpeed Configuration Structure

The complete ds_config dictionary built by this function has the following structure:

{
    "fp16": {"enabled": True},        # or bf16 based on model dtype
    "bf16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 2 * H * H,
        "stage3_param_persistence_threshold": H,
        "stage3_max_live_parameters": 2 * H * H,
        "offload_param": {             # present only if offloading
            "device": "cpu",           # or "nvme"
            "pin_memory": False,
            # NVMe-only fields:
            "nvme_path": "/path/to/offload",
            "buffer_count": 5,
            "buffer_size": 2147483648, # 2 GB
        },
    },
    "steps_per_print": 2000,
    "train_batch_size": 8,
    "wall_clock_breakdown": False,
    # Present only if bits == 4:
    "weight_quantization": {
        "quantized_initialization": {
            "num_bits": 4,
            "group_size": 64,
            "group_dim": 1,
            "symmetric": False,
        }
    },
    # Present only for NVMe offloading:
    "aio": {
        "block_size": 16777216,        # 16 MB
        "queue_depth": 64,
        "thread_count": 8,
        "use_gds": False,
        "single_submit": False,
        "overlap_events": True,
    },
}
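The ZeRO-3 sizing fields all derive from the model's hidden size H. A small helper (name is hypothetical) makes the arithmetic explicit; for an illustrative H = 4096, the prefetch bucket and max-live-parameters limits both come to 2·H² = 33,554,432 elements:

```python
def zero3_sizes(hidden_size):
    # Derive the ZeRO Stage 3 sizing fields from hidden_size, exactly as
    # in the configuration structure above (H = hidden_size).
    return {
        "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
        "stage3_param_persistence_threshold": hidden_size,
        "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
    }

sizes = zero3_sizes(4096)  # hypothetical hidden size for illustration
```

Parameters smaller than the persistence threshold (here, fewer than H elements) are kept resident on each GPU rather than repeatedly gathered and released.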

Model Class Dispatch

| config.model_type | Model Class | Examples |
|---|---|---|
| "bloom" or "bloom-7b1" | BloomForCausalLM | bigscience/bloom, bigscience/bloom-7b1 |
| "opt" | OPTForCausalLM | facebook/opt-66b, facebook/opt-175b |
| "llama" | LlamaForCausalLM | meta-llama/Llama-2-70b-hf |
| "mixtral" | AutoModelForCausalLM | mistralai/Mixtral-8x7B |
| Other | Raises ValueError | -- |
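The dispatch in the table above can be sketched as a plain lookup on `config.model_type`. Class names are kept as strings so the sketch stays self-contained; the real code imports the classes from transformers and calls `from_pretrained` on the selected one:

```python
# Mapping from config.model_type to the HuggingFace model class name.
_MODEL_CLASS_BY_TYPE = {
    "bloom": "BloomForCausalLM",
    "bloom-7b1": "BloomForCausalLM",
    "opt": "OPTForCausalLM",
    "llama": "LlamaForCausalLM",
    "mixtral": "AutoModelForCausalLM",
}

def resolve_model_class(model_type):
    # Mirror the dispatch table: unsupported model types raise ValueError.
    try:
        return _MODEL_CLASS_BY_TYPE[model_type]
    except KeyError:
        raise ValueError(f"Unsupported model_type: {model_type!r}")
```

For example, `resolve_model_class("opt")` yields `"OPTForCausalLM"`, while an unrecognized type such as `"gpt2"` raises ValueError.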

Usage Example

import torch

# Initialize OPT-175B with CPU offload and 4-bit quantization
with torch.no_grad():
    model = get_ds_model(
        model_name="facebook/opt-175b",
        cpu_offload=True,
        disk_offload=False,
        offload_dir="~/offload_dir",
        dummy_weights=None,       # use real weights
        bits=4,                   # 4-bit quantization
        group_size=64,
    )

# model is now a DeepSpeed-wrapped OPTForCausalLM in eval mode
# Parameters are partitioned across GPUs with CPU offloading
# Weights are quantized to INT4 with group size 64
