
Implementation:Microsoft DeepSpeedExamples Launch Scripts ZeRO Inference

From Leeroopedia


Overview

Concrete tool for launching ZeRO-Inference with preconfigured model-specific settings via shell scripts.

Description

The ZeRO-Inference launch scripts are a collection of shell scripts that invoke deepspeed --num_gpus N run_model.py with model-specific arguments. Each script is tailored to a specific model and hardware combination (an NVIDIA RTX A6000 with 48 GB of GDDR6 memory), configuring batch size, sequence lengths, offload strategy, quantization bits, and benchmark iterations.

The scripts follow a common pattern:

  1. Environment setup: Disable TensorFlow (export USE_TF=0), set log directories, and create offload directories.
  2. Model identification: Set MODEL_NAME and FULL_MODEL_NAME (HuggingFace model identifier).
  3. Multi-configuration sweep: Launch multiple DeepSpeed runs with varying batch sizes, offload strategies, and quantization settings to benchmark throughput across configurations.
  4. Output capture: Redirect stdout/stderr to log files organized by model name and batch size.
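This four-step pattern can be sketched as a minimal skeleton. The model choice, paths, and batch-size sweep below are illustrative, and the DeepSpeed command is echoed into the log file rather than launched:

```shell
#!/usr/bin/env bash
# Skeleton of the launch-script pattern. Dry run: the DeepSpeed command
# is printed with `echo` instead of being executed. Paths are illustrative.

# 1. Environment setup
export USE_TF=0                                   # skip TensorFlow import
BASE_LOG_DIR=${BASE_LOG_DIR:-/tmp/zero_inference}
OFFLOAD_DIR=${OFFLOAD_DIR:-/tmp/zero_offload}
mkdir -p "$OFFLOAD_DIR"

# 2. Model identification
MODEL_NAME="opt-30b"
FULL_MODEL_NAME="facebook/${MODEL_NAME}"

# 3. Multi-configuration sweep (here: batch sizes only)
for BSZ in 8 16; do
    LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
    mkdir -p "$LOG_DIR"
    # 4. Output capture: one log file per configuration
    echo deepspeed --num_gpus 1 run_model.py --model "${FULL_MODEL_NAME}" \
        --batch-size "${BSZ}" --cpu-offload --gen-len 32 \
        > "$LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu.txt"
done
```

Replacing `echo deepspeed` with `deepspeed` turns the sketch into an actual sweep.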

The available launch scripts cover the following models:

Script                   Model                      Parameters  Offload Strategies
run_bloom175b_a6000.sh   bigscience/bloom           176B        NVMe, CPU, CPU+quantized, CPU+KV, CPU+quantized+KV
run_llama2_70b_a6000.sh  meta-llama/Llama-2-70b-hf  70B         CPU, CPU+pinned, CPU+quantized, CPU+KV, CPU+quantized+KV
run_opt175b_a6000.sh     facebook/opt-175b          175B        NVMe, CPU+quantized, NVMe+KV, CPU+quantized+KV
run_opt66b_a6000.sh      facebook/opt-66b           66B         CPU+pinned, CPU+quantized, CPU+KV, CPU+quantized+KV
run_opt30b_a6000.sh      facebook/opt-30b           30B         CPU+pinned, CPU+quantized, CPU+KV, CPU+quantized+KV
run_opt1p3b_a6000.sh     facebook/opt-1.3b          1.3B        CPU+KV (development/testing)
run_model.sh             Configurable (default: facebook/opt-6.7b)  Any  Configurable via variables

Code Reference

Source

Repository Files
DeepSpeedExamples inference/huggingface/zero_inference/run_bloom175b_a6000.sh
DeepSpeedExamples inference/huggingface/zero_inference/run_llama2_70b_a6000.sh
DeepSpeedExamples inference/huggingface/zero_inference/run_opt175b_a6000.sh
DeepSpeedExamples inference/huggingface/zero_inference/run_opt66b_a6000.sh
DeepSpeedExamples inference/huggingface/zero_inference/run_opt30b_a6000.sh
DeepSpeedExamples inference/huggingface/zero_inference/run_opt1p3b_a6000.sh
DeepSpeedExamples inference/huggingface/zero_inference/run_model.sh

Usage

Running a Preconfigured Benchmark

To run the BLOOM-176B benchmark suite on a single A6000 GPU:

cd inference/huggingface/zero_inference/
bash run_bloom175b_a6000.sh

This executes four configurations in sequence:

  1. NVMe offload with batch size 8
  2. CPU offload with 4-bit quantization and batch size 4
  3. NVMe offload with KV cache offloading and batch size 32
  4. CPU offload with 4-bit quantization, KV cache offloading, and batch size 24

Running the Configurable Script

The run_model.sh script provides a variable-driven interface:

# Edit configuration variables at the top of run_model.sh
MODEL_NAME=facebook/opt-6.7b
BATCHSIZE=80
PROMPT_LEN=512
GEN_LEN=32
USE_CPU_OFFLOAD=1
USE_KV_OFFLOAD=1
USE_QUANT=0

# Then run
bash run_model.sh
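Inside run_model.sh these variables are presumably translated into run_model.py flags. A hedged sketch of that translation follows; the actual flag-building logic in the upstream script may differ:

```shell
# Sketch of how the configuration variables might map to CLI flags.
# The real run_model.sh may build its command line differently.
MODEL_NAME=facebook/opt-6.7b
BATCHSIZE=80
PROMPT_LEN=512
GEN_LEN=32
USE_CPU_OFFLOAD=1
USE_KV_OFFLOAD=1
USE_QUANT=0

FLAGS="--model ${MODEL_NAME} --batch-size ${BATCHSIZE}"
FLAGS="${FLAGS} --prompt-len ${PROMPT_LEN} --gen-len ${GEN_LEN}"
if [ "$USE_CPU_OFFLOAD" = "1" ]; then FLAGS="${FLAGS} --cpu-offload"; fi
if [ "$USE_KV_OFFLOAD" = "1" ];  then FLAGS="${FLAGS} --kv-offload"; fi
if [ "$USE_QUANT" = "1" ];       then FLAGS="${FLAGS} --quant_bits 4"; fi

echo deepspeed --num_gpus 1 run_model.py ${FLAGS}   # dry run: print, don't launch
```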

Direct DeepSpeed Invocation

For ad-hoc runs, invoke DeepSpeed directly:

# OPT-13B with CPU offload, KV offload, and 4-bit quantization
deepspeed --num_gpus 1 run_model.py \
    --model facebook/opt-13b \
    --batch-size 16 \
    --prompt-len 512 \
    --gen-len 32 \
    --cpu-offload \
    --kv-offload \
    --quant_bits 4

# BLOOM-176B with NVMe offload and dummy weights for benchmarking
deepspeed --num_gpus 1 run_model.py \
    --dummy \
    --model bigscience/bloom \
    --batch-size 8 \
    --disk-offload \
    --gen-len 32 \
    --pin-memory 0 \
    --offload-dir /local_nvme/zero_offload

I/O Contract

Inputs

Input                        Type             Description
Shell environment variables  Environment      USE_TF=0 disables the TensorFlow import; standard CUDA/NCCL variables apply
Model weights                Files / network  HuggingFace model hub download or local path (skipped with --dummy)
NVMe offload directory       Directory        Writable path for parameter offloading (e.g., /local_nvme/zero_offload)

Outputs

Output                  Type        Description
Benchmark log files     .txt files  Captured stdout/stderr per configuration in ~/experiments/zero_inference/
Benchmark metric files  .log files  Structured metrics written by write_benchmark_log() (model size, latency, throughput)
Console output          Text        Summary of costs, prefill timings, and generated text (at verbose >= 2)

Example: BLOOM-176B Launch Script

The following shows the complete content of run_bloom175b_a6000.sh:

export USE_TF=0
BASE_LOG_DIR=~/experiments/zero_inference/
MODEL_NAME="bloom"
FULL_MODEL_NAME="bigscience/${MODEL_NAME}"

OFFLOAD_DIR=/local_nvme/zero_offload
mkdir -p $OFFLOAD_DIR

QB=4

# Config 1: NVMe offload, batch size 8
BSZ=8
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
    --batch-size ${BSZ} --disk-offload --gen-len 32 --pin-memory 0 \
    --offload-dir ${OFFLOAD_DIR} \
    &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_disk.txt

# Config 2: CPU offload with 4-bit quantization, batch size 4
BSZ=4
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
    --batch-size ${BSZ} --cpu-offload --gen-len 32 --pin-memory 0 \
    --offload-dir ${OFFLOAD_DIR} --quant_bits ${QB} \
    &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu_q${QB}.txt

# Config 3: NVMe offload with KV cache offloading, batch size 32
BSZ=32
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
    --batch-size ${BSZ} --disk-offload --gen-len 32 --pin-memory 0 \
    --offload-dir ${OFFLOAD_DIR} --kv-offload \
    &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_disk_kv.txt

# Config 4: CPU offload with quantization + KV offload, batch size 24
BSZ=24
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
    --batch-size ${BSZ} --cpu-offload --gen-len 32 --pin-memory 0 \
    --offload-dir ${OFFLOAD_DIR} --quant_bits ${QB} --kv-offload \
    &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu_q${QB}_kv.txt
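The four blocks above differ only in batch size, log tag, and offload flags, so the same script could be factored into a single helper. A sketch, not the upstream script; the command is echoed into the log file rather than launched:

```shell
# Refactored sketch of run_bloom175b_a6000.sh: one helper, four calls.
# Dry run: `echo` writes the would-be command into the log file.
BASE_LOG_DIR=${BASE_LOG_DIR:-/tmp/zero_inference}
MODEL_NAME="bloom"
FULL_MODEL_NAME="bigscience/${MODEL_NAME}"
OFFLOAD_DIR=${OFFLOAD_DIR:-/tmp/zero_offload}
QB=4

run_config() {   # run_config <batch size> <log tag> [extra flags...]
    local BSZ=$1 TAG=$2; shift 2
    local LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
    mkdir -p "$LOG_DIR"
    echo deepspeed --num_gpus 1 run_model.py --dummy --model "${FULL_MODEL_NAME}" \
        --batch-size "${BSZ}" --gen-len 32 --pin-memory 0 \
        --offload-dir "${OFFLOAD_DIR}" "$@" \
        > "$LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_${TAG}.txt"
}

run_config 8  disk          --disk-offload
run_config 4  cpu_q${QB}    --cpu-offload --quant_bits ${QB}
run_config 32 disk_kv       --disk-offload --kv-offload
run_config 24 cpu_q${QB}_kv --cpu-offload --quant_bits ${QB} --kv-offload
```

Swapping `echo deepspeed` for `deepspeed` (and `>` for `&>`) recovers the original behavior.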

Naming Convention

Log files follow the pattern:

ds_{model}_bs{batch_size}_{offload_type}[_q{bits}][_kv].txt

where:

  • {offload_type} is one of: cpu, cpu_pin, disk, gpu
  • _q{bits} suffix indicates weight quantization with the specified bit width
  • _kv suffix indicates KV cache offloading is enabled
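Matching the file names produced by the BLOOM script above, the convention can be expressed as a small helper (hypothetical; the actual launch scripts inline these names directly):

```shell
# Hypothetical helper that composes a log file name per the convention.
# The launch scripts themselves spell these names out inline.
log_name() {
    local model=$1 bsz=$2 offload=$3 qbits=$4 kv=$5
    local name="ds_${model}_bs${bsz}_${offload}"
    if [ -n "$qbits" ]; then name="${name}_q${qbits}"; fi
    if [ "$kv" = "1" ]; then name="${name}_kv"; fi
    echo "${name}.txt"
}

log_name bloom 24 cpu 4 1    # -> ds_bloom_bs24_cpu_q4_kv.txt
log_name bloom 8  disk "" 0  # -> ds_bloom_bs8_disk.txt
```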
