Implementation: Microsoft DeepSpeedExamples ZeRO-Inference Launch Scripts
Overview
Concrete tool for launching ZeRO-Inference with preconfigured model-specific settings via shell scripts.
Description
The ZeRO-Inference launch scripts are a collection of shell scripts that invoke `deepspeed --num_gpus N run_model.py` with model-specific arguments. Each script is tailored to a specific model and hardware combination (an NVIDIA RTX A6000 with 48 GB of GDDR6 memory), configuring batch size, sequence lengths, offload strategy, quantization bits, and benchmark iterations.
The scripts follow a common pattern:
- Environment setup: Disable TensorFlow (`export USE_TF=0`), set log directories, and create offload directories.
- Model identification: Set `MODEL_NAME` and `FULL_MODEL_NAME` (the HuggingFace model identifier).
- Multi-configuration sweep: Launch multiple DeepSpeed runs with varying batch sizes, offload strategies, and quantization settings to benchmark throughput across configurations.
- Output capture: Redirect stdout/stderr to log files organized by model name and batch size.
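A minimal sketch of this four-step pattern (hypothetical, not one of the repo's scripts; the model, batch sizes, and single `--cpu-offload` flag are placeholders, and the `DRY_RUN` guard prints the commands instead of launching DeepSpeed):

```shell
#!/usr/bin/env bash
# Illustrative skeleton of the common launch-script pattern.
# DRY_RUN=1 (the default here) echoes each command so the sketch
# runs without a GPU or a DeepSpeed installation.
DRY_RUN=${DRY_RUN:-1}

# 1. Environment setup
export USE_TF=0
BASE_LOG_DIR=${BASE_LOG_DIR:-$HOME/experiments/zero_inference}

# 2. Model identification
MODEL_NAME="opt-6.7b"
FULL_MODEL_NAME="facebook/${MODEL_NAME}"

# 3. Multi-configuration sweep (here: over batch sizes only)
for BSZ in 8 16; do
  LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
  mkdir -p "$LOG_DIR"
  CMD="deepspeed --num_gpus 1 run_model.py --model ${FULL_MODEL_NAME} --batch-size ${BSZ} --cpu-offload --gen-len 32"
  # 4. Output capture: one log file per configuration
  if [ "$DRY_RUN" = 1 ]; then
    echo "$CMD > $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu.txt"
  else
    $CMD &> "$LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu.txt"
  fi
done
```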
The available launch scripts cover the following models:
| Script | Model | Parameters | Offload Strategies |
|---|---|---|---|
| `run_bloom175b_a6000.sh` | `bigscience/bloom` | 176B | NVMe, CPU+quantized, NVMe+KV, CPU+quantized+KV |
| `run_llama2_70b_a6000.sh` | `meta-llama/Llama-2-70b-hf` | 70B | CPU, CPU+pinned, CPU+quantized, CPU+KV, CPU+quantized+KV |
| `run_opt175b_a6000.sh` | `facebook/opt-175b` | 175B | NVMe, CPU+quantized, NVMe+KV, CPU+quantized+KV |
| `run_opt66b_a6000.sh` | `facebook/opt-66b` | 66B | CPU+pinned, CPU+quantized, CPU+KV, CPU+quantized+KV |
| `run_opt30b_a6000.sh` | `facebook/opt-30b` | 30B | CPU+pinned, CPU+quantized, CPU+KV, CPU+quantized+KV |
| `run_opt1p3b_a6000.sh` | `facebook/opt-6.7b` | 6.7B | CPU+KV (development/testing) |
| `run_model.sh` | Configurable (default: `facebook/opt-6.7b`) | Any | Configurable via variables |
Code Reference
Source
| Repository | Files |
|---|---|
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_bloom175b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_llama2_70b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_opt175b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_opt66b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_opt30b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_opt1p3b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_model.sh` |
Usage
Running a Preconfigured Benchmark
To run the BLOOM-176B benchmark suite on a single A6000 GPU:
```shell
cd inference/huggingface/zero_inference/
bash run_bloom175b_a6000.sh
```
This executes four configurations in sequence:
- NVMe offload with batch size 8
- CPU offload with 4-bit quantization and batch size 4
- NVMe offload with KV cache offloading and batch size 32
- CPU offload with 4-bit quantization, KV cache offloading, and batch size 24
Running the Configurable Script
The run_model.sh script provides a variable-driven interface:
```shell
# Edit configuration variables at the top of run_model.sh
MODEL_NAME=facebook/opt-6.7b
BATCHSIZE=80
PROMPT_LEN=512
GEN_LEN=32
USE_CPU_OFFLOAD=1
USE_KV_OFFLOAD=1
USE_QUANT=0

# Then run
bash run_model.sh
```
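If you would rather not hand-edit the file, the variable block can be patched with `sed` before launching. The snippet below demonstrates this on an inline copy of the variables so it is self-contained; pointing `sed` at the real `run_model.sh` instead is a hypothetical workflow, not something the repo provides:

```shell
# Write an inline copy of run_model.sh's variable block (a stand-in
# for the real file, so this sketch is self-contained).
cat > /tmp/run_model_config.sh <<'EOF'
MODEL_NAME=facebook/opt-6.7b
BATCHSIZE=80
PROMPT_LEN=512
GEN_LEN=32
USE_CPU_OFFLOAD=1
USE_KV_OFFLOAD=1
USE_QUANT=0
EOF

# Raise the batch size and enable quantization in place.
sed -i -e 's/^BATCHSIZE=.*/BATCHSIZE=128/' \
       -e 's/^USE_QUANT=.*/USE_QUANT=1/' /tmp/run_model_config.sh

# Show the patched values.
grep -E '^(BATCHSIZE|USE_QUANT)=' /tmp/run_model_config.sh
```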
Direct DeepSpeed Invocation
For ad-hoc runs, invoke DeepSpeed directly:
```shell
# OPT-13B with CPU offload, KV offload, and 4-bit quantization
deepspeed --num_gpus 1 run_model.py \
  --model facebook/opt-13b \
  --batch-size 16 \
  --prompt-len 512 \
  --gen-len 32 \
  --cpu-offload \
  --kv-offload \
  --quant_bits 4

# BLOOM-176B with NVMe offload and dummy weights for benchmarking
deepspeed --num_gpus 1 run_model.py \
  --dummy \
  --model bigscience/bloom \
  --batch-size 8 \
  --disk-offload \
  --gen-len 32 \
  --pin-memory 0 \
  --offload-dir /local_nvme/zero_offload
```
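The feature toggles map onto the argument list one flag at a time, so an ad-hoc command can be assembled from shell switches much like the preconfigured scripts do. A hypothetical sketch (the command is echoed as a dry run rather than executed, so no GPU is needed):

```shell
# Assemble the run_model.py argument list from shell switches.
USE_CPU_OFFLOAD=1
USE_KV_OFFLOAD=1
QUANT_BITS=4   # 0 disables weight quantization

ARGS="--model facebook/opt-13b --batch-size 16 --prompt-len 512 --gen-len 32"
[ "$USE_CPU_OFFLOAD" = 1 ] && ARGS="$ARGS --cpu-offload"
[ "$USE_KV_OFFLOAD" = 1 ] && ARGS="$ARGS --kv-offload"
[ "$QUANT_BITS" -gt 0 ] && ARGS="$ARGS --quant_bits $QUANT_BITS"

# Dry run: print the full command instead of launching it.
echo deepspeed --num_gpus 1 run_model.py $ARGS
```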
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| Shell environment variables | Environment | `USE_TF=0` disables TensorFlow import; standard CUDA/NCCL variables |
| Model weights | Files / Network | HuggingFace model hub download or local path (skipped with `--dummy`) |
| NVMe offload directory | Directory | Writable path for parameter offloading (e.g., `/local_nvme/zero_offload`) |
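For NVMe runs, a quick pre-flight check on the offload directory can save a failed multi-hour benchmark. A hypothetical check, assuming GNU `df`; the `/tmp` default and 100 GB threshold are placeholders (the A6000 scripts use `/local_nvme/zero_offload`):

```shell
# Verify the offload directory exists, is writable, and has headroom.
OFFLOAD_DIR=${OFFLOAD_DIR:-/tmp/zero_offload}   # placeholder path
MIN_FREE_GB=100                                 # arbitrary threshold

mkdir -p "$OFFLOAD_DIR"
[ -w "$OFFLOAD_DIR" ] || { echo "offload dir not writable"; exit 1; }

# GNU df: available space (GB) on the filesystem holding the directory.
free_gb=$(df -BG --output=avail "$OFFLOAD_DIR" | tail -1 | tr -dc '0-9')
if [ "${free_gb:-0}" -lt "$MIN_FREE_GB" ]; then
  echo "warning: only ${free_gb:-0} GB free in $OFFLOAD_DIR"
fi
```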
Outputs
| Output | Type | Description |
|---|---|---|
| Benchmark log files | `.txt` files | Captured stdout/stderr per configuration in `~/experiments/zero_inference/` |
| Benchmark metric files | `.log` files | Structured metrics written by `write_benchmark_log()` (model size, latency, throughput) |
| Console output | Text | Summary of costs, prefill timings, and generated text (at `verbose >= 2`) |
Example: BLOOM-176B Launch Script
The following shows the complete content of run_bloom175b_a6000.sh:
```shell
export USE_TF=0
BASE_LOG_DIR=~/experiments/zero_inference/
MODEL_NAME="bloom"
FULL_MODEL_NAME="bigscience/${MODEL_NAME}"
OFFLOAD_DIR=/local_nvme/zero_offload
mkdir -p $OFFLOAD_DIR
QB=4

# Config 1: NVMe offload, batch size 8
BSZ=8
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
  --batch-size ${BSZ} --disk-offload --gen-len 32 --pin-memory 0 \
  --offload-dir ${OFFLOAD_DIR} \
  &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_disk.txt

# Config 2: CPU offload with 4-bit quantization, batch size 4
BSZ=4
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
  --batch-size ${BSZ} --cpu-offload --gen-len 32 --pin-memory 0 \
  --offload-dir ${OFFLOAD_DIR} --quant_bits ${QB} \
  &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu_q${QB}.txt

# Config 3: NVMe offload with KV cache offloading, batch size 32
BSZ=32
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
  --batch-size ${BSZ} --disk-offload --gen-len 32 --pin-memory 0 \
  --offload-dir ${OFFLOAD_DIR} --kv-offload \
  &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_disk_kv.txt

# Config 4: CPU offload with quantization + KV offload, batch size 24
BSZ=24
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
  --batch-size ${BSZ} --cpu-offload --gen-len 32 --pin-memory 0 \
  --offload-dir ${OFFLOAD_DIR} --quant_bits ${QB} --kv-offload \
  &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu_q${QB}_kv.txt
```
Naming Convention
Log files follow the pattern:

`ds_{model}_bs{batch_size}_{offload_type}[_q{bits}][_kv].txt`

where:
- `{offload_type}` is one of `cpu`, `cpu_pin`, `disk`, or `gpu`
- the `_q{bits}` suffix indicates weight quantization with the specified bit width
- the `_kv` suffix indicates KV cache offloading is enabled
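As a sanity check, the convention can be expressed as a small helper function (hypothetical, for illustration only; the scripts themselves just interpolate the variables inline):

```shell
# Build a log file name from model, batch size, offload type, and the
# optional quantization-bits / KV-offload markers.
# Usage: log_name MODEL BATCH_SIZE OFFLOAD_TYPE [QUANT_BITS] [kv]
log_name() {
  local name="ds_${1}_bs${2}_${3}"
  [ -n "$4" ] && [ "$4" != 0 ] && name="${name}_q${4}"
  [ "$5" = kv ] && name="${name}_kv"
  echo "${name}.txt"
}

log_name bloom 8 disk        # -> ds_bloom_bs8_disk.txt
log_name bloom 24 cpu 4 kv   # -> ds_bloom_bs24_cpu_q4_kv.txt
```

Both example outputs match the log names produced by `run_bloom175b_a6000.sh` above.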