Implementation: Microsoft DeepSpeedExamples ZeRO-Inference Launch Scripts
Overview
Concrete tool for launching ZeRO-Inference with preconfigured model-specific settings via shell scripts.
Description
The ZeRO-Inference launch scripts are a collection of shell scripts that invoke `deepspeed --num_gpus N run_model.py` with model-specific arguments. Each script is tailored to a specific model and hardware combination (an NVIDIA RTX A6000 with 48 GB of GDDR6 memory), configuring batch size, sequence lengths, offload strategy, quantization bits, and benchmark iterations.
The scripts follow a common pattern:
- Environment setup: Disable TensorFlow (`export USE_TF=0`), set log directories, and create offload directories.
- Model identification: Set `MODEL_NAME` and `FULL_MODEL_NAME` (the HuggingFace model identifier).
- Multi-configuration sweep: Launch multiple DeepSpeed runs with varying batch sizes, offload strategies, and quantization settings to benchmark throughput across configurations.
- Output capture: Redirect stdout/stderr to log files organized by model name and batch size.
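A minimal sketch of this four-step pattern (hypothetical, not one of the repo's scripts; the model, batch sizes, and single `--cpu-offload` flag are placeholders, and the `DRY_RUN` guard prints the commands instead of launching DeepSpeed):

```shell
#!/usr/bin/env bash
# Illustrative skeleton of the common launch-script pattern.
# DRY_RUN=1 (the default here) echoes each command so the sketch
# runs without a GPU or a DeepSpeed installation.
DRY_RUN=${DRY_RUN:-1}

# 1. Environment setup
export USE_TF=0
BASE_LOG_DIR=${BASE_LOG_DIR:-$HOME/experiments/zero_inference}

# 2. Model identification
MODEL_NAME="opt-6.7b"
FULL_MODEL_NAME="facebook/${MODEL_NAME}"

# 3. Multi-configuration sweep (here: over batch sizes only)
for BSZ in 8 16; do
  LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
  mkdir -p "$LOG_DIR"
  CMD="deepspeed --num_gpus 1 run_model.py --model ${FULL_MODEL_NAME} --batch-size ${BSZ} --cpu-offload --gen-len 32"
  # 4. Output capture: one log file per configuration
  if [ "$DRY_RUN" = 1 ]; then
    echo "$CMD > $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu.txt"
  else
    $CMD &> "$LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu.txt"
  fi
done
```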
The available launch scripts cover the following models:
| Script | Model | Parameters | Offload Strategies |
|---|---|---|---|
| `run_bloom175b_a6000.sh` | `bigscience/bloom` | 176B | NVMe, CPU+quantized, NVMe+KV, CPU+quantized+KV |
| `run_llama2_70b_a6000.sh` | `meta-llama/Llama-2-70b-hf` | 70B | CPU, CPU+pinned, CPU+quantized, CPU+KV, CPU+quantized+KV |
| `run_opt175b_a6000.sh` | `facebook/opt-175b` | 175B | NVMe, CPU+quantized, NVMe+KV, CPU+quantized+KV |
| `run_opt66b_a6000.sh` | `facebook/opt-66b` | 66B | CPU+pinned, CPU+quantized, CPU+KV, CPU+quantized+KV |
| `run_opt30b_a6000.sh` | `facebook/opt-30b` | 30B | CPU+pinned, CPU+quantized, CPU+KV, CPU+quantized+KV |
| `run_opt1p3b_a6000.sh` | `facebook/opt-6.7b` | 6.7B | CPU+KV (development/testing) |
| `run_model.sh` | Configurable (default: `facebook/opt-6.7b`) | Any | Configurable via variables |
Code Reference
Source
| Repository | Files |
|---|---|
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_bloom175b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_llama2_70b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_opt175b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_opt66b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_opt30b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_opt1p3b_a6000.sh` |
| DeepSpeedExamples | `inference/huggingface/zero_inference/run_model.sh` |
Usage
Running a Preconfigured Benchmark
To run the BLOOM-176B benchmark suite on a single A6000 GPU:
```shell
cd inference/huggingface/zero_inference/
bash run_bloom175b_a6000.sh
```
This executes four configurations in sequence:
- NVMe offload with batch size 8
- CPU offload with 4-bit quantization and batch size 4
- NVMe offload with KV cache offloading and batch size 32
- CPU offload with 4-bit quantization, KV cache offloading, and batch size 24
Running the Configurable Script
The run_model.sh script provides a variable-driven interface:
```shell
# Edit configuration variables at the top of run_model.sh
MODEL_NAME=facebook/opt-6.7b
BATCHSIZE=80
PROMPT_LEN=512
GEN_LEN=32
USE_CPU_OFFLOAD=1
USE_KV_OFFLOAD=1
USE_QUANT=0

# Then run
bash run_model.sh
```
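If you would rather not hand-edit the file, the variable block can be patched with `sed` before launching. The snippet below demonstrates this on an inline copy of the variables so it is self-contained; pointing `sed` at the real `run_model.sh` instead is a hypothetical workflow, not something the repo provides:

```shell
# Write an inline copy of run_model.sh's variable block (a stand-in
# for the real file, so this sketch is self-contained).
cat > /tmp/run_model_config.sh <<'EOF'
MODEL_NAME=facebook/opt-6.7b
BATCHSIZE=80
PROMPT_LEN=512
GEN_LEN=32
USE_CPU_OFFLOAD=1
USE_KV_OFFLOAD=1
USE_QUANT=0
EOF

# Raise the batch size and enable quantization in place.
sed -i -e 's/^BATCHSIZE=.*/BATCHSIZE=128/' \
       -e 's/^USE_QUANT=.*/USE_QUANT=1/' /tmp/run_model_config.sh

# Show the patched values.
grep -E '^(BATCHSIZE|USE_QUANT)=' /tmp/run_model_config.sh
```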
Direct DeepSpeed Invocation
For ad-hoc runs, invoke DeepSpeed directly:
```shell
# OPT-13B with CPU offload, KV offload, and 4-bit quantization
deepspeed --num_gpus 1 run_model.py \
  --model facebook/opt-13b \
  --batch-size 16 \
  --prompt-len 512 \
  --gen-len 32 \
  --cpu-offload \
  --kv-offload \
  --quant_bits 4

# BLOOM-176B with NVMe offload and dummy weights for benchmarking
deepspeed --num_gpus 1 run_model.py \
  --dummy \
  --model bigscience/bloom \
  --batch-size 8 \
  --disk-offload \
  --gen-len 32 \
  --pin-memory 0 \
  --offload-dir /local_nvme/zero_offload
```
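The feature toggles map onto the argument list one flag at a time, so an ad-hoc command can be assembled from shell switches much like the preconfigured scripts do. A hypothetical sketch (the command is echoed as a dry run rather than executed, so no GPU is needed):

```shell
# Assemble the run_model.py argument list from shell switches.
USE_CPU_OFFLOAD=1
USE_KV_OFFLOAD=1
QUANT_BITS=4   # 0 disables weight quantization

ARGS="--model facebook/opt-13b --batch-size 16 --prompt-len 512 --gen-len 32"
[ "$USE_CPU_OFFLOAD" = 1 ] && ARGS="$ARGS --cpu-offload"
[ "$USE_KV_OFFLOAD" = 1 ] && ARGS="$ARGS --kv-offload"
[ "$QUANT_BITS" -gt 0 ] && ARGS="$ARGS --quant_bits $QUANT_BITS"

# Dry run: print the full command instead of launching it.
echo deepspeed --num_gpus 1 run_model.py $ARGS
```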
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| Shell environment variables | Environment | `USE_TF=0` disables TensorFlow import; standard CUDA/NCCL variables |
| Model weights | Files / Network | HuggingFace model hub download or local path (skipped with `--dummy`) |
| NVMe offload directory | Directory | Writable path for parameter offloading (e.g., `/local_nvme/zero_offload`) |
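For NVMe runs, a quick pre-flight check on the offload directory can save a failed multi-hour benchmark. A hypothetical check, assuming GNU `df`; the `/tmp` default and 100 GB threshold are placeholders (the A6000 scripts use `/local_nvme/zero_offload`):

```shell
# Verify the offload directory exists, is writable, and has headroom.
OFFLOAD_DIR=${OFFLOAD_DIR:-/tmp/zero_offload}   # placeholder path
MIN_FREE_GB=100                                 # arbitrary threshold

mkdir -p "$OFFLOAD_DIR"
[ -w "$OFFLOAD_DIR" ] || { echo "offload dir not writable"; exit 1; }

# GNU df: available space (GB) on the filesystem holding the directory.
free_gb=$(df -BG --output=avail "$OFFLOAD_DIR" | tail -1 | tr -dc '0-9')
if [ "${free_gb:-0}" -lt "$MIN_FREE_GB" ]; then
  echo "warning: only ${free_gb:-0} GB free in $OFFLOAD_DIR"
fi
```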
Outputs
| Output | Type | Description |
|---|---|---|
| Benchmark log files | `.txt` files | Captured stdout/stderr per configuration in `~/experiments/zero_inference/` |
| Benchmark metric files | `.log` files | Structured metrics written by `write_benchmark_log()` (model size, latency, throughput) |
| Console output | Text | Summary of costs, prefill timings, and generated text (at `verbose >= 2`) |
Example: BLOOM-176B Launch Script
The following shows the complete content of run_bloom175b_a6000.sh:
```shell
export USE_TF=0
BASE_LOG_DIR=~/experiments/zero_inference/
MODEL_NAME="bloom"
FULL_MODEL_NAME="bigscience/${MODEL_NAME}"
OFFLOAD_DIR=/local_nvme/zero_offload
mkdir -p $OFFLOAD_DIR
QB=4

# Config 1: NVMe offload, batch size 8
BSZ=8
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
  --batch-size ${BSZ} --disk-offload --gen-len 32 --pin-memory 0 \
  --offload-dir ${OFFLOAD_DIR} \
  &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_disk.txt

# Config 2: CPU offload with 4-bit quantization, batch size 4
BSZ=4
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
  --batch-size ${BSZ} --cpu-offload --gen-len 32 --pin-memory 0 \
  --offload-dir ${OFFLOAD_DIR} --quant_bits ${QB} \
  &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu_q${QB}.txt

# Config 3: NVMe offload with KV cache offloading, batch size 32
BSZ=32
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
  --batch-size ${BSZ} --disk-offload --gen-len 32 --pin-memory 0 \
  --offload-dir ${OFFLOAD_DIR} --kv-offload \
  &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_disk_kv.txt

# Config 4: CPU offload with quantization + KV offload, batch size 24
BSZ=24
LOG_DIR=$BASE_LOG_DIR/${MODEL_NAME}_bs${BSZ}
mkdir -p $LOG_DIR
deepspeed --num_gpus 1 run_model.py --dummy --model ${FULL_MODEL_NAME} \
  --batch-size ${BSZ} --cpu-offload --gen-len 32 --pin-memory 0 \
  --offload-dir ${OFFLOAD_DIR} --quant_bits ${QB} --kv-offload \
  &> $LOG_DIR/ds_${MODEL_NAME}_bs${BSZ}_cpu_q${QB}_kv.txt
```
Naming Convention
Log files follow the pattern:

`ds_{model}_bs{batch_size}_{offload_type}[_q{bits}][_kv].txt`

where:
- `{offload_type}` is one of `cpu`, `cpu_pin`, `disk`, or `gpu`
- the `_q{bits}` suffix indicates weight quantization with the specified bit width
- the `_kv` suffix indicates KV cache offloading is enabled
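As a sanity check, the convention can be expressed as a small helper function (hypothetical, for illustration only; the scripts themselves just interpolate the variables inline):

```shell
# Build a log file name from model, batch size, offload type, and the
# optional quantization-bits / KV-offload markers.
# Usage: log_name MODEL BATCH_SIZE OFFLOAD_TYPE [QUANT_BITS] [kv]
log_name() {
  local name="ds_${1}_bs${2}_${3}"
  [ -n "$4" ] && [ "$4" != 0 ] && name="${name}_q${4}"
  [ "$5" = kv ] && name="${name}_kv"
  echo "${name}.txt"
}

log_name bloom 8 disk        # -> ds_bloom_bs8_disk.txt
log_name bloom 24 cpu 4 kv   # -> ds_bloom_bs24_cpu_q4_kv.txt
```

Both example outputs match the log names produced by `run_bloom175b_a6000.sh` above.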