Implementation:Microsoft DeepSpeedExamples Text Generation Test
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text Generation |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
Adapted HuggingFace text generation script supporting conditional auto-regressive generation with optional DeepSpeed inference integration and latency benchmarking.
Description
test-run-generation.py is a command-line script for conditional text generation using multiple auto-regressive language models from the HuggingFace Transformers library. It supports GPT-2, GPT-Neo, CTRL, OpenAI-GPT, XLNet, Transformer-XL, and XLM model families through a unified MODEL_CLASSES registry that maps model type strings to their corresponding model and tokenizer classes.
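The registry idea can be sketched as a plain dictionary lookup. This is an illustration rather than a copy of the script: it stores class names as strings and resolves them lazily so the lookup works without Transformers installed. The names `MODEL_CLASS_NAMES`, `lookup_class_names`, and `resolve_classes` are hypothetical, only three of the seven families are listed, and the pairing of each model type with these particular Transformers classes is an assumption.

```python
import importlib

# Hypothetical name-based registry: model type -> (model class name, tokenizer class name).
# Only a subset of the supported families is shown; the pairings are assumptions.
MODEL_CLASS_NAMES = {
    "gpt2": ("GPT2LMHeadModel", "GPT2Tokenizer"),
    "gptneo": ("GPTNeoForCausalLM", "GPT2Tokenizer"),
    "xlnet": ("XLNetLMHeadModel", "XLNetTokenizer"),
}


def lookup_class_names(model_type):
    """Return the (model, tokenizer) class names registered for a model type string."""
    try:
        return MODEL_CLASS_NAMES[model_type.lower()]
    except KeyError:
        known = ", ".join(sorted(MODEL_CLASS_NAMES))
        raise ValueError(
            f"unknown model type {model_type!r}; expected one of: {known}"
        ) from None


def resolve_classes(model_type):
    """Import transformers lazily and return the actual (model, tokenizer) classes."""
    model_name, tokenizer_name = lookup_class_names(model_type)
    transformers = importlib.import_module("transformers")
    return getattr(transformers, model_name), getattr(transformers, tokenizer_name)
```

With Transformers installed, `resolve_classes("gpt2")` would return the two classes, each of which can then be loaded with `from_pretrained(args.model_name_or_path)`.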
The --ds-inference flag enables DeepSpeed inference: the script wraps the model with deepspeed.init_inference(), passing an injection policy for GPT-2 transformer layers and, optionally, replacing supported modules with DeepSpeed kernels. This allows inference with custom CUDA kernels and tensor parallelism. The --fp16 flag runs inference in half precision.
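A minimal sketch of such a wrapper follows, assuming a single GPU and a model that is already loaded. The helper names `select_dtype_name` and `wrap_for_inference` are hypothetical, the injection policy mentioned above is left out for brevity, and `mp_size` is the older spelling of the tensor-parallel degree, which newer DeepSpeed releases may flag as deprecated.

```python
def select_dtype_name(fp16):
    """Name of the torch dtype to request: half precision when --fp16 is set."""
    return "float16" if fp16 else "float32"


def wrap_for_inference(model, fp16=False, mp_size=1):
    """Hypothetical helper: wrap a loaded Hugging Face model with DeepSpeed inference."""
    # Imported lazily so this module can be loaded without torch or deepspeed installed.
    import torch
    import deepspeed

    return deepspeed.init_inference(
        model,
        mp_size=mp_size,  # tensor-parallel degree; 1 means no model parallelism
        dtype=getattr(torch, select_dtype_name(fp16)),
        replace_with_kernel_inject=True,  # swap supported layers for DeepSpeed kernels
    )
```

Keeping the dtype choice in its own function makes the effect of --fp16 easy to check without a GPU.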
Latency benchmarking is built in through the print_latency() function, which collects per-token generation latencies across all prompts, discards the first 10 as warmup, and reports the average together with the P50, P90, P95, P99, and P999 latencies. Prompts come from --prompt (or from interactive input when it is omitted) or, in batch mode, from a file passed with --sample_input.
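The statistics described above can be sketched in a few lines. This is not the script's print_latency() itself: the function name `latency_stats` is hypothetical, it returns values instead of printing them, and it uses a nearest-rank percentile rule, which is an assumption and may index slightly differently from the original.

```python
WARMUP = 10  # the first 10 measurements are treated as warmup and dropped


def latency_stats(latencies_s, warmup=WARMUP):
    """Return average and percentile latencies in milliseconds, or None if no samples remain."""
    samples = sorted(latencies_s[warmup:])
    if not samples:
        return None
    n = len(samples)

    def percentile(permille):
        # Nearest-rank rule in integer arithmetic: rank = ceil(permille * n / 1000).
        rank = max(1, (permille * n + 999) // 1000)
        return samples[rank - 1] * 1000.0

    return {
        "avg": sum(samples) / n * 1000.0,
        "p50": percentile(500),
        "p90": percentile(900),
        "p95": percentile(950),
        "p99": percentile(990),
        "p999": percentile(999),
    }
```

Passing latencies in seconds, as returned by differences of `time.time()`, yields millisecond values. With fewer than 11 measurements the function returns None, because everything is consumed by warmup.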
Usage
Use this script to benchmark and test text generation with various HuggingFace language models, optionally accelerated with DeepSpeed inference. It is particularly useful for comparing baseline versus DeepSpeed-accelerated inference latency.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: inference/huggingface/text-generation/run-generation-script/test-run-generation.py
- Lines: 1-350
Signature
def main():
...
def set_seed(args):
...
def adjust_length_to_model(length, max_sequence_length):
...
def print_latency(latency_set, title=""):
...
def prepare_ctrl_input(args, _, tokenizer, prompt_text):
...
def prepare_xlm_input(args, model, tokenizer, prompt_text):
...
def prepare_xlnet_input(args, _, tokenizer, prompt_text):
...
def prepare_transfoxl_input(args, _, tokenizer, prompt_text):
...
Import
# This is a standalone script, run directly:
# python test-run-generation.py --model_type gpt2 --model_name_or_path gpt2
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model_type | str | Yes | Model architecture type: gpt2, gptneo, ctrl, openai-gpt, xlnet, transfo-xl, xlm |
| --model_name_or_path | str | Yes | Path to pretrained model or HuggingFace model name |
| --prompt | str | No | Text prompt for generation (interactive input if omitted) |
| --sample_input | str | No | Path to file containing multiple prompts (one per line) |
| --length | int | No | Maximum generation length (default: 20) |
| --temperature | float | No | Sampling temperature (default: 1.0) |
| --k | int | No | Top-k filtering value (default: 0) |
| --p | float | No | Top-p (nucleus) filtering value (default: 0.9) |
| --ds-inference | flag | No | Enable DeepSpeed inference optimization |
| --fp16 | flag | No | Enable FP16 half-precision inference |
| --seed | int | No | Random seed for reproducibility (default: 42) |
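The table above maps naturally onto an argparse parser. The sketch below reproduces only the documented flags and defaults. The `choices` restriction on --model_type, the empty-string default for --prompt, and the function name `build_parser` are assumptions, not details confirmed by the script.

```python
import argparse

MODEL_TYPES = ["gpt2", "gptneo", "ctrl", "openai-gpt", "xlnet", "transfo-xl", "xlm"]


def build_parser():
    """Build a parser for the flags listed in the Inputs table."""
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI")
    parser.add_argument("--model_type", type=str, required=True, choices=MODEL_TYPES)
    parser.add_argument("--model_name_or_path", type=str, required=True)
    parser.add_argument("--prompt", type=str, default="")  # empty means ask interactively
    parser.add_argument("--sample_input", type=str, default=None)
    parser.add_argument("--length", type=int, default=20)
    parser.add_argument("--temperature", type=float, default=1.0)
    parser.add_argument("--k", type=int, default=0)
    parser.add_argument("--p", type=float, default=0.9)
    parser.add_argument("--ds-inference", action="store_true")  # stored as args.ds_inference
    parser.add_argument("--fp16", action="store_true")
    parser.add_argument("--seed", type=int, default=42)
    return parser


args = build_parser().parse_args(
    ["--model_type", "gpt2", "--model_name_or_path", "gpt2", "--ds-inference"]
)
print(args.length, args.p, args.ds_inference)  # prints: 20 0.9 True
```

argparse converts the hyphen in --ds-inference to an underscore, so the flag is read as `args.ds_inference`.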
Outputs
| Name | Type | Description |
|---|---|---|
| generated_sequences | List[str] | List of generated text sequences printed to stdout |
| latency_stats | stdout | Percentile latency statistics (avg, P50, P90, P95, P99, P999) |
Usage Examples
GPT-2 Generation with DeepSpeed
# Run GPT-2 generation with DeepSpeed inference and FP16
python test-run-generation.py \
--model_type gpt2 \
--model_name_or_path gpt2-large \
--prompt "The future of artificial intelligence" \
--length 100 \
--fp16 \
--ds-inference
# Batch generation from file
python test-run-generation.py \
--model_type gpt2 \
--model_name_or_path gpt2 \
--sample_input prompts.txt \
--length 50
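The batch example assumes that prompts.txt holds one prompt per line. A minimal sketch of how the two input modes could be resolved into a list of prompts is shown below. The function name `load_prompts`, the skipping of blank lines, and the wording of the interactive prompt are assumptions.

```python
from pathlib import Path


def load_prompts(prompt="", sample_input=None):
    """Return the prompts to run: file lines in batch mode, otherwise a single prompt."""
    if sample_input:
        lines = Path(sample_input).read_text(encoding="utf-8").splitlines()
        # One prompt per line; blank lines are skipped (an assumption).
        return [line.strip() for line in lines if line.strip()]
    # Single-prompt mode: fall back to interactive input when no prompt was given.
    return [prompt if prompt else input("Model prompt >>> ")]
```

When benchmarking, the file should contain well over 10 prompts, since the first 10 measurements are discarded as warmup.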