Implementation:Microsoft DeepSpeedExamples Text Generation Test

Knowledge Sources	Microsoft_DeepSpeedExamples
Domains	NLP, Text Generation
Last Updated	2026-02-07 12:00 GMT

Overview

An adaptation of the HuggingFace text generation script that performs conditional auto-regressive generation, with optional DeepSpeed inference integration and built-in latency benchmarking.

Description

test-run-generation.py is a command-line script for conditional text generation using multiple auto-regressive language models from the HuggingFace Transformers library. It supports GPT-2, GPT-Neo, CTRL, OpenAI-GPT, XLNet, Transformer-XL, and XLM model families through a unified MODEL_CLASSES registry that maps model type strings to their corresponding model and tokenizer classes.
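The registry idea can be sketched as follows. This is an illustration rather than the script's own code: it stores class names as strings and resolves them lazily so the snippet loads without Transformers installed, whereas the script presumably maps to the class objects directly. The class names are ones exported by the Transformers library (the Transformer-XL classes may be missing from recent releases), and resolve_model_classes is a hypothetical helper.

```python
import importlib

# Model type string -> (model class name, tokenizer class name).
# Names are looked up in the `transformers` package at call time.
MODEL_CLASSES = {
    "gpt2": ("GPT2LMHeadModel", "GPT2Tokenizer"),
    "gptneo": ("GPTNeoForCausalLM", "GPT2Tokenizer"),
    "ctrl": ("CTRLLMHeadModel", "CTRLTokenizer"),
    "openai-gpt": ("OpenAIGPTLMHeadModel", "OpenAIGPTTokenizer"),
    "xlnet": ("XLNetLMHeadModel", "XLNetTokenizer"),
    "transfo-xl": ("TransfoXLLMHeadModel", "TransfoXLTokenizer"),
    "xlm": ("XLMWithLMHeadModel", "XLMTokenizer"),
}


def resolve_model_classes(model_type):
    """Hypothetical helper: return (model_class, tokenizer_class)."""
    key = model_type.lower()
    if key not in MODEL_CLASSES:
        # Fail before importing transformers so the error is cheap and clear.
        raise KeyError(
            f"Unsupported model type {model_type!r}; "
            f"expected one of {sorted(MODEL_CLASSES)}"
        )
    model_name, tokenizer_name = MODEL_CLASSES[key]
    transformers = importlib.import_module("transformers")
    return getattr(transformers, model_name), getattr(transformers, tokenizer_name)
```

With Transformers installed, resolve_model_classes("gpt2") would return the two classes, each of which can then load weights with from_pretrained().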

The script integrates DeepSpeed inference via the --ds-inference flag, which initializes the model with deepspeed.init_inference() using a GPT-2 transformer-layer injection policy and optional kernel replacement. This enables optimized inference with tensor parallelism and custom CUDA kernels. The script also supports FP16 inference via the --fp16 flag.
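A minimal sketch of the DeepSpeed path, under stated assumptions: wrap_with_deepspeed is a hypothetical helper name, and only the dtype and replace_with_kernel_inject arguments of deepspeed.init_inference() are shown. The injection policy and parallelism degree that the script passes are omitted because their exact values are not reproduced on this page, and accepted arguments can differ between DeepSpeed versions.

```python
def wrap_with_deepspeed(model, fp16=False):
    """Hypothetical helper: wrap a loaded HuggingFace model for DeepSpeed inference."""
    # Imported lazily so the baseline (non-DeepSpeed) path does not need
    # DeepSpeed to be installed.
    import torch
    import deepspeed

    engine = deepspeed.init_inference(
        model,
        dtype=torch.half if fp16 else torch.float,
        # Ask DeepSpeed to swap supported layers for its fused inference kernels.
        replace_with_kernel_inject=True,
    )
    # The returned engine wraps the original module; code that calls
    # model.generate(...) can keep using the wrapped module.
    return engine.module
```

Keeping the wrapping in one function makes it straightforward to run the same generation loop with and without DeepSpeed when comparing latency.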

A key feature is built-in latency benchmarking through the print_latency() function, which collects per-token generation latencies across all prompts, skips the first 10 as warmup, and reports the average along with the P50, P90, P95, P99, and P999 latencies. Prompts can be supplied one at a time (via --prompt or interactive input) or in batch from a file via --sample_input.
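The latency summary can be approximated with the sketch below. It is not the script's print_latency(): latency_stats is a hypothetical name, it returns a dictionary instead of printing, and the percentile rule (index int((n - 1) * q) into the sorted samples) is an assumption, since this page does not state which rule the script uses.

```python
def latency_stats(latency_set, warmup=10):
    """Summarize latencies after discarding the first `warmup` samples."""
    samples = sorted(latency_set[warmup:])
    count = len(samples)
    if count == 0:
        return {}  # nothing left to report once warmup is removed

    def percentile(q):
        # Assumed rule: pick the sample at position int((count - 1) * q).
        return samples[int((count - 1) * q)]

    return {
        "avg": sum(samples) / count,
        "p50": percentile(0.50),
        "p90": percentile(0.90),
        "p95": percentile(0.95),
        "p99": percentile(0.99),
        "p999": percentile(0.999),
    }
```

The results are in the same unit as the inputs. With fewer than about a thousand samples, P999 falls on the same sample as P99 or the maximum, so the tail figures are only meaningful for long runs.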

Usage

Use this script to benchmark and test text generation with various HuggingFace language models, optionally accelerated with DeepSpeed inference. It is particularly useful for comparing baseline versus DeepSpeed-accelerated inference latency.

Code Reference

Source Location

Repository: Microsoft_DeepSpeedExamples
File: inference/huggingface/text-generation/run-generation-script/test-run-generation.py
Lines: 1-350

Signature

def main():
    ...

def set_seed(args):
    ...

def adjust_length_to_model(length, max_sequence_length):
    ...

def print_latency(latency_set, title=""):
    ...

def prepare_ctrl_input(args, _, tokenizer, prompt_text):
    ...

def prepare_xlm_input(args, model, tokenizer, prompt_text):
    ...

def prepare_xlnet_input(args, _, tokenizer, prompt_text):
    ...

def prepare_transfoxl_input(args, _, tokenizer, prompt_text):
    ...
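As an illustration of one of these helpers, the sketch below follows the logic of adjust_length_to_model in the upstream HuggingFace run_generation.py example. It assumes the adapted script keeps that logic unchanged; the MAX_LENGTH cap of 10000 is taken from the upstream example and is likewise an assumption here.

```python
MAX_LENGTH = 10000  # assumed hard cap, as in the upstream HuggingFace example


def adjust_length_to_model(length, max_sequence_length):
    """Clamp the requested generation length to what the model supports."""
    if length < 0 and max_sequence_length > 0:
        # Negative length means "as long as the model allows".
        length = max_sequence_length
    elif 0 < max_sequence_length < length:
        # Never generate more than the model's maximum sequence length.
        length = max_sequence_length
    elif length < 0:
        # Model reports no limit: fall back to the cap to avoid an endless loop.
        length = MAX_LENGTH
    return length
```

For example, a request for 2000 tokens against a model limited to 1024 positions is clamped to 1024, while a request for 20 is left unchanged.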

Import

# This is a standalone script, run directly:
# python test-run-generation.py --model_type gpt2 --model_name_or_path gpt2

I/O Contract

Inputs

Name	Type	Required	Description
--model_type	str	Yes	Model architecture type: gpt2, gptneo, ctrl, openai-gpt, xlnet, transfo-xl, xlm
--model_name_or_path	str	Yes	Path to pretrained model or HuggingFace model name
--prompt	str	No	Text prompt for generation (interactive input if omitted)
--sample_input	str	No	Path to file containing multiple prompts (one per line)
--length	int	No	Maximum generation length (default: 20)
--temperature	float	No	Sampling temperature (default: 1.0)
--k	int	No	Top-k filtering value (default: 0)
--p	float	No	Top-p (nucleus) filtering value (default: 0.9)
--ds-inference	flag	No	Enable DeepSpeed inference optimization
--fp16	flag	No	Enable FP16 half-precision inference
--seed	int	No	Random seed for reproducibility (default: 42)
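The inputs above map naturally onto an argparse parser. The sketch below mirrors the table's flag names and documented defaults; build_parser is a hypothetical name, and the defaults for --prompt and --sample_input, which the table does not list, are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the inputs table."""
    parser = argparse.ArgumentParser(description="Conditional text generation (sketch)")
    parser.add_argument("--model_type", type=str, required=True)
    parser.add_argument("--model_name_or_path", type=str, required=True)
    parser.add_argument("--prompt", type=str, default="")        # assumed default
    parser.add_argument("--sample_input", type=str, default=None)  # assumed default
    parser.add_argument("--length", type=int, default=20)
    parser.add_argument("--temperature", type=float, default=1.0)
    parser.add_argument("--k", type=int, default=0)
    parser.add_argument("--p", type=float, default=0.9)
    # argparse stores "--ds-inference" as args.ds_inference.
    parser.add_argument("--ds-inference", action="store_true")
    parser.add_argument("--fp16", action="store_true")
    parser.add_argument("--seed", type=int, default=42)
    return parser
```

Note that the hyphen in --ds-inference becomes an underscore in the parsed namespace, so code reads the flag as args.ds_inference.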

Outputs

Name	Type	Description
generated_sequences	List[str]	List of generated text sequences printed to stdout
latency_stats	stdout	Latency statistics (average, P50, P90, P95, P99, P999)

Usage Examples

GPT-2 Generation with DeepSpeed

# Run GPT-2 generation with DeepSpeed inference and FP16
python test-run-generation.py \
    --model_type gpt2 \
    --model_name_or_path gpt2-large \
    --prompt "The future of artificial intelligence" \
    --length 100 \
    --fp16 \
    --ds-inference

# Batch generation from file
python test-run-generation.py \
    --model_type gpt2 \
    --model_name_or_path gpt2 \
    --sample_input prompts.txt \
    --length 50
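For the batch example, prompts.txt holds one prompt per line. The helper below is a sketch of how such a file could be read; load_prompts is a hypothetical name, and stripping whitespace and skipping blank lines are assumptions rather than documented behavior of the script.

```python
from pathlib import Path


def load_prompts(path):
    """Read one prompt per line, ignoring blank lines (assumed behavior)."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Because the latency report described above discards the first 10 measurements as warmup, a prompt file intended for benchmarking should be long enough to leave a useful number of samples afterwards.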
