Implementation:Microsoft DeepSpeedExamples DSPipeline Text Generation

Knowledge Sources	Microsoft_DeepSpeedExamples
Domains	Deep Learning, Text Generation, Inference
Last Updated	2026-02-07 12:00 GMT

Overview

DSPipeline is a helper class that mimics HuggingFace pipelines for text generation, supporting DeepSpeed meta-tensor initialization for fast large-model loading.

Description

The DSPipeline class provides a high-level interface for loading and running causal language models with optional DeepSpeed meta-tensor support. When is_meta is True, the model is instantiated on a meta device using deepspeed.OnDevice and AutoModelForCausalLM.from_config, which avoids allocating real memory during initialization. The class then generates a checkpoint JSON manifest (either from a pre-existing ds_inference_config.json or by scanning for .bin and .pt files) that DeepSpeed Inference uses to shard and load weights efficiently.
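The manifest step described above can be sketched with the standard library alone. This is a minimal illustration, not the code from utils.py: the function names `build_checkpoint_manifest` and `resolve_manifest_path` are hypothetical, and the manifest keys (`type`, `checkpoints`, `version`) and the `"BLOOM"` type value are assumptions about the JSON layout DeepSpeed Inference expects.

```python
import json
from pathlib import Path


def build_checkpoint_manifest(repo_root, model_type="BLOOM"):
    """Collect .bin and .pt file names under repo_root into a manifest dict."""
    root = Path(repo_root)
    files = sorted(
        entry.name
        for entry in root.rglob("*")
        if entry.is_file() and entry.suffix in {".bin", ".pt"}
    )
    # Key names and the "type" value are assumptions made for illustration.
    return {"type": model_type, "checkpoints": files, "version": 1.0}


def resolve_manifest_path(repo_root, out_path="checkpoints.json"):
    """Reuse an existing ds_inference_config.json, else write a new manifest."""
    existing = Path(repo_root) / "ds_inference_config.json"
    if existing.exists():
        return str(existing)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(build_checkpoint_manifest(repo_root), f)
    return str(out_path)
```

Other weight formats (for example .safetensors) are ignored by this sketch, because the description above mentions only .bin and .pt files.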

The class handles tokenizer setup with left-padding (setting pad_token to eos_token), device placement for both CPU and GPU, and a special case for LlamaTokenizerFast: because the Llama generate call does not accept token_type_ids, that argument is not passed for such models. The callable interface accepts a list of text prompts and returns decoded output strings via generate_outputs.
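Left-padding matters for batched generation because a decoder-only model appends new tokens at the right end of each sequence, so padding on the left keeps every prompt's last real token next to its continuation. The pure-Python sketch below illustrates the idea on raw token-id lists. The `left_pad` helper is hypothetical; in practice the tokenizer performs this step.

```python
def left_pad(batch, pad_id):
    """Left-pad token-id lists to a common length and build attention masks."""
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = width - len(ids)
        # Padding goes in front so the real tokens stay right-aligned.
        input_ids.append([pad_id] * n_pad + list(ids))
        # Mask value 0 marks padding, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask
```

Reusing the end-of-sequence id as `pad_id` mirrors the pad_token = eos_token setting described above. The attention mask is what tells the model to ignore those positions.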

A companion Performance class provides a static method print_perf_stats for reporting average per-token latency, bandwidth in GB/s, and throughput in TFlops/s after stripping warmup iterations from a latency measurement set.
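As a rough illustration of how such statistics can be derived, the sketch below drops the warmup measurements and computes the three figures. It is not the source of print_perf_stats: the `perf_stats` name is hypothetical, and the parameter-count estimate (12 × layers × hidden²) and the 2 FLOPs per parameter per token figure are common approximations used here as assumptions.

```python
def perf_stats(latencies, num_layers, hidden_size, bytes_per_param,
               batch_size, warmup=3):
    """Return average latency (s), bandwidth (GB/s) and throughput (TFLOP/s)."""
    measured = list(latencies)[warmup:]  # drop warmup iterations
    if not measured:
        return None
    avg = sum(measured) / len(measured)
    # Assumption: roughly 12 * layers * hidden^2 parameters in the decoder stack.
    num_params = 12 * num_layers * hidden_size * hidden_size
    # Each generated token reads every weight once.
    bandwidth_gbs = num_params * bytes_per_param / avg / 1e9
    # Assumption: about 2 FLOPs per parameter per token, per batch element.
    tflops = 2 * num_params * batch_size / avg / 1e12
    return {
        "avg_latency_s": avg,
        "bandwidth_gbs": bandwidth_gbs,
        "tflops": tflops,
    }
```

The latencies are expected to be per-token times in seconds. If fewer measurements than `warmup` are supplied, the function returns None rather than reporting a misleading average.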

Usage

Use DSPipeline when running DeepSpeed Inference for text generation tasks, especially with large models (e.g., BLOOM, Llama) that benefit from meta-tensor initialization and tensor-parallel sharding. Instantiate with a model name and call directly with input prompts to generate text.

Code Reference

Source Location

Repository: Microsoft_DeepSpeedExamples
File: inference/huggingface/text-generation/utils.py
Lines: 1-147

Signature

class DSPipeline():
    def __init__(self,
                 model_name='bigscience/bloom-3b',
                 dtype=torch.float16,
                 is_meta=True,
                 device=-1,
                 checkpoint_path=None,
                 trust_remote_code=False):
    def __call__(self, inputs=["test"], num_tokens=100, do_sample=False):
    def _generate_json(self, checkpoint_path=None):
    def generate_outputs(self, inputs=["test"], num_tokens=100, do_sample=False):

class Performance():
    def print_perf_stats(latency_set, config, dtype, batch_size, warmup=3):

Import

from utils import DSPipeline, Performance

I/O Contract

Inputs

Name	Type	Required	Description
model_name	str	No	HuggingFace model identifier (default: 'bigscience/bloom-3b')
dtype	torch.dtype	No	Model precision (default: torch.float16)
is_meta	bool	No	Enable meta-tensor initialization for large models (default: True)
device	int, str, or torch.device	No	Target device; -1 for CPU, integer for GPU index (default: -1)
checkpoint_path	str	No	Local path to model checkpoint directory; if None, the checkpoint is fetched with snapshot_download (default: None)
trust_remote_code	bool	No	Whether to trust remote code when loading model config (default: False)
inputs	list of str	No	Text prompts for generation (default: ["test"])
num_tokens	int	No	Maximum number of new tokens to generate (default: 100)
do_sample	bool	No	Enable sampling-based generation (default: False)
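The device convention in the table above (-1 for CPU, a non-negative integer for a GPU index) can be sketched as a small normalization function. This is an assumption-laden illustration: `resolve_device` is a hypothetical name, and it returns plain device strings so that it runs without PyTorch, whereas the class itself works with torch.device objects.

```python
def resolve_device(device):
    """Map an int or str device argument to a device string."""
    if isinstance(device, str):
        return device  # already a device string such as "cuda:1"
    if isinstance(device, int):
        # Negative values select the CPU; others are treated as GPU indices.
        return "cpu" if device < 0 else f"cuda:{device}"
    raise TypeError(f"unsupported device argument: {device!r}")
```

With PyTorch available, the resulting string could be passed to torch.device.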

Outputs

Name	Type	Description
outputs	list of str	Decoded generated text strings, one per input prompt

Usage Examples

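import torch

from utils import DSPipeline

# Initialize with meta tensors for BLOOM model
pipe = DSPipeline(
    model_name="bigscience/bloom-3b",
    dtype=torch.float16,
    is_meta=True,
    device=0
)

# Note: with is_meta=True the weights are not materialized at construction;
# DeepSpeed Inference is expected to load them before text is generated.

# Generate text; the result is a list with one string per input prompt
results = pipe(inputs=["Once upon a time"], num_tokens=50)
for text in results:
    print(text)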
