Implementation:Microsoft DeepSpeedExamples DSPipeline Text Generation
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Text Generation, Inference |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
DSPipeline is a helper class that mimics HuggingFace pipelines for text generation, supporting DeepSpeed meta-tensor initialization for fast large-model loading.
Description
The DSPipeline class provides a high-level interface for loading and running causal language models with optional DeepSpeed meta-tensor support. When is_meta is True, the model is instantiated on a meta device using deepspeed.OnDevice and AutoModelForCausalLM.from_config, which avoids allocating real memory during initialization. The class then generates a checkpoint JSON manifest (either from a pre-existing ds_inference_config.json or by scanning for .bin and .pt files) that DeepSpeed Inference uses to shard and load weights efficiently.
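The manifest step described above can be sketched with the standard library alone. This is a minimal illustration, not the class's actual code: the function name write_checkpoint_manifest, its parameters, and the manifest keys ("type", "checkpoints", "version") are assumptions chosen for the example.

```python
import json
from pathlib import Path


def write_checkpoint_manifest(repo_root, out_path="checkpoints.json", model_type="BLOOM"):
    """Return the path of a checkpoint manifest for the weights under repo_root.

    Hypothetical helper: the name, arguments, and manifest layout are assumptions.
    """
    root = Path(repo_root)

    # Prefer a manifest that already ships with the checkpoint directory.
    shipped = root / "ds_inference_config.json"
    if shipped.exists():
        return str(shipped)

    # Otherwise list every .bin and .pt weight file found under the root.
    weight_files = sorted(
        p.name for p in root.rglob("*") if p.is_file() and p.suffix in {".bin", ".pt"}
    )
    manifest = {"type": model_type, "checkpoints": weight_files, "version": 1.0}
    Path(out_path).write_text(json.dumps(manifest), encoding="utf-8")
    return str(out_path)
```

Calling write_checkpoint_manifest on a directory that holds a.bin and b.pt would write a manifest listing those two file names; if ds_inference_config.json is present, its path is returned and nothing is written.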
The class configures the tokenizer for left padding and sets pad_token to eos_token, places the model and inputs on the requested CPU or GPU device, and special-cases LlamaTokenizerFast: because the Llama generate call does not accept token_type_ids, only input_ids are passed to it. The callable interface accepts a list of text prompts and returns the decoded output strings via generate_outputs.
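Left padding matters for batched causal generation because new tokens are appended at the right edge of each row, so every prompt should end in the last column. The sketch below illustrates the idea in plain Python; left_pad and its arguments are hypothetical names for this example, and a real tokenizer does the equivalent work internally.

```python
def left_pad(batch, pad_id):
    """Left-pad token-id lists to a common length and build attention masks."""
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = width - len(ids)
        # Padding goes in front so the prompt's last token sits in the final column.
        input_ids.append([pad_id] * n_pad + list(ids))
        # The mask is 0 over padding and 1 over real tokens.
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8]], pad_id=2)
print(ids)   # [[5, 6, 7], [2, 2, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```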
A companion Performance class provides a static method, print_perf_stats, that discards the first warmup entries (default: 3) of a latency measurement set and then reports the average per-token latency, the bandwidth in GB/s, and the throughput in TFlops/s.
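The following sketch shows one way such statistics can be derived from a list of per-token latencies in seconds. It is not the source's implementation: the function perf_stats, the parameter-count estimate of roughly 12 × layers × hidden_size², and the figure of 2 floating-point operations per parameter per token are all assumptions made for illustration.

```python
def perf_stats(latency_set, num_layers, hidden_size, dtype="float16", batch_size=1, warmup=3):
    """Summarize per-token latencies (seconds) after dropping warmup runs.

    Hypothetical helper; the estimates below are assumptions, not measured values.
    """
    measured = list(latency_set)[warmup:]
    if not measured:
        return None  # Nothing left once warmup entries are removed.

    avg = sum(measured) / len(measured)
    # Assumption: about 12 * hidden_size^2 parameters per transformer layer.
    num_parameters = 12 * num_layers * hidden_size * hidden_size
    # Bytes per parameter; unknown dtypes fall back to 1 byte.
    num_bytes = {"float16": 2, "float32": 4}.get(dtype, 1)

    return {
        "latency_ms": avg * 1000,
        # Every parameter is read once per generated token.
        "bandwidth_gb_s": num_parameters * num_bytes / avg / 1e9,
        # Assumption: 2 floating-point operations per parameter per token.
        "tflops": 2 * num_parameters * batch_size / avg / 1e12,
    }
```

Returning a dictionary instead of printing keeps the sketch easy to check; a caller can format the three values however it likes.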
Usage
Use DSPipeline when running DeepSpeed Inference for text generation tasks, especially with large models (e.g., BLOOM, Llama) that benefit from meta-tensor initialization and tensor-parallel sharding. Instantiate with a model name and call directly with input prompts to generate text.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: inference/huggingface/text-generation/utils.py
- Lines: 1-147
Signature
class DSPipeline():
    def __init__(self,
                 model_name='bigscience/bloom-3b',
                 dtype=torch.float16,
                 is_meta=True,
                 device=-1,
                 checkpoint_path=None,
                 trust_remote_code=False):
    def __call__(self, inputs=["test"], num_tokens=100, do_sample=False):
    def _generate_json(self, checkpoint_path=None):
    def generate_outputs(self, inputs=["test"], num_tokens=100, do_sample=False):

class Performance():
    def print_perf_stats(latency_set, config, dtype, batch_size, warmup=3):
Import
from utils import DSPipeline, Performance
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | No | HuggingFace model identifier (default: 'bigscience/bloom-3b') |
| dtype | torch.dtype | No | Model precision (default: torch.float16) |
| is_meta | bool | No | Enable meta-tensor initialization for large models (default: True) |
| device | int, str, or torch.device | No | Target device; -1 for CPU, integer for GPU index (default: -1) |
| checkpoint_path | str | No | Local path to the model checkpoint directory; None downloads the checkpoint via snapshot_download (default: None) |
| trust_remote_code | bool | No | Whether to trust remote code when loading model config (default: False) |
| inputs | list of str | No | Text prompts for generation (default: ["test"]) |
| num_tokens | int | No | Maximum number of new tokens to generate (default: 100) |
| do_sample | bool | No | Enable sampling-based generation (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| outputs | list of str | Decoded generated text strings, one per input prompt |
Usage Examples
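import torch

from utils import DSPipeline

# Initialize with meta tensors for a BLOOM model on GPU 0
pipe = DSPipeline(
    model_name="bigscience/bloom-3b",
    dtype=torch.float16,
    is_meta=True,
    device=0
)

# Note: with is_meta=True the weights are expected to be loaded by
# DeepSpeed Inference (deepspeed.init_inference) before generation.

# Generate text
results = pipe(inputs=["Once upon a time"], num_tokens=50)
for text in results:
    print(text)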
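To feed Performance.print_perf_stats, a caller first needs a set of per-token latencies. The sketch below shows one way to collect them around any pipeline-like callable. The helper time_per_token and the stand-in echo_pipe are hypothetical names used only so the example runs without a model; a real DSPipeline instance would take the place of echo_pipe.

```python
import time


def time_per_token(pipe, prompts, num_tokens, trials=5):
    """Call pipe repeatedly; return (last outputs, per-token latencies in seconds)."""
    latencies = []
    outputs = []
    for _ in range(trials):
        start = time.perf_counter()
        outputs = pipe(prompts, num_tokens=num_tokens)
        # On a GPU, synchronize the device before reading the clock.
        latencies.append((time.perf_counter() - start) / num_tokens)
    return outputs, latencies


def echo_pipe(inputs, num_tokens=100, do_sample=False):
    """Hypothetical stand-in for a DSPipeline instance; returns prompts unchanged."""
    return list(inputs)


outputs, latencies = time_per_token(echo_pipe, ["Once upon a time"], num_tokens=50, trials=4)
print(len(latencies))  # 4
```

Dividing the elapsed time by num_tokens assumes that the model actually generates num_tokens new tokens on every call; generation that stops early at an end-of-sequence token would make the per-token figure look better than it is.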