Implementation:Microsoft DeepSpeedExamples DSPipeline Text Generation

From Leeroopedia


Knowledge Sources
Domains Deep Learning, Text Generation, Inference
Last Updated 2026-02-07 12:00 GMT

Overview

DSPipeline is a helper class that mimics HuggingFace pipelines for text generation, supporting DeepSpeed meta-tensor initialization for fast large-model loading.

Description

The DSPipeline class provides a high-level interface for loading and running causal language models with optional DeepSpeed meta-tensor support. When is_meta is True, the model is instantiated on a meta device using deepspeed.OnDevice and AutoModelForCausalLM.from_config, which avoids allocating real memory during initialization. The class then generates a checkpoint JSON manifest (either from a pre-existing ds_inference_config.json or by scanning for .bin and .pt files) that DeepSpeed Inference uses to shard and load weights efficiently.
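The manifest step described above can be sketched with the standard library alone. This is a minimal illustration, not the class's actual implementation: the helper names `build_checkpoint_manifest` and `resolve_manifest`, the output file name, and the manifest keys and values are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def build_checkpoint_manifest(repo_root, model_type="BLOOM"):
    """Collect .bin and .pt file names under repo_root into a manifest dict."""
    root = Path(repo_root)
    names = sorted(
        p.name for p in root.rglob("*")
        if p.is_file() and p.suffix in {".bin", ".pt"}
    )
    # Assumption: the key names and the "type"/"version" values are illustrative.
    return {"type": model_type, "checkpoints": names, "version": 1.0}


def resolve_manifest(repo_root, out_name="checkpoints.json"):
    """Prefer an existing ds_inference_config.json; otherwise write a scanned manifest."""
    root = Path(repo_root)
    existing = root / "ds_inference_config.json"
    if existing.exists():
        return existing
    out_path = root / out_name
    manifest = build_checkpoint_manifest(root)
    out_path.write_text(json.dumps(manifest), encoding="utf-8")
    return out_path


if __name__ == "__main__":
    # Build a throwaway directory with two weight shards and one unrelated file.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("shard-00001.bin", "shard-00002.pt", "config.json"):
            (Path(tmp) / name).write_text("", encoding="utf-8")
        manifest_path = resolve_manifest(tmp)
        print(json.loads(manifest_path.read_text(encoding="utf-8")))
```

The sketch checks for the pre-existing config first, so a repository that ships its own ds_inference_config.json is never rescanned.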

The class sets up the tokenizer with left-padding and reuses eos_token as pad_token, supports device placement on both CPU and GPU, and special-cases LlamaTokenizerFast: for Llama models the generate call does not accept token_type_ids, so that field is not passed to it. The callable interface accepts a list of text prompts and returns the decoded output strings via generate_outputs.
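Left-padding matters for causal generation because new tokens are appended on the right, so every prompt in a batch should end at the last position. The following sketch shows the idea on plain lists of token ids. The helper `left_pad` and the token ids are hypothetical; a real tokenizer does this internally when configured for left-padding.

```python
def left_pad(batch, pad_id):
    """Pad token-id lists on the left to equal length; return ids and attention masks."""
    width = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        gap = width - len(seq)
        # Padding goes in front, so the real tokens stay adjacent to the generated ones.
        input_ids.append([pad_id] * gap + list(seq))
        attention_mask.append([0] * gap + [1] * len(seq))
    return input_ids, attention_mask


if __name__ == "__main__":
    # Hypothetical ids; 2 stands in for the eos token reused as the pad token.
    EOS_ID = 2
    ids, mask = left_pad([[11, 12, 13], [21]], pad_id=EOS_ID)
    print(ids)   # [[11, 12, 13], [2, 2, 21]]
    print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

The attention mask marks the padded positions with 0 so the model can ignore them.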

A companion Performance class provides a static method print_perf_stats for reporting average per-token latency, bandwidth in GB/s, and throughput in TFlops/s after stripping warmup iterations from a latency measurement set.
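To make the three reported quantities concrete, here is a rough sketch of how such figures can be derived from a list of per-token latencies. The function `summarize_latency`, its parameters, and the formulas are assumptions for illustration: they are common back-of-envelope estimates (weights read once per token, about 2 FLOPs per parameter per token) and may differ from what print_perf_stats actually computes.

```python
def summarize_latency(latency_set, num_parameters, bytes_per_param, batch_size, warmup=3):
    """Return avg per-token latency (ms), bandwidth (GB/s), and throughput (TFlops/s)."""
    # Drop the first `warmup` measurements, which usually include one-time setup cost.
    steady = list(latency_set)[warmup:]
    if not steady:
        return None
    avg_s = sum(steady) / len(steady)
    # Assumption: every parameter is read from memory once per generated token.
    bytes_per_token = num_parameters * bytes_per_param
    # Assumption: roughly 2 floating-point operations per parameter per token.
    flops_per_step = 2 * num_parameters * batch_size
    return {
        "latency_ms": avg_s * 1000,
        "bandwidth_gb_s": bytes_per_token / avg_s / 1e9,
        "tflops_s": flops_per_step / avg_s / 1e12,
    }


if __name__ == "__main__":
    # Hypothetical measurements in seconds: three slow warmup steps, then steady state.
    latencies = [0.5, 0.4, 0.3, 0.02, 0.02, 0.02]
    stats = summarize_latency(latencies, num_parameters=3_000_000_000,
                              bytes_per_param=2, batch_size=1)
    for key, value in stats.items():
        print(f"{key}: {value:.2f}")
```

Returning None when no measurements survive the warmup cut avoids a division by zero on very short runs.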

Usage

Use DSPipeline when running DeepSpeed Inference for text generation tasks, especially with large models (e.g., BLOOM, Llama) that benefit from meta-tensor initialization and tensor-parallel sharding. Instantiate with a model name and call directly with input prompts to generate text.

Code Reference

Source Location

Signature

class DSPipeline():
    def __init__(self,
                 model_name='bigscience/bloom-3b',
                 dtype=torch.float16,
                 is_meta=True,
                 device=-1,
                 checkpoint_path=None,
                 trust_remote_code=False):
    def __call__(self, inputs=["test"], num_tokens=100, do_sample=False):
    def _generate_json(self, checkpoint_path=None):
    def generate_outputs(self, inputs=["test"], num_tokens=100, do_sample=False):

class Performance():
    def print_perf_stats(latency_set, config, dtype, batch_size, warmup=3):

Import

from utils import DSPipeline, Performance

I/O Contract

Inputs

Name | Type | Required | Description
model_name | str | No | HuggingFace model identifier (default: 'bigscience/bloom-3b')
dtype | torch.dtype | No | Model precision (default: torch.float16)
is_meta | bool | No | Enable meta-tensor initialization for large models (default: True)
device | int, str, or torch.device | No | Target device; -1 for CPU, integer for GPU index (default: -1)
checkpoint_path | str | No | Local path to model checkpoint directory; None triggers snapshot_download
trust_remote_code | bool | No | Whether to trust remote code when loading model config (default: False)
inputs | list of str | No | Text prompts for generation (default: ["test"])
num_tokens | int | No | Maximum number of new tokens to generate (default: 100)
do_sample | bool | No | Enable sampling-based generation (default: False)
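The device parameter accepts several forms, so it may help to see how they could map onto a single device name. The sketch below avoids importing torch so that it runs anywhere. `resolve_device_name` is a hypothetical helper, and the "cuda:N" naming follows the usual PyTorch convention rather than anything stated on this page.

```python
def resolve_device_name(device=-1):
    """Map an int, str, or device-like object to a device name string."""
    if isinstance(device, int):
        # Negative integers select the CPU; other integers are GPU indices.
        return "cpu" if device < 0 else f"cuda:{device}"
    # Strings are returned unchanged; other objects fall back to their string form.
    return str(device)


if __name__ == "__main__":
    for value in (-1, 0, "cuda:1"):
        print(repr(value), "->", resolve_device_name(value))
```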

Outputs

Name | Type | Description
outputs | list of str | Decoded generated text strings, one per input prompt

Usage Examples

import torch

from utils import DSPipeline

# Initialize with meta tensors for the BLOOM model on GPU 0
pipe = DSPipeline(
    model_name="bigscience/bloom-3b",
    dtype=torch.float16,
    is_meta=True,
    device=0
)

# Generate up to 50 new tokens for each prompt
results = pipe(inputs=["Once upon a time"], num_tokens=50)
for text in results:
    print(text)
