Implementation:Deepspeedai DeepSpeed InferenceEngine Forward

Overview

A concrete tool, provided by the DeepSpeed library, for executing optimized inference forward passes and text generation.

Implementation Type

Method (instance methods of InferenceEngine)

Detailed Description

InferenceEngine.forward() executes the optimized model forward pass. If CUDA graphs are enabled, it captures the graph on the first invocation and replays it on subsequent calls. InferenceEngine._generate() delegates to the wrapped module's generate() method (the Hugging Face generation API) with DeepSpeed optimizations active; when the module defines generate(), the engine exposes this method as engine.generate().

Forward pass execution flow (lines L556-583):

If profiling is enabled and CUDA graphs are active, synchronize the GPU and record a wall-clock start time.
If CUDA graphs are enabled and not using local CUDA graphs:
- If a graph has already been captured (cuda_graph_created), call _graph_replay().
- Otherwise, call _create_cuda_graph() to capture the graph (with 3 warmup iterations), then replay it.
Otherwise (CUDA graphs are disabled, or local CUDA graphs are in use), call self.module(*inputs, **kwargs) directly.
If profiling is enabled with CUDA graphs, synchronize and record the elapsed time.
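
The dispatch logic above can be sketched without a GPU. The following is a minimal, hypothetical stand-in, not DeepSpeed's code: the class name ToyEngine, the events trace, and the model_times list are assumptions made for illustration, and capture and replay are stubbed out so that only the control flow is shown.

```python
import time


class ToyEngine:
    """Hypothetical stand-in that mirrors the control flow described above."""

    def __init__(self, module, enable_cuda_graph=False,
                 local_cuda_graph=False, profile=False):
        self.module = module
        self.enable_cuda_graph = enable_cuda_graph
        self.local_cuda_graph = local_cuda_graph
        self.profile = profile
        self.cuda_graph_created = False
        self.model_times = []  # elapsed milliseconds per timed call
        self.events = []       # trace of which path each call took

    def _create_cuda_graph(self, *inputs, **kwargs):
        # Stub for warmup + capture; the real work needs a CUDA device.
        self.events.append("capture")
        self.cuda_graph_created = True

    def _graph_replay(self, *inputs, **kwargs):
        # Stub for replay; here it simply calls the module.
        self.events.append("replay")
        return self.module(*inputs, **kwargs)

    def forward(self, *inputs, **kwargs):
        timed = self.profile and self.enable_cuda_graph
        start = time.perf_counter() if timed else None

        if self.enable_cuda_graph and not self.local_cuda_graph:
            if not self.cuda_graph_created:
                self._create_cuda_graph(*inputs, **kwargs)
            outputs = self._graph_replay(*inputs, **kwargs)
        else:
            self.events.append("eager")
            outputs = self.module(*inputs, **kwargs)

        if timed:
            self.model_times.append((time.perf_counter() - start) * 1e3)
        return outputs


eng = ToyEngine(lambda x: x * 2, enable_cuda_graph=True, profile=True)
eng.forward(3)      # first call: capture, then replay
eng.forward(4)      # later calls: replay only
print(eng.events)   # ['capture', 'replay', 'replay']
```

The sketch omits the GPU synchronization that a real timing measurement needs before reading the clock.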

CUDA graph creation (lines L496-513):

Create a new CUDA stream and wait for the current stream.
Run 3 warmup iterations on the new stream to initialize workspaces and cuBLAS handles.
Synchronize the streams by making the current stream wait for the warmup stream.
Create a CUDA graph object and capture the forward pass into it.
Store static input/output references for replay and mark the graph as created (cuda_graph_created), so later forward calls go straight to replay.
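
The capture steps above follow the general CUDA graph pattern in PyTorch. The sketch below uses the public torch.cuda API directly, which is an assumption: DeepSpeed's own helper is not reproduced here, and the function name capture_forward and the toy Linear layer are hypothetical. The demo portion only runs when a CUDA device is available.

```python
import torch


def capture_forward(module, *static_inputs, warmup=3):
    """Warm up on a side stream, then capture one forward pass into a graph."""
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.no_grad(), torch.cuda.stream(side):
        for _ in range(warmup):
            # Warmup lets lazily created workspaces and handles initialize
            # before capture begins.
            module(*static_inputs)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        # Tensors touched here become the graph's static buffers.
        static_output = module(*static_inputs)
    return graph, static_output


if torch.cuda.is_available():
    layer = torch.nn.Linear(4, 2).cuda().eval()
    static_x = torch.randn(8, 4, device="cuda")
    graph, static_out = capture_forward(layer, static_x)

    new_x = torch.randn(8, 4, device="cuda")
    static_x.copy_(new_x)   # write new data into the captured input buffer
    graph.replay()          # rerun the captured kernels
    torch.cuda.synchronize()
    print(torch.allclose(static_out, layer(new_x), atol=1e-5))
```

The inputs passed at capture time must stay alive and keep their shape, because replay reads from exactly those buffers.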

CUDA graph replay (lines L515-523):

Copy new input tensors into the static input buffers.
Copy new keyword argument tensors into static keyword buffers.
Replay the captured CUDA graph.
Return the static output reference. Because the same output buffer is reused on every call, the next replay overwrites its contents.
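
The replay semantics can be illustrated with a pure-Python toy that needs no GPU. This is a hypothetical model, not DeepSpeed code: the StaticReplay class and its list-based buffers are assumptions standing in for tensors, and the shape check reflects the assumption that a captured graph only operates on fixed-size buffers.

```python
class StaticReplay:
    """Toy model of graph replay: fixed buffers, in-place copies, shared output."""

    def __init__(self, fn, example):
        self.fn = fn
        self.static_input = list(example)        # buffer fixed at capture time
        self.static_output = [0] * len(example)  # output buffer, also fixed
        self._run()

    def _run(self):
        # Stand-in for replaying the graph: reads and writes only the
        # static buffers, never fresh objects.
        self.static_output[:] = [self.fn(v) for v in self.static_input]

    def replay(self, new_input):
        if len(new_input) != len(self.static_input):
            raise ValueError("replay requires the captured shape")
        self.static_input[:] = new_input  # in-place copy, like Tensor.copy_
        self._run()
        return self.static_output         # the same object on every call


r = StaticReplay(lambda v: v * v, [1, 2, 3])
first = r.replay([2, 3, 4])
kept = list(first)             # snapshot taken before the next replay
second = r.replay([5, 6, 7])
print(first is second)         # True: both names point at the static output
print(kept, second)            # [4, 9, 16] [25, 36, 49]
```

A caller that needs to keep a result across calls should copy it first, as kept does here; with tensors the equivalent would be a clone.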

Generation (lines L585-608):

Reset KV-cache if the model supports it.
Check num_beams and raise NotImplementedError if it is greater than 1 (beam search is not supported).
Validate that no input sequence is longer than the configured max_out_tokens.
Delegate to self.module.generate(*inputs, **kwargs).
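
The pre-generation checks above can be sketched as a standalone function. This is a hedged illustration under stated assumptions: precheck_generate is a hypothetical helper, input_ids is modelled as a list of token-id lists rather than a tensor, and raising RuntimeError for an over-long input is an assumption about the error type.

```python
def precheck_generate(kwargs, max_out_tokens):
    """Hypothetical helper mirroring the checks listed above."""
    # Beam search is rejected up front.
    num_beams = kwargs.get("num_beams", 1)
    if num_beams > 1:
        raise NotImplementedError("num_beams > 1 is not supported")

    # Every sequence in the batch must fit within the token budget.
    for row in kwargs.get("input_ids", []):
        if len(row) > max_out_tokens:
            raise RuntimeError(
                f"input of length {len(row)} exceeds "
                f"max_out_tokens={max_out_tokens}"
            )


# Passes: greedy decoding with a short prompt.
precheck_generate({"input_ids": [[1, 2, 3]], "num_beams": 1}, max_out_tokens=4)

# Fails: beam search requested.
try:
    precheck_generate({"input_ids": [[1, 2, 3]], "num_beams": 4}, max_out_tokens=4)
except NotImplementedError as err:
    print("rejected:", err)
```

Running the checks before delegating means an unsupported request fails immediately rather than partway through generation.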

Code Reference

Repository: https://github.com/deepspeedai/DeepSpeed
File: deepspeed/inference/engine.py
Lines: L556-583 (forward), L585-608 (_generate), L496-523 (CUDA graph helpers)
Signatures:
- def forward(self, *inputs, **kwargs) -> Any
- def _generate(self, *inputs, **kwargs) -> torch.Tensor
Import: Accessed via the InferenceEngine returned by deepspeed.init_inference()

Parameters

Method	Parameter	Type	Required	Description
forward	*inputs	Variable positional args	No	Model input tensors passed positionally (e.g., `input_ids`, `attention_mask`)
forward	**kwargs	Variable keyword args	No	Model inputs and additional arguments passed by keyword (e.g., `input_ids=tokens`)
_generate	*inputs	Variable positional args	No	Positional inputs for generation
_generate	**kwargs	Variable keyword args	No	Generation parameters (e.g., `max_new_tokens`, `do_sample`, `temperature`)

I/O

Direction	Name	Type	Description
Input	*inputs	Tensors	Model input tensors (input_ids, attention_mask, etc.)
Input	**kwargs	keyword arguments	Additional model/generation parameters
Output (forward)	outputs	Model-dependent	Model logits or output tuple depending on `return_tuple` config
Output (generate)	generated	torch.Tensor	Generated token sequences

Usage Example

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Setup
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16
)
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True
)

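# Forward pass (returns logits); no_grad avoids building an autograd graph
tokens = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = engine(**tokens)
logits = outputs.logits

# Text generation (greedy decoding; num_beams > 1 is not supported)
generated = engine.generate(
    input_ids=tokens["input_ids"],
    attention_mask=tokens["attention_mask"],
    max_new_tokens=100,
    do_sample=False
)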
text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)

Relationships

Principle:Deepspeedai_DeepSpeed_Inference_Execution

Metadata

Workflow: Inference_Engine_Optimization
Type: Implementation
Last Updated: 2026-02-09 00:00 GMT
