Implementation:Deepspeedai DeepSpeed InferenceEngine Forward
Overview
Concrete tool, provided by the DeepSpeed library, for executing optimized inference forward passes and text generation.
Implementation Type
Method (instance methods of InferenceEngine)
Detailed Description
InferenceEngine.forward() executes the optimized model forward pass. If CUDA graphs are enabled, it captures the graph on first invocation and replays it on subsequent calls. InferenceEngine._generate() delegates to HuggingFace's generate() method with DeepSpeed optimizations active.
Forward pass execution flow (lines L556-583):
- If profiling is enabled and CUDA graphs are active, synchronize the GPU and record a wall-clock start time.
- If CUDA graphs are enabled and not using local CUDA graphs:
  - If a graph has already been captured (cuda_graph_created), call _graph_replay().
  - Otherwise, call _create_cuda_graph() to capture the graph (with 3 warmup iterations), then replay it.
- Otherwise (CUDA graphs disabled, or local CUDA graphs in use), call self.module(*inputs, **kwargs) directly.
- If profiling is enabled with CUDA graphs, synchronize and record the elapsed time.
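The flow above can be sketched as a small pure-Python model. This is an illustration only, not DeepSpeed's code: the class name ForwardDispatchSketch, the events list, and the constructor flags are hypothetical, and the capture and replay methods are stubs that do no GPU work.

```python
import time


class ForwardDispatchSketch:
    """Toy stand-in for the dispatch logic; not DeepSpeed's InferenceEngine."""

    def __init__(self, module, enable_cuda_graph, local_cuda_graph=False, profile=False):
        self.module = module
        self.enable_cuda_graph = enable_cuda_graph
        self.local_cuda_graph = local_cuda_graph
        self.profile = profile
        self.cuda_graph_created = False
        self.events = []       # hypothetical trace of which path each call took
        self.model_times = []  # elapsed wall-clock seconds per profiled call

    def _create_cuda_graph(self, *inputs, **kwargs):
        # A real engine would warm up and capture here; the toy only records it.
        self.events.append("capture")
        self.cuda_graph_created = True

    def _graph_replay(self, *inputs, **kwargs):
        # A real engine would replay the captured graph; the toy calls the module.
        self.events.append("replay")
        return self.module(*inputs, **kwargs)

    def forward(self, *inputs, **kwargs):
        timed = self.profile and self.enable_cuda_graph
        # A real engine synchronizes the GPU before reading the clock.
        start = time.time() if timed else None

        if self.enable_cuda_graph and not self.local_cuda_graph:
            if not self.cuda_graph_created:
                self._create_cuda_graph(*inputs, **kwargs)
            outputs = self._graph_replay(*inputs, **kwargs)
        else:
            self.events.append("eager")
            outputs = self.module(*inputs, **kwargs)

        if timed:
            self.model_times.append(time.time() - start)
        return outputs


engine = ForwardDispatchSketch(lambda x: x * 2, enable_cuda_graph=True, profile=True)
engine.forward(1)
engine.forward(2)
print(engine.events)  # ['capture', 'replay', 'replay']
```

Only the first call pays the capture cost; every later call goes straight to replay, which is why the captured flag is checked before anything else in the graph branch.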
CUDA graph creation (lines L496-513):
- Create a new CUDA stream and wait for the current stream.
- Run 3 warmup iterations on the new stream to initialize workspaces and cuBLAS handles.
- Synchronize streams.
- Create a CUDA graph object and capture the forward pass into it.
- Store static input/output references for replay.
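The capture steps above follow the general PyTorch CUDA graph pattern. The sketch below uses plain torch.cuda APIs rather than DeepSpeed's own code path, so it should be read as an approximation under that assumption. The function name capture_forward and the tiny Linear model are hypothetical, and the demo only runs when PyTorch and a CUDA device are available.

```python
try:
    import torch
except ImportError:  # lets the file load on machines without PyTorch
    torch = None


def capture_forward(module, static_input, warmup_iters=3):
    """Warm up on a side stream, then capture one forward pass into a CUDA graph."""
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())

    # Warmup runs initialize workspaces and cuBLAS handles outside the capture.
    with torch.no_grad(), torch.cuda.stream(side_stream):
        for _ in range(warmup_iters):
            module(static_input)
    torch.cuda.current_stream().wait_stream(side_stream)

    # Capture: kernels launched inside the context are recorded, not executed eagerly.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = module(static_input)

    # The caller keeps static_input and static_output alive for later replays.
    return graph, static_output


if torch is not None and torch.cuda.is_available():
    model = torch.nn.Linear(8, 4).cuda().eval()
    static_in = torch.zeros(2, 8, device="cuda")
    graph, static_out = capture_forward(model, static_in)
    print(tuple(static_out.shape))
```

The captured graph is bound to the memory addresses of static_in and static_out, which is why those references are stored rather than recreated per call.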
CUDA graph replay (lines L515-523):
- Copy new input tensors into the static input buffers.
- Copy new keyword argument tensors into static keyword buffers.
- Replay the captured CUDA graph.
- Return the static output reference.
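One consequence of returning the static output reference is that every replay hands back the same object, so an earlier result is overwritten by the next call unless the caller copies it. The toy model below illustrates this with Python lists standing in for GPU tensors; the class StaticBufferReplaySketch and its method names are hypothetical and no CUDA is involved.

```python
class StaticBufferReplaySketch:
    """Pure-Python model of replay; lists stand in for GPU tensors."""

    def __init__(self, fn, example_inputs):
        # "Capture": pin the buffers that the recorded work reads and writes.
        self.fn = fn
        self.static_inputs = [list(x) for x in example_inputs]
        self.static_output = []
        self._run_recorded_work()

    def _run_recorded_work(self):
        # Stands in for graph replay: reads the static inputs, writes in place.
        self.static_output[:] = self.fn(*self.static_inputs)

    def replay(self, *inputs):
        for buf, new in zip(self.static_inputs, inputs):
            if len(new) != len(buf):
                raise ValueError("replay inputs must match the captured shape")
            buf[:] = new  # in-place copy, analogous to tensor.copy_()
        self._run_recorded_work()
        return self.static_output  # the same object on every call


def add(a, b):
    return [x + y for x, y in zip(a, b)]


graph = StaticBufferReplaySketch(add, ([0, 0], [0, 0]))
first = graph.replay([1, 2], [10, 20])
kept = list(first)  # snapshot taken before the next replay
second = graph.replay([3, 4], [30, 40])
print(first is second, kept, second)  # True [11, 22] [33, 44]
```

In a real setting the equivalent of the snapshot would be cloning the returned tensor when a result must outlive the next forward call.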
Generation (lines L585-608):
- Reset KV-cache if the model supports it.
- Check num_beams and raise NotImplementedError if greater than 1.
- Validate that input token lengths do not exceed max_out_tokens.
- Delegate to self.module.generate(*inputs, **kwargs).
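The two validation steps above can be sketched as a standalone helper. This is a simplified illustration, not the library's implementation: the function name validate_generate_kwargs is hypothetical, input_ids is modeled as a list of token-id lists rather than a tensor, and the exception type used for the length check is an assumption.

```python
def validate_generate_kwargs(kwargs, max_out_tokens):
    """Hypothetical helper mirroring the pre-generation checks; raises on invalid requests."""
    num_beams = kwargs.get("num_beams", 1)
    if num_beams > 1:
        raise NotImplementedError(
            f"num_beams={num_beams} requested, but only num_beams=1 is handled in this sketch"
        )

    # Each row of input_ids is one prompt; every prompt must fit the token budget.
    for sequence in kwargs.get("input_ids", []):
        if len(sequence) > max_out_tokens:
            # Assumption: the exception type here is illustrative.
            raise RuntimeError(
                f"input length {len(sequence)} exceeds max_out_tokens={max_out_tokens}"
            )


# A valid request passes silently.
validate_generate_kwargs({"input_ids": [[1, 2, 3]], "max_new_tokens": 16}, max_out_tokens=1024)

# Beam search is rejected before any generation work starts.
try:
    validate_generate_kwargs({"input_ids": [[1, 2, 3]], "num_beams": 4}, max_out_tokens=1024)
except NotImplementedError as err:
    print("rejected:", err)
```

Running both checks before delegating means an unsupported request fails fast instead of partway through generation.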
Code Reference
- Repository: https://github.com/deepspeedai/DeepSpeed
- File: deepspeed/inference/engine.py
- Lines: L556-583 (forward), L585-608 (_generate), L496-523 (CUDA graph helpers)
- Signatures:
  def forward(self, *inputs, **kwargs) -> Any
  def _generate(self, *inputs, **kwargs) -> torch.Tensor
- Import: Accessed via the InferenceEngine returned by deepspeed.init_inference()
Parameters
| Method | Parameter | Type | Required | Description |
|---|---|---|---|---|
| forward | *inputs | Variable positional args | No | Model input tensors passed positionally (e.g., input_ids, attention_mask) |
| forward | **kwargs | Variable keyword args | No | Additional model arguments passed by name (e.g., input_ids=tokens) |
| _generate | *inputs | Variable positional args | No | Positional inputs for generation |
| _generate | **kwargs | Variable keyword args | No | Generation parameters (e.g., max_new_tokens, do_sample, temperature) |
I/O
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | *inputs | Tensors | Model input tensors (input_ids, attention_mask, etc.) |
| Input | **kwargs | keyword arguments | Additional model/generation parameters |
| Output (forward) | outputs | Model-dependent | Model logits or output tuple depending on return_tuple config |
| Output (generate) | generated | torch.Tensor | Generated token sequences |
Usage Example
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Setup
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16
)
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True
)

# Forward pass (returns logits)
tokens = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = engine(**tokens)
logits = outputs.logits

# Text generation
generated = engine.generate(
    input_ids=tokens["input_ids"],
    max_new_tokens=100,
    do_sample=False
)
text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/inference-tutorial/
- https://developer.nvidia.com/blog/cuda-graphs/
Relationships
Principle:Deepspeedai_DeepSpeed_Inference_Execution
Metadata
- Workflow: Inference_Engine_Optimization
- Type: Implementation
- Last Updated: 2026-02-09 00:00 GMT