Implementation:Deepspeedai DeepSpeed InferenceEngine Forward

From Leeroopedia


Overview

Concrete tool, provided by the DeepSpeed library, for executing optimized inference forward passes and text generation.

Implementation Type

Method (instance methods of InferenceEngine)

Detailed Description

InferenceEngine.forward() executes the optimized model forward pass. If CUDA graphs are enabled, it captures the graph on first invocation and replays it on subsequent calls. InferenceEngine._generate() delegates to HuggingFace's generate() method with DeepSpeed optimizations active.

Forward pass execution flow (lines L556-583):

  1. If profiling is enabled and CUDA graphs are active, synchronize the GPU and record a wall-clock start time.
  2. If CUDA graphs are enabled and not using local CUDA graphs:
    • If a graph has already been captured (cuda_graph_created), call _graph_replay().
    • Otherwise, call _create_cuda_graph() to capture the graph (with 3 warmup iterations), then replay it.
  3. Otherwise (CUDA graphs are disabled, or local CUDA graphs are in use), call self.module(*inputs, **kwargs) directly.
  4. If profiling is enabled with CUDA graphs, synchronize and record the elapsed time.
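
The dispatch logic in steps 2 and 3 can be sketched in plain Python with no GPU or DeepSpeed installed. This is a toy stand-in, not the library's code: the class name ForwardDispatchSketch, the events list, and the constructor flags are assumptions made for illustration. Only _create_cuda_graph, _graph_replay, and cuda_graph_created are names taken from the description above, and here they are stubs that record which path was taken.

```python
class ForwardDispatchSketch:
    """Toy stand-in for the forward dispatch; hypothetical, not DeepSpeed code."""

    def __init__(self, module, enable_cuda_graph, local_cuda_graph=False):
        self.module = module
        self.enable_cuda_graph = enable_cuda_graph  # assumed flag name
        self.local_cuda_graph = local_cuda_graph    # assumed flag name
        self.cuda_graph_created = False
        self.events = []  # records which path each call took

    def _create_cuda_graph(self, *inputs, **kwargs):
        # Stub for capture: the real method warms up and records a CUDA graph.
        self.events.append("capture")
        self.cuda_graph_created = True

    def _graph_replay(self, *inputs, **kwargs):
        # Stub for replay: this sketch simply calls the module.
        self.events.append("replay")
        return self.module(*inputs, **kwargs)

    def forward(self, *inputs, **kwargs):
        if self.enable_cuda_graph and not self.local_cuda_graph:
            if not self.cuda_graph_created:
                # First call only: capture before replaying.
                self._create_cuda_graph(*inputs, **kwargs)
            return self._graph_replay(*inputs, **kwargs)
        # Fallback: run the wrapped module directly.
        self.events.append("direct")
        return self.module(*inputs, **kwargs)


def double(x):
    return 2 * x


graphed = ForwardDispatchSketch(double, enable_cuda_graph=True)
first = graphed.forward(3)
second = graphed.forward(4)

eager = ForwardDispatchSketch(double, enable_cuda_graph=False)
third = eager.forward(5)
```

In this sketch the first call on the graph-enabled path records both a capture and a replay, and every later call records a replay only. The eager instance never touches the capture path.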

CUDA graph creation (lines L496-513):

  1. Create a new CUDA stream and wait for the current stream.
  2. Run 3 warmup iterations on the new stream to initialize workspaces and cuBLAS handles.
  3. Synchronize streams.
  4. Create a CUDA graph object and capture the forward pass into it.
  5. Store static input/output references for replay.

CUDA graph replay (lines L515-523):

  1. Copy new input tensors into the static input buffers.
  2. Copy new keyword argument tensors into static keyword buffers.
  3. Replay the captured CUDA graph.
  4. Return the static output reference.
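
The static-buffer pattern behind capture and replay can be illustrated with a CPU-only simulation that uses Python lists in place of GPU tensors. This is a sketch under stated assumptions, not DeepSpeed's implementation: StaticBufferReplaySketch, scale_by_ten, and the shape check are hypothetical, and no real CUDA graph is created. It shows why replay copies new values into fixed buffers and why the returned output is the same object on every call.

```python
class StaticBufferReplaySketch:
    """CPU-only simulation of capture/replay with fixed buffers (hypothetical)."""

    def __init__(self, fn, example_input, warmup_iters=3):
        # Warmup runs, mirroring the three warmup iterations described above.
        for _ in range(warmup_iters):
            fn(example_input)
        # "Capture": fix the input and output buffers that every replay reuses.
        self.fn = fn
        self.static_input = list(example_input)
        self.static_output = list(fn(self.static_input))

    def replay(self, new_input):
        # Assumption for this sketch: replay inputs must match the captured size.
        if len(new_input) != len(self.static_input):
            raise ValueError("replay input must match the captured shape")
        # Copy new values into the static input buffer in place.
        self.static_input[:] = new_input
        # Re-run the recorded work, overwriting the static output buffer in place.
        self.static_output[:] = self.fn(self.static_input)
        # Return the static output reference, not a fresh copy.
        return self.static_output


def scale_by_ten(values):
    return [10 * v for v in values]


sketch = StaticBufferReplaySketch(scale_by_ten, [1, 2, 3])
out_a = sketch.replay([4, 5, 6])
snapshot = list(out_a)            # copy taken before the next replay
out_b = sketch.replay([7, 8, 9])  # overwrites the buffer out_a points to
```

Because replay hands back the same buffer each time, out_a and out_b refer to one object in this sketch, and only snapshot preserves the earlier result. A caller that needs to keep outputs across calls would likely want to copy them before the next replay.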

Generation (lines L585-608):

  1. Reset KV-cache if the model supports it.
  2. Check num_beams and raise NotImplementedError if greater than 1.
  3. Validate that input token lengths do not exceed max_out_tokens.
  4. Delegate to self.module.generate(*inputs, **kwargs).
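
The two pre-delegation checks (steps 2 and 3) can be sketched as a standalone function. The helper name validate_generate_call, its input_lengths parameter, the error messages, and the exception type used for the length check are assumptions for illustration. Only the NotImplementedError for num_beams greater than 1 and the max_out_tokens limit come from the description above.

```python
def validate_generate_call(input_lengths, max_out_tokens, **kwargs):
    """Hypothetical helper mirroring the checks made before delegation."""
    # Step 2: beam search is rejected when num_beams is greater than 1.
    num_beams = kwargs.get("num_beams", 1)
    if num_beams > 1:
        raise NotImplementedError(
            f"num_beams={num_beams} is not supported; use num_beams=1"
        )
    # Step 3: each input sequence must fit within max_out_tokens.
    # The exception type here is an assumption, not confirmed by the text above.
    for length in input_lengths:
        if length > max_out_tokens:
            raise RuntimeError(
                f"input length {length} exceeds max_out_tokens={max_out_tokens}"
            )
    return True


# Two prompts of 5 and 12 tokens pass both checks.
ok = validate_generate_call([5, 12], max_out_tokens=1024, max_new_tokens=100)
```

A check like this fails fast, before any tokens are generated, which is presumably why the validation sits ahead of the call to self.module.generate().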

Code Reference

  • Repository: https://github.com/deepspeedai/DeepSpeed
  • File: deepspeed/inference/engine.py
  • Lines: L556-583 (forward), L585-608 (_generate), L496-523 (CUDA graph helpers)
  • Signatures:
    • def forward(self, *inputs, **kwargs) -> Any
    • def _generate(self, *inputs, **kwargs) -> torch.Tensor
  • Import: Accessed via the InferenceEngine returned by deepspeed.init_inference()

Parameters

| Method | Parameter | Type | Required | Description |
|---|---|---|---|---|
| forward | *inputs | Variable positional args | No | Model input tensors passed positionally (e.g., input_ids, attention_mask); may be omitted when inputs are passed as keywords |
| forward | **kwargs | Variable keyword args | No | Model inputs or additional model arguments passed by keyword (e.g., input_ids=tokens) |
| _generate | *inputs | Variable positional args | No | Positional inputs for generation |
| _generate | **kwargs | Variable keyword args | No | Generation parameters (e.g., max_new_tokens, do_sample, temperature) |

I/O

| Direction | Name | Type | Description |
|---|---|---|---|
| Input | *inputs | Tensors | Model input tensors (input_ids, attention_mask, etc.) |
| Input | **kwargs | Keyword arguments | Additional model/generation parameters |
| Output (forward) | outputs | Model-dependent | Model logits or output tuple, depending on the return_tuple config |
| Output (generate) | generated | torch.Tensor | Generated token sequences |

Usage Example

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Setup
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16
)
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True
)

# Forward pass (returns logits)
tokens = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = engine(**tokens)
logits = outputs.logits

# Text generation
generated = engine.generate(
    input_ids=tokens["input_ids"],
    max_new_tokens=100,
    do_sample=False
)
text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)

Relationships

Principle:Deepspeedai_DeepSpeed_Inference_Execution

Metadata

  • Workflow: Inference_Engine_Optimization
  • Type: Implementation
  • Last Updated: 2026-02-09 00:00 GMT
