
Implementation:Deepspeedai DeepSpeed Init Inference

From Leeroopedia


Overview

A concrete entry point in the DeepSpeed library for creating an inference engine with optimized kernels and tensor parallelism.

Implementation Type

Function (top-level API entry point)

Detailed Description

deepspeed.init_inference() is the main entry point for DeepSpeed inference optimization. It accepts a model and configuration (as dict, JSON path, or kwargs), creates a DeepSpeedInferenceConfig, then constructs an InferenceEngine.

The function supports four usage patterns:

  1. No config, no kwargs: Uses default DeepSpeedInferenceConfig().
  2. Config dict or JSON path only: Uses the provided configuration.
  3. Kwargs only: Builds config from keyword arguments.
  4. Config and kwargs: Merges both; raises ValueError on conflicting keys.
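The four patterns hinge on one merge rule. The snippet below is a minimal, self-contained sketch of that rule in plain Python; `merge_config` is an illustrative name, not DeepSpeed's internal function:

```python
def merge_config(config, kwargs):
    """Merge an explicit config dict with keyword overrides.

    Mirrors the documented behavior of deepspeed.init_inference():
    a key appearing in both sources raises ValueError instead of
    silently preferring one side.
    """
    config = dict(config or {})       # patterns 1 and 3: empty dict if no config given
    conflicts = config.keys() & kwargs.keys()
    if conflicts:                     # pattern 4: overlapping keys are an error
        raise ValueError(f"Conflicting config keys: {sorted(conflicts)}")
    config.update(kwargs)             # combine the two sources
    return config

# Patterns 1-3 succeed; pattern 4 with a duplicate key raises.
merge_config(None, {})                                    # -> {}
merge_config({"dtype": "fp16"}, {"max_out_tokens": 2048}) # -> merged dict
```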

Internally, InferenceEngine.__init__() (lines 45-185 of engine.py) performs the following steps:

  • Stores the model reference and config
  • Patches generate() if the model has one
  • Validates the dtype against accelerator capabilities
  • Converts the model to the target dtype
  • Creates tensor parallelism groups if tp_size > 1
  • Creates expert parallelism groups if MoE layers are detected
  • Applies injection policy through one of three modes:
    • User-specified injection policy (injection_dict)
    • DeepSpeed kernel injection (replace_with_kernel_inject=True)
    • Automatic tensor parallelism via AutoTP.tp_parser()
  • Moves model to the current CUDA device
  • Broadcasts RNG state across TP ranks for determinism
  • Validates CUDA graph compatibility
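The three injection modes above are mutually exclusive. A minimal sketch of the selection order, assuming the precedence listed (user policy first, then kernel injection, then AutoTP); `select_injection_mode` is an illustrative name, not a DeepSpeed API:

```python
def select_injection_mode(injection_dict=None, replace_with_kernel_inject=False):
    """Pick one of the three injection paths described above.

    Order matters: a user-supplied policy wins, kernel injection
    comes next, and automatic tensor parallelism is the fallback.
    """
    if injection_dict is not None:
        return "user_policy"        # user-specified injection_policy
    if replace_with_kernel_inject:
        return "kernel_inject"      # fused CUDA kernel replacement
    return "auto_tp"                # AutoTP.tp_parser() discovers layers

select_injection_mode()  # -> "auto_tp"
```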

Code Reference

  • Repository: https://github.com/deepspeedai/DeepSpeed
  • File: deepspeed/__init__.py (L313-388) and deepspeed/inference/engine.py (L45-185)
  • Signature: def init_inference(model: torch.nn.Module, config=None, **kwargs) -> InferenceEngine
  • Import: import deepspeed

Parameters

  • model (torch.nn.Module, required): The pretrained PyTorch model to optimize for inference
  • config (Union[str, Dict, None], default None): Path to a JSON config file, or a config dictionary
  • tensor_parallel (Dict, default {}): Tensor parallelism config (e.g., {"tp_size": 4})
  • dtype (torch.dtype, default torch.float16): Target data type for inference
  • replace_with_kernel_inject (bool, default False): Enable fused CUDA kernel injection
  • enable_cuda_graph (bool, default False): Enable CUDA graph capture and replay
  • injection_policy (Dict, default None): Custom injection policy mapping layer classes to projection names
  • max_out_tokens (int, default 1024): Maximum sequence length (input + output tokens)
  • checkpoint (Union[str, Dict], default None): Path to a DeepSpeed checkpoint for weight loading

I/O

  • Input: model (torch.nn.Module), the pretrained PyTorch model (e.g., from HuggingFace)
  • Input: config (Union[str, Dict, None]), an optional configuration dictionary or JSON path
  • Input: **kwargs (keyword arguments), configuration values that override defaults
  • Output: engine (InferenceEngine), a DeepSpeed InferenceEngine wrapping the optimized model

Usage Example

import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Load the pretrained model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16
)

# Case 1: Basic kernel injection
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True
)

# Case 2: Tensor parallelism across 4 GPUs
# (launch with the DeepSpeed launcher, e.g. `deepspeed --num_gpus 4 script.py`,
# so that the distributed ranks exist)
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 4},
    dtype=torch.float16,
    replace_with_kernel_inject=True
)

# Case 3: Using a config dictionary
config = {
    "dtype": "fp16",
    "tensor_parallel": {"tp_size": 2},
    "enable_cuda_graph": False,
    "max_out_tokens": 2048
}
engine = deepspeed.init_inference(model, config=config)
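Because `config` also accepts a path to a JSON file, the Case 3 dictionary can be written to disk and passed by filename. A stdlib-only sketch (the filename and temp directory are illustrative); in a real script you would then call `deepspeed.init_inference(model, config=path)`:

```python
import json
import os
import tempfile

config = {
    "dtype": "fp16",
    "tensor_parallel": {"tp_size": 2},
    "enable_cuda_graph": False,
    "max_out_tokens": 2048,
}

# Write the config where init_inference can read it, then pass the path:
#   engine = deepspeed.init_inference(model, config=path)
path = os.path.join(tempfile.mkdtemp(), "ds_inference_config.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

# Round-trip check: the file contains exactly the settings above.
with open(path) as f:
    loaded = json.load(f)
```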

Knowledge Sources

Relationships

Principle:Deepspeedai_DeepSpeed_Inference_Engine_Init

Metadata

  • Workflow: Inference_Engine_Optimization
  • Type: Implementation
  • Last Updated: 2026-02-09 00:00 GMT
