Implementation:Deepspeedai DeepSpeed Init Inference
Overview
Concrete tool for creating a DeepSpeed inference engine with optimized kernels and tensor parallelism provided by the DeepSpeed library.
Implementation Type
Function (top-level API entry point)
Detailed Description
`deepspeed.init_inference()` is the main entry point for DeepSpeed inference optimization. It accepts a model and a configuration (as a dict, a JSON file path, or keyword arguments), creates a `DeepSpeedInferenceConfig`, then constructs an `InferenceEngine`.
The function supports four usage patterns:
- No config, no kwargs: uses the default `DeepSpeedInferenceConfig()`.
- Config dict or JSON path only: uses the provided configuration.
- Kwargs only: builds the config from keyword arguments.
- Config and kwargs: merges both; raises `ValueError` on conflicting keys.
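The merge-and-conflict behavior of the last pattern can be sketched in plain Python. This is a hypothetical reimplementation for illustration, not DeepSpeed's actual code:

```python
def merge_config(config: dict, kwargs: dict) -> dict:
    """Hypothetical sketch of merging a config dict with kwargs:
    overlapping keys are rejected rather than silently overridden."""
    overlap = set(config) & set(kwargs)
    if overlap:
        raise ValueError(f"Conflicting config keys: {sorted(overlap)}")
    return {**config, **kwargs}

merged = merge_config({"dtype": "fp16"}, {"replace_with_kernel_inject": True})
# merged == {"dtype": "fp16", "replace_with_kernel_inject": True}
```

Rejecting overlaps instead of letting one side win makes configuration conflicts fail loudly at initialization rather than surfacing as silent behavior differences later.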
Internally, `InferenceEngine.__init__()` (lines 45-185 of `engine.py`) performs the following steps:
- Stores the model reference and config
- Patches `generate()` if the model has one
- Validates the dtype against accelerator capabilities
- Converts the model to the target dtype
- Creates tensor parallelism groups if `tp_size > 1`
- Creates expert parallelism groups if MoE layers are detected
- Applies injection policy through one of three modes:
  - User-specified injection policy (`injection_dict`)
  - DeepSpeed kernel injection (`replace_with_kernel_inject=True`)
  - Automatic tensor parallelism via `AutoTP.tp_parser()`
- Moves the model to the current CUDA device
- Broadcasts RNG state across TP ranks for determinism
- Validates CUDA graph compatibility
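The three injection modes above are mutually exclusive per layer; the selection order can be sketched as follows. This is a simplified, hypothetical dispatcher, not the engine's actual control flow:

```python
def select_injection_mode(injection_dict=None, replace_with_kernel_inject=False):
    """Hypothetical sketch of choosing an injection mode;
    the real checks in engine.py are more involved."""
    if injection_dict is not None:
        return "user_policy"      # user-specified injection_policy
    if replace_with_kernel_inject:
        return "kernel_inject"    # fused CUDA kernel replacement
    return "auto_tp"              # automatic tensor parallelism (AutoTP)
```

The ordering reflects precedence: an explicit user policy wins over kernel injection, and automatic tensor parallelism is the fallback when neither is requested.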
Code Reference
- Repository: https://github.com/deepspeedai/DeepSpeed
- Files: `deepspeed/__init__.py` (L313-388) and `deepspeed/inference/engine.py` (L45-185)
- Signature: `def init_inference(model: torch.nn.Module, config=None, **kwargs) -> InferenceEngine`
- Import: `import deepspeed`
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | torch.nn.Module | Yes | — | The pretrained PyTorch model to optimize for inference |
| config | Union[str, Dict, None] | No | None | Path to JSON config file or config dictionary |
| tensor_parallel | Dict | No | {} | Tensor parallelism config (e.g., {"tp_size": 4}) |
| dtype | torch.dtype | No | torch.float16 | Target data type for inference |
| replace_with_kernel_inject | bool | No | False | Enable fused CUDA kernel injection |
| enable_cuda_graph | bool | No | False | Enable CUDA graph capture and replay |
| injection_policy | Dict | No | None | Custom injection policy mapping layer classes to projection names |
| max_out_tokens | int | No | 1024 | Maximum sequence length (input + output tokens) |
| checkpoint | str / Dict | No | None | Path to DeepSpeed checkpoint for weight loading |
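The defaults in the table above can be mirrored in a small sketch. This is a hypothetical stand-in, not the actual `DeepSpeedInferenceConfig`; `dtype` is shown as a string to avoid a torch dependency:

```python
from dataclasses import dataclass, field
from typing import Optional, Union

@dataclass
class InferenceConfigSketch:
    """Hypothetical mirror of the parameter defaults in the table above."""
    dtype: str = "fp16"  # stands in for torch.float16
    tensor_parallel: dict = field(default_factory=dict)
    replace_with_kernel_inject: bool = False
    enable_cuda_graph: bool = False
    injection_policy: Optional[dict] = None
    max_out_tokens: int = 1024
    checkpoint: Union[str, dict, None] = None

# Only overrides need to be passed; everything else keeps its default.
cfg = InferenceConfigSketch(tensor_parallel={"tp_size": 4})
```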
I/O
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | model | torch.nn.Module | Pretrained PyTorch model (e.g., from HuggingFace) |
| Input | config | Union[str, Dict, None] | Optional configuration dictionary or JSON path |
| Input | **kwargs | keyword arguments | Override configuration values |
| Output | engine | InferenceEngine | DeepSpeed InferenceEngine wrapping the optimized model |
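The returned engine acts as a drop-in wrapper around the original model. The delegation idea can be sketched in pure Python; this is a hypothetical stand-in, and the real `InferenceEngine` is a `torch.nn.Module` that exposes the wrapped model as `engine.module`:

```python
class EngineSketch:
    """Hypothetical stand-in for InferenceEngine: holds the (optimized)
    model as .module and forwards calls to it."""
    def __init__(self, module):
        self.module = module

    def __call__(self, *args, **kwargs):
        # Calling the engine behaves like calling the wrapped model.
        return self.module(*args, **kwargs)

engine = EngineSketch(lambda x: 2 * x)  # any callable stands in for a model
result = engine(21)
```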
Usage Example
```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Load the pretrained model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
)

# Case 1: Basic kernel injection
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# Case 2: Tensor parallelism across 4 GPUs
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 4},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# Case 3: Using a config dictionary
config = {
    "dtype": "fp16",
    "tensor_parallel": {"tp_size": 2},
    "enable_cuda_graph": False,
    "max_out_tokens": 2048,
}
engine = deepspeed.init_inference(model, config=config)
```
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/inference-tutorial/
- https://www.deepspeed.ai/inference/
- https://arxiv.org/abs/2207.00032
Relationships
Principle:Deepspeedai_DeepSpeed_Inference_Engine_Init
Metadata
- Workflow: Inference_Engine_Optimization
- Type: Implementation
- Last Updated: 2026-02-09 00:00 GMT