Implementation:Deepspeedai DeepSpeed Init Inference
Overview
Concrete tool for creating a DeepSpeed inference engine with optimized kernels and tensor parallelism provided by the DeepSpeed library.
Implementation Type
Function (top-level API entry point)
Detailed Description
`deepspeed.init_inference()` is the main entry point for DeepSpeed inference optimization. It accepts a model and a configuration (as a dict, a JSON file path, or keyword arguments), creates a `DeepSpeedInferenceConfig`, then constructs an `InferenceEngine`.
The function supports four usage patterns:
- No config, no kwargs: uses the default `DeepSpeedInferenceConfig()`.
- Config dict or JSON path only: uses the provided configuration.
- Kwargs only: builds the config from keyword arguments.
- Config and kwargs: merges both; raises `ValueError` on conflicting keys.
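The merge-and-conflict behavior of the last pattern can be sketched in plain Python. This is a hypothetical reimplementation for illustration, not DeepSpeed's actual code:

```python
def merge_config(config: dict, kwargs: dict) -> dict:
    """Hypothetical sketch of merging a config dict with kwargs:
    overlapping keys are rejected rather than silently overridden."""
    overlap = set(config) & set(kwargs)
    if overlap:
        raise ValueError(f"Conflicting config keys: {sorted(overlap)}")
    return {**config, **kwargs}

merged = merge_config({"dtype": "fp16"}, {"replace_with_kernel_inject": True})
# merged == {"dtype": "fp16", "replace_with_kernel_inject": True}
```

Rejecting overlaps instead of letting one side win makes configuration conflicts fail loudly at initialization rather than surfacing as silent behavior differences later.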
Internally, `InferenceEngine.__init__()` (lines 45-185 of `engine.py`) performs the following steps:
- Stores the model reference and config
- Patches `generate()` if the model has one
- Validates the dtype against accelerator capabilities
- Converts the model to the target dtype
- Creates tensor parallelism groups if `tp_size > 1`
- Creates expert parallelism groups if MoE layers are detected
- Applies injection policy through one of three modes:
  - User-specified injection policy (`injection_dict`)
  - DeepSpeed kernel injection (`replace_with_kernel_inject=True`)
  - Automatic tensor parallelism via `AutoTP.tp_parser()`
- Moves the model to the current CUDA device
- Broadcasts RNG state across TP ranks for determinism
- Validates CUDA graph compatibility
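The three injection modes above are mutually exclusive per layer; the selection order can be sketched as follows. This is a simplified, hypothetical dispatcher, not the engine's actual control flow:

```python
def select_injection_mode(injection_dict=None, replace_with_kernel_inject=False):
    """Hypothetical sketch of choosing an injection mode;
    the real checks in engine.py are more involved."""
    if injection_dict is not None:
        return "user_policy"      # user-specified injection_policy
    if replace_with_kernel_inject:
        return "kernel_inject"    # fused CUDA kernel replacement
    return "auto_tp"              # automatic tensor parallelism (AutoTP)
```

The ordering reflects precedence: an explicit user policy wins over kernel injection, and automatic tensor parallelism is the fallback when neither is requested.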
Code Reference
- Repository: https://github.com/deepspeedai/DeepSpeed
- Files: `deepspeed/__init__.py` (L313-388) and `deepspeed/inference/engine.py` (L45-185)
- Signature: `def init_inference(model: torch.nn.Module, config=None, **kwargs) -> InferenceEngine`
- Import: `import deepspeed`
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | torch.nn.Module | Yes | — | The pretrained PyTorch model to optimize for inference |
| config | Union[str, Dict, None] | No | None | Path to JSON config file or config dictionary |
| tensor_parallel | Dict | No | {} | Tensor parallelism config (e.g., {"tp_size": 4}) |
| dtype | torch.dtype | No | torch.float16 | Target data type for inference |
| replace_with_kernel_inject | bool | No | False | Enable fused CUDA kernel injection |
| enable_cuda_graph | bool | No | False | Enable CUDA graph capture and replay |
| injection_policy | Dict | No | None | Custom injection policy mapping layer classes to projection names |
| max_out_tokens | int | No | 1024 | Maximum sequence length (input + output tokens) |
| checkpoint | str / Dict | No | None | Path to DeepSpeed checkpoint for weight loading |
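The defaults in the table above can be mirrored in a small sketch. This is a hypothetical stand-in, not the actual `DeepSpeedInferenceConfig`; `dtype` is shown as a string to avoid a torch dependency:

```python
from dataclasses import dataclass, field
from typing import Optional, Union

@dataclass
class InferenceConfigSketch:
    """Hypothetical mirror of the parameter defaults in the table above."""
    dtype: str = "fp16"  # stands in for torch.float16
    tensor_parallel: dict = field(default_factory=dict)
    replace_with_kernel_inject: bool = False
    enable_cuda_graph: bool = False
    injection_policy: Optional[dict] = None
    max_out_tokens: int = 1024
    checkpoint: Union[str, dict, None] = None

# Only overrides need to be passed; everything else keeps its default.
cfg = InferenceConfigSketch(tensor_parallel={"tp_size": 4})
```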
I/O
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | model | torch.nn.Module | Pretrained PyTorch model (e.g., from HuggingFace) |
| Input | config | Union[str, Dict, None] | Optional configuration dictionary or JSON path |
| Input | **kwargs | keyword arguments | Override configuration values |
| Output | engine | InferenceEngine | DeepSpeed InferenceEngine wrapping the optimized model |
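The returned engine acts as a drop-in wrapper around the original model. The delegation idea can be sketched in pure Python; this is a hypothetical stand-in, and the real `InferenceEngine` is a `torch.nn.Module` that exposes the wrapped model as `engine.module`:

```python
class EngineSketch:
    """Hypothetical stand-in for InferenceEngine: holds the (optimized)
    model as .module and forwards calls to it."""
    def __init__(self, module):
        self.module = module

    def __call__(self, *args, **kwargs):
        # Calling the engine behaves like calling the wrapped model.
        return self.module(*args, **kwargs)

engine = EngineSketch(lambda x: 2 * x)  # any callable stands in for a model
result = engine(21)
```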
Usage Example
```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Load the pretrained model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
)

# Case 1: Basic kernel injection
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# Case 2: Tensor parallelism across 4 GPUs
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 4},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# Case 3: Using a config dictionary
config = {
    "dtype": "fp16",
    "tensor_parallel": {"tp_size": 2},
    "enable_cuda_graph": False,
    "max_out_tokens": 2048,
}
engine = deepspeed.init_inference(model, config=config)
```
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/inference-tutorial/
- https://www.deepspeed.ai/inference/
- https://arxiv.org/abs/2207.00032
Relationships
Principle:Deepspeedai_DeepSpeed_Inference_Engine_Init
Metadata
- Workflow: Inference_Engine_Optimization
- Type: Implementation
- Last Updated: 2026-02-09 00:00 GMT