Implementation:Deepspeedai DeepSpeed DeepSpeedInferenceConfig Init

Overview

Concrete tool, provided by the DeepSpeed library, for configuring inference optimization parameters.

Implementation Type

Configuration Model (Pydantic-based parameter validation and storage)

Detailed Description

DeepSpeedInferenceConfig is a Pydantic model that validates and stores all inference optimization settings. It inherits from DeepSpeedConfigModel and uses Pydantic's Field with aliases for backward compatibility. The config is constructed either directly by the user or internally by deepspeed.init_inference() from a dictionary or keyword arguments.

The class defines the parameters that control inference behavior. The most commonly used ones are:

replace_with_kernel_inject (alias: kernel_inject): Enables fused CUDA kernel injection for supported architectures (BERT, GPT-2, GPT-Neo, GPT-J, and others).
dtype: Target data type for model conversion. Validated through DtypeEnum which accepts both torch.dtype objects and string representations.
tensor_parallel (alias: tp): Nested DeepSpeedTPConfig controlling TP size, grain size, and group objects.
enable_cuda_graph: Captures a CUDA graph on the first forward pass and replays it on subsequent calls.
injection_policy (alias: injection_dict): Maps model layer classes to injection policies for custom architectures.
max_out_tokens (alias: max_tokens): Maximum sequence length (input + output) the inference engine can handle.
quant: Nested QuantizationConfig for int8 quantization settings.
keep_module_on_host: Keeps checkpoints on CPU to avoid OOM during loading of very large models.
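
The aliases listed above can be pictured as a key-renaming step applied before validation. The following is a minimal, dependency-free sketch of that idea. The `ALIASES` table and the `resolve_aliases` helper are hypothetical names used only for illustration; DeepSpeed itself handles aliases through Pydantic's Field, not through a helper like this.

```python
# Hypothetical illustration: alias -> canonical field name, taken from the list above.
ALIASES = {
    "kernel_inject": "replace_with_kernel_inject",
    "tp": "tensor_parallel",
    "injection_dict": "injection_policy",
    "max_tokens": "max_out_tokens",
}


def resolve_aliases(user_config: dict) -> dict:
    """Return a copy of user_config with documented aliases renamed to canonical names."""
    resolved = {}
    for key, value in user_config.items():
        canonical = ALIASES.get(key, key)
        if canonical in resolved:
            # Both the alias and the canonical name were supplied; refuse to guess.
            raise ValueError(f"'{canonical}' was specified more than once")
        resolved[canonical] = value
    return resolved


legacy = {"kernel_inject": True, "tp": {"tp_size": 2}, "max_tokens": 4096}
print(resolve_aliases(legacy))
```

Rejecting a config that supplies both an alias and its canonical name is a design choice of this sketch; it avoids silently picking one of two conflicting values.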

The class includes field validators for type coercion (e.g., string-to-dtype conversion via DtypeEnum), backward compatibility (e.g., boolean MoE config), and dependency checking (e.g., Triton availability).
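
To make the type-coercion step concrete, here is a small sketch of string-to-dtype normalization in the spirit of DtypeEnum. It is an assumption-laden stand-in: `DtypeName` and `coerce_dtype` are hypothetical names, the accepted spellings are illustrative, and canonical dtype names are plain strings so the example runs without torch.

```python
from enum import Enum


class DtypeName(Enum):
    """Hypothetical stand-in for DtypeEnum: canonical name first, then accepted spellings."""

    fp16 = ("float16", "torch.float16", "fp16", "half")
    fp32 = ("float32", "torch.float32", "fp32", "float")
    bf16 = ("bfloat16", "torch.bfloat16", "bf16")
    int8 = ("int8", "torch.int8")

    @classmethod
    def from_str(cls, text: str) -> "DtypeName":
        # Scan every member and match against any of its accepted spellings.
        for member in cls:
            if text in member.value:
                return member
        raise ValueError(f"'{text}' is not a recognized dtype spelling")


def coerce_dtype(value) -> str:
    """Mimic a 'before' validator: strings are normalized, other types are rejected."""
    if isinstance(value, str):
        return DtypeName.from_str(value).value[0]
    raise TypeError(f"Invalid type for dtype: {type(value)}")


print(coerce_dtype("half"))            # float16
print(coerce_dtype("torch.bfloat16"))  # bfloat16
```

In the real class the validator would also pass torch.dtype objects through unchanged; that branch is omitted here only to keep the sketch free of the torch dependency.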

Code Reference

Repository: https://github.com/deepspeedai/DeepSpeed
File: deepspeed/inference/config.py
Lines: L118-324
Import: from deepspeed.inference.config import DeepSpeedInferenceConfig

Parameters

Parameter	Type	Required	Default	Description
replace_with_kernel_inject	bool	No	False	Inject optimized CUDA kernels for supported model architectures
dtype	torch.dtype	No	torch.float16	Target data type for model weights and computation
tensor_parallel	DeepSpeedTPConfig / dict	No	{} (tp_size=1)	Tensor parallelism configuration (tp_size, tp_grain_size, mpu, tp_group)
enable_cuda_graph	bool	No	False	Capture CUDA graph for replay-based forward pass execution
injection_policy	Optional[Dict]	No	None	Mapping of model layer classes to injection policies
max_out_tokens	int	No	1024	Maximum total tokens (input + output) for inference
min_out_tokens	int	No	1	Minimum number of output tokens the caller expects to generate; the runtime raises an error if it cannot provide that many
quant	QuantizationConfig	No	{}	Quantization settings for int8 inference
keep_module_on_host	bool	No	False	Keep checkpoint data on host CPU to avoid device OOM
checkpoint	Optional[Union[str, Dict]]	No	None	Path to DeepSpeed-compatible checkpoint or load policy JSON
return_tuple	bool	No	True	Whether transformer layers return tuples or tensors
triangular_masking	bool	No	True	Use triangular (causal) attention masking
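
Because max_out_tokens counts input and output tokens together, it can help to check a request against the budget before calling the engine. The helper below is a hypothetical pre-check written for illustration, not something DeepSpeed provides. It assumes the default of 1024 from the table above.

```python
def check_token_budget(prompt_tokens: int, new_tokens: int, max_out_tokens: int = 1024) -> int:
    """Return the remaining headroom, or raise if prompt + generation exceeds the budget."""
    if prompt_tokens < 0 or new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = prompt_tokens + new_tokens
    if total > max_out_tokens:
        raise ValueError(
            f"{prompt_tokens} prompt + {new_tokens} new tokens = {total}, "
            f"which exceeds max_out_tokens={max_out_tokens}"
        )
    return max_out_tokens - total


# A 700-token prompt plus 300 generated tokens fits in the default budget.
print(check_token_budget(700, 300))  # 24
```

A request that does not fit needs either a shorter prompt, fewer generated tokens, or a config built with a larger max_out_tokens.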

I/O

Direction	Name	Type	Description
Input	**kwargs	Configuration keyword arguments	Individual configuration parameters or aliases
Output	config	DeepSpeedInferenceConfig	Validated Pydantic model instance with all inference settings

Usage Example

from deepspeed.inference.config import DeepSpeedInferenceConfig
import torch

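# Create inference configuration with explicit parameters.
# Note: this only validates the settings; combining enable_cuda_graph=True with
# tp_size > 1 may be rejected later, when the inference engine is built.
config = DeepSpeedInferenceConfig(
    replace_with_kernel_inject=True,
    dtype=torch.float16,
    tensor_parallel={"tp_size": 4},
    enable_cuda_graph=True,
    max_out_tokens=2048,
)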

# Access validated fields
print(config.dtype)                        # torch.float16
print(config.tensor_parallel.tp_size)      # 4
print(config.replace_with_kernel_inject)   # True

# Using aliases for backward compatibility
config_compat = DeepSpeedInferenceConfig(
    kernel_inject=True,
    tp={"tp_size": 2},
    max_tokens=4096
)
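
The same settings can also be supplied as a plain dictionary and handed to deepspeed.init_inference(), which builds the config internally. The sketch below shows one way to arrange that. `build_inference_kwargs` and `build_engine` are hypothetical helper names, the "fp16" string is assumed to be one of the spellings DtypeEnum accepts, and DeepSpeed is imported lazily so the module loads on machines where it is not installed.

```python
def build_inference_kwargs(tp_size: int = 1, use_cuda_graph: bool = False) -> dict:
    """Assemble a config dictionary using the canonical field names from the Parameters table."""
    return {
        "replace_with_kernel_inject": True,
        "dtype": "fp16",  # assumed string spelling; a torch.dtype object also works
        "tensor_parallel": {"tp_size": tp_size},
        "enable_cuda_graph": use_cuda_graph,
        "max_out_tokens": 2048,
    }


def build_engine(model, **overrides):
    """Wrap `model` with deepspeed.init_inference using the dictionary built above."""
    import deepspeed  # lazy import: needs DeepSpeed and a supported accelerator at call time

    config = build_inference_kwargs(**overrides)
    return deepspeed.init_inference(model, config=config)


# Inspect the dictionary without touching DeepSpeed or a GPU.
print(build_inference_kwargs(tp_size=2))
```

Keeping the dictionary builder separate from the engine call means the settings can be logged or unit-tested without a model or accelerator present.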

Relationships

Principle:Deepspeedai_DeepSpeed_Inference_Configuration

Metadata

Workflow: Inference_Engine_Optimization
Type: Implementation
Last Updated: 2026-02-09 00:00 GMT
