Implementation:Deepspeedai DeepSpeed DeepSpeedInferenceConfig Init
Overview
A concrete tool, provided by the DeepSpeed library, for configuring inference optimization parameters.
Implementation Type
Configuration Model (Pydantic-based parameter validation and storage)
Detailed Description
DeepSpeedInferenceConfig is a Pydantic model that validates and stores all inference optimization settings. It inherits from DeepSpeedConfigModel and uses Pydantic's Field with aliases for backward compatibility. The config is constructed either directly by the user or internally by deepspeed.init_inference() from a dictionary or keyword arguments.
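The two construction paths (a dictionary and keyword arguments) can be pictured as a merge into one settings dictionary. The following is a minimal standard-library sketch, not DeepSpeed's code: `merge_inference_settings` is a hypothetical helper, and both the JSON-path handling and the conflict check are assumptions made for illustration.

```python
import json
from typing import Any, Dict, Optional, Union


def merge_inference_settings(config: Optional[Union[str, Dict[str, Any]]] = None,
                             **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical helper: combine a config dict (or JSON file path) with kwargs."""
    if config is None:
        settings: Dict[str, Any] = {}
    elif isinstance(config, str):
        # Assumption for this sketch: a string is a path to a JSON file.
        with open(config, "r") as f:
            settings = json.load(f)
    elif isinstance(config, dict):
        settings = dict(config)  # copy so the caller's dict is not mutated
    else:
        raise ValueError(f"config must be None, a path, or a dict, got {type(config)}")

    # Assumption for this sketch: the same key given in both places
    # with different values is treated as an error.
    for key in set(settings) & set(kwargs):
        if settings[key] != kwargs[key]:
            raise ValueError(
                f"Conflicting values for '{key}': {settings[key]!r} vs {kwargs[key]!r}")

    settings.update(kwargs)
    return settings


merged = merge_inference_settings({"max_out_tokens": 2048}, enable_cuda_graph=True)
print(merged)
```

With DeepSpeed installed, a dictionary like `merged` could then be unpacked into the model as `DeepSpeedInferenceConfig(**merged)`, which is where validation would take place.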
The class defines the complete set of parameters controlling inference behavior:
- replace_with_kernel_inject (alias: kernel_inject): Enables fused CUDA kernel injection for supported architectures (BERT, GPT-2, GPT-Neo, GPT-J, and others).
- dtype: Target data type for model conversion. Validated through DtypeEnum, which accepts both torch.dtype objects and string representations.
- tensor_parallel (alias: tp): Nested DeepSpeedTPConfig controlling TP size, grain size, and group objects.
- enable_cuda_graph: Captures a CUDA graph on the first forward pass for replay on subsequent calls.
- injection_policy (alias: injection_dict): Maps model layer classes to injection policies for custom architectures.
- max_out_tokens (alias: max_tokens): Maximum sequence length (input + output) the inference engine can handle.
- quant: Nested QuantizationConfig for int8 quantization settings.
- keep_module_on_host: Keeps checkpoints on CPU to avoid OOM during loading of very large models.
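The alias pairs in the list above amount to a renaming step from legacy keys to canonical field names. In the real class this is handled by Pydantic's Field aliases. The sketch below only mimics the effect with a plain dictionary: `ALIASES` and `canonicalize` are hypothetical names, and rejecting a field supplied under two names is an assumption of the sketch.

```python
from typing import Any, Dict

# Alias -> canonical field name, as listed in the parameter descriptions above.
ALIASES: Dict[str, str] = {
    "kernel_inject": "replace_with_kernel_inject",
    "tp": "tensor_parallel",
    "injection_dict": "injection_policy",
    "max_tokens": "max_out_tokens",
}


def canonicalize(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: rename legacy alias keys to canonical field names."""
    result: Dict[str, Any] = {}
    for key, value in settings.items():
        name = ALIASES.get(key, key)  # unknown keys pass through unchanged
        if name in result:
            # Assumption: supplying a field under both names is an error.
            raise ValueError(f"'{name}' was supplied more than once")
        result[name] = value
    return result


legacy = {"kernel_inject": True, "tp": {"tp_size": 2}, "max_tokens": 4096}
canonical = canonicalize(legacy)
print(canonical)
```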
The class includes field validators for type coercion (e.g., string-to-dtype conversion via DtypeEnum), backward compatibility (e.g., boolean MoE config), and dependency checking (e.g., Triton availability).
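The string-to-dtype coercion mentioned above can be sketched with a standard-library Enum whose members list the spellings they accept. This is a simplified stand-in rather than DeepSpeed's DtypeEnum: the class name `DtypeSketch`, the method `from_any`, and the exact set of accepted spellings are assumptions for illustration, and no torch import is needed to run it.

```python
from enum import Enum


class DtypeSketch(Enum):
    """Stand-in for a dtype enum; each member's value is the tuple of accepted spellings."""
    FP16 = ("torch.float16", "fp16", "float16", "half")
    FP32 = ("torch.float32", "fp32", "float32", "float")
    INT8 = ("torch.int8", "int8")

    @classmethod
    def from_any(cls, value):
        # str() also covers real torch.dtype objects, because
        # str(torch.float16) is "torch.float16".
        text = str(value).strip().lower()
        for member in cls:
            if text in member.value:
                return member
        raise ValueError(f"Unsupported dtype: {value!r}")


print(DtypeSketch.from_any("half"))
print(DtypeSketch.from_any("torch.float32"))
```

Keeping every accepted spelling on the enum member means a new alias is a one-line change, and unknown strings fail early with a clear error instead of reaching model conversion.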
Code Reference
- Repository: https://github.com/deepspeedai/DeepSpeed
- File: deepspeed/inference/config.py
- Lines: L118-324
- Import: from deepspeed.inference.config import DeepSpeedInferenceConfig
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| replace_with_kernel_inject | bool | No | False | Inject optimized CUDA kernels for supported model architectures |
| dtype | torch.dtype | No | torch.float16 | Target data type for model weights and computation |
| tensor_parallel | DeepSpeedTPConfig / dict | No | {} (tp_size=1) | Tensor parallelism configuration (tp_size, tp_grain_size, mpu, tp_group) |
| enable_cuda_graph | bool | No | False | Capture CUDA graph for replay-based forward pass execution |
| injection_policy | Optional[Dict] | No | None | Mapping of model layer classes to injection policies |
| max_out_tokens | int | No | 1024 | Maximum total tokens (input + output) for inference |
| min_out_tokens | int | No | 1 | Minimum number of output tokens expected; a runtime error is raised if this cannot be satisfied |
| quant | QuantizationConfig | No | {} | Quantization settings for int8 inference |
| keep_module_on_host | bool | No | False | Keep checkpoint data on host CPU to avoid device OOM |
| checkpoint | Optional[Union[str, Dict]] | No | None | Path to DeepSpeed-compatible checkpoint or load policy JSON |
| return_tuple | bool | No | True | Whether transformer layers return tuples or tensors |
| triangular_masking | bool | No | True | Use triangular (causal) attention masking |
I/O
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | **kwargs | Configuration keyword arguments | Individual configuration parameters or aliases |
| Output | config | DeepSpeedInferenceConfig | Validated Pydantic model instance with all inference settings |
Usage Example
import torch
from deepspeed.inference.config import DeepSpeedInferenceConfig

# Create inference configuration with explicit parameters
config = DeepSpeedInferenceConfig(
    replace_with_kernel_inject=True,
    dtype=torch.float16,
    tensor_parallel={"tp_size": 4},
    enable_cuda_graph=True,
    max_out_tokens=2048,
)

# Access validated fields
print(config.dtype)  # torch.float16
print(config.tensor_parallel.tp_size)  # 4
print(config.replace_with_kernel_inject)  # True

# Using aliases for backward compatibility
config_compat = DeepSpeedInferenceConfig(
    kernel_inject=True,
    tp={"tp_size": 2},
    max_tokens=4096,
)
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/inference-tutorial/
- https://www.deepspeed.ai/inference/
Relationships
Principle:Deepspeedai_DeepSpeed_Inference_Configuration
Metadata
- Workflow: Inference_Engine_Optimization
- Type: Implementation
- Last Updated: 2026-02-09 00:00 GMT