Implementation:Deepspeedai DeepSpeed DeepSpeedInferenceConfig Init

From Leeroopedia


Overview

A concrete tool, provided by the DeepSpeed library, for configuring inference optimization parameters.

Implementation Type

Configuration Model (Pydantic-based parameter validation and storage)

Detailed Description

DeepSpeedInferenceConfig is a Pydantic model that validates and stores all inference optimization settings. It inherits from DeepSpeedConfigModel and uses Pydantic's Field with aliases for backward compatibility. The config is constructed either directly by the user or internally by deepspeed.init_inference() from a dictionary or keyword arguments.
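The "dictionary or keyword arguments" construction path can be pictured as a merge step that runs before validation. The following is a minimal stdlib-only sketch, not DeepSpeed code: `merge_config_sources` is a hypothetical helper, and the rule that conflicting values raise an error is an assumption made for illustration.

```python
from typing import Any, Dict, Optional


def merge_config_sources(config: Optional[Dict[str, Any]] = None, **kwargs: Any) -> Dict[str, Any]:
    """Combine a config dictionary and keyword arguments into one settings dict.

    Hypothetical helper for illustration only; not part of DeepSpeed.
    """
    merged = dict(config or {})
    for key, value in kwargs.items():
        # Assumption: the same key supplied twice with different values is an error.
        if key in merged and merged[key] != value:
            raise ValueError(f"Conflicting values for '{key}': {merged[key]!r} vs {value!r}")
        merged[key] = value
    return merged


settings = merge_config_sources(
    {"dtype": "fp16", "max_out_tokens": 2048},
    replace_with_kernel_inject=True,
)
print(settings)
```

The merged dictionary is what a validating model would then receive, so every setting passes through the same checks regardless of how it was supplied.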

The class defines the complete set of parameters controlling inference behavior:

  • replace_with_kernel_inject (alias: kernel_inject): Enables fused CUDA kernel injection for supported architectures (BERT, GPT-2, GPT-Neo, GPT-J, and others).
  • dtype: Target data type for model conversion. Validated through DtypeEnum, which accepts both torch.dtype objects and string representations.
  • tensor_parallel (alias: tp): Nested DeepSpeedTPConfig controlling the tensor-parallel size, grain size, and group objects.
  • enable_cuda_graph: Captures a CUDA graph on the first forward pass and replays it on subsequent calls.
  • injection_policy (alias: injection_dict): Maps model layer classes to injection policies for custom architectures.
  • max_out_tokens (alias: max_tokens): Maximum sequence length (input + output) the inference engine can handle.
  • quant: Nested QuantizationConfig holding int8 quantization settings.
  • keep_module_on_host: Keeps checkpoint data on the host CPU to avoid device out-of-memory errors while loading very large models.
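The alias pairs listed above amount to a lookup from legacy names to canonical field names. Pydantic handles this through `Field(alias=...)`; the sketch below only imitates the effect with a plain dictionary so it runs without any dependencies. `resolve_aliases` is a hypothetical helper, and rejecting a field supplied under both names is an assumption, not documented DeepSpeed behavior.

```python
from typing import Any, Dict

# Alias -> canonical field name, taken from the parameter list above.
ALIASES = {
    "kernel_inject": "replace_with_kernel_inject",
    "tp": "tensor_parallel",
    "injection_dict": "injection_policy",
    "max_tokens": "max_out_tokens",
}


def resolve_aliases(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite legacy keys to their canonical names (illustrative only)."""
    resolved: Dict[str, Any] = {}
    for key, value in raw.items():
        name = ALIASES.get(key, key)
        # Assumption: supplying a field under both its alias and its
        # canonical name is treated as an error in this sketch.
        if name in resolved:
            raise ValueError(f"'{name}' was supplied more than once")
        resolved[name] = value
    return resolved


legacy = {"kernel_inject": True, "tp": {"tp_size": 2}, "max_tokens": 4096}
normalized = resolve_aliases(legacy)
print(normalized)
```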

The class includes field validators for type coercion (e.g., string-to-dtype conversion via DtypeEnum), backward compatibility (e.g., boolean MoE config), and dependency checking (e.g., Triton availability).
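The three validator roles named above (type coercion, backward compatibility, dependency checking) can be sketched with the standard library alone. Everything here is a stand-in: `DtypeName` imitates the idea of DtypeEnum without importing torch, the accepted spellings are assumptions, the `"enabled"` key is hypothetical, and the Triton check simply looks for an importable `triton` module.

```python
import importlib.util
from enum import Enum
from typing import Any, Dict, Union


class DtypeName(Enum):
    """Stand-in for DtypeEnum: each member lists the spellings it accepts."""
    FP16 = ("torch.float16", "fp16", "float16", "half")
    BF16 = ("torch.bfloat16", "bf16", "bfloat16")
    FP32 = ("torch.float32", "fp32", "float32", "float")
    INT8 = ("torch.int8", "int8")


def coerce_dtype(value: Any) -> DtypeName:
    """Type coercion: map a dtype-like value to a canonical member."""
    if isinstance(value, DtypeName):
        return value
    text = str(value).strip().lower()
    for member in DtypeName:
        if text in member.value:
            return member
    raise ValueError(f"Unsupported dtype: {value!r}")


def coerce_moe(value: Union[bool, Dict[str, Any]]) -> Dict[str, Any]:
    """Backward compatibility: a legacy boolean becomes a nested mapping."""
    if isinstance(value, bool):
        return {"enabled": value}  # "enabled" is a hypothetical key
    return dict(value)


def check_triton(use_triton: bool) -> bool:
    """Dependency check: reject the option when triton is not importable."""
    if use_triton and importlib.util.find_spec("triton") is None:
        raise ValueError("Triton was requested but is not installed")
    return use_triton


print(coerce_dtype("fp16"), coerce_moe(True), check_triton(False))
```

Because `coerce_dtype` compares against `str(value)`, a real `torch.dtype` such as `torch.float16` (whose string form is `torch.float16`) would resolve to the same member as the string `"fp16"`.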

Code Reference

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| replace_with_kernel_inject | bool | No | False | Inject optimized CUDA kernels for supported model architectures |
| dtype | torch.dtype | No | torch.float16 | Target data type for model weights and computation |
| tensor_parallel | DeepSpeedTPConfig / dict | No | {} (tp_size=1) | Tensor parallelism configuration (tp_size, tp_grain_size, mpu, tp_group) |
| enable_cuda_graph | bool | No | False | Capture a CUDA graph for replay-based forward pass execution |
| injection_policy | Optional[Dict] | No | None | Mapping of model layer classes to injection policies |
| max_out_tokens | int | No | 1024 | Maximum total tokens (input + output) for inference |
| min_out_tokens | int | No | 1 | Minimum number of output tokens expected; the runtime raises an error if it cannot satisfy this |
| quant | QuantizationConfig | No | {} | Quantization settings for int8 inference |
| keep_module_on_host | bool | No | False | Keep checkpoint data on the host CPU to avoid device OOM |
| checkpoint | Optional[Union[str, Dict]] | No | None | Path to a DeepSpeed-compatible checkpoint or load policy JSON |
| return_tuple | bool | No | True | Whether transformer layers return tuples or tensors |
| triangular_masking | bool | No | True | Use triangular (causal) attention masking |
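Since max_out_tokens counts input and output tokens together, the room left for generation shrinks as the prompt grows. The sketch below shows only that arithmetic, using the table's defaults; `remaining_output_budget` is a hypothetical helper, and the actual runtime check behind min_out_tokens may also depend on factors such as available memory.

```python
def remaining_output_budget(
    prompt_tokens: int,
    max_out_tokens: int = 1024,
    min_out_tokens: int = 1,
) -> int:
    """Return how many tokens can still be generated after the prompt.

    Illustrative arithmetic only; not the DeepSpeed runtime check.
    """
    remaining = max_out_tokens - prompt_tokens
    if remaining < min_out_tokens:
        raise RuntimeError(
            f"Only {remaining} output tokens fit, but at least "
            f"{min_out_tokens} are expected"
        )
    return remaining


print(remaining_output_budget(prompt_tokens=900))  # 124
```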

I/O

| Direction | Name | Type | Description |
|---|---|---|---|
| Input | **kwargs | Configuration keyword arguments | Individual configuration parameters or aliases |
| Output | config | DeepSpeedInferenceConfig | Validated Pydantic model instance with all inference settings |

Usage Example

from deepspeed.inference.config import DeepSpeedInferenceConfig
import torch

# Create inference configuration with explicit parameters
config = DeepSpeedInferenceConfig(
    replace_with_kernel_inject=True,
    dtype=torch.float16,
    tensor_parallel={"tp_size": 4},
    enable_cuda_graph=True,
    max_out_tokens=2048
)

# Access validated fields
print(config.dtype)                        # torch.float16
print(config.tensor_parallel.tp_size)      # 4
print(config.replace_with_kernel_inject)   # True

# Using aliases for backward compatibility
config_compat = DeepSpeedInferenceConfig(
    kernel_inject=True,
    tp={"tp_size": 2},
    max_tokens=4096
)
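For settings kept in a file rather than in code, the first configuration above can be written as a plain dictionary. This is a sketch under two assumptions: that the string `"fp16"` is among the spellings DtypeEnum accepts, and that the dictionary is later handed to `deepspeed.init_inference()` (shown only as a comment, since it needs a model and a DeepSpeed installation).

```python
import json

# Dictionary form of the first example; keys are the canonical field names.
inference_config = {
    "replace_with_kernel_inject": True,
    "dtype": "fp16",  # assumed string spelling of torch.float16
    "tensor_parallel": {"tp_size": 4},
    "enable_cuda_graph": True,
    "max_out_tokens": 2048,
}

# The dictionary holds only JSON-compatible values, so it can be stored on disk.
serialized = json.dumps(inference_config, indent=2)
print(serialized)

# Possible use, assuming `model` is an already-loaded torch module:
# engine = deepspeed.init_inference(model, config=inference_config)
```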

Knowledge Sources

Relationships

Principle:Deepspeedai_DeepSpeed_Inference_Configuration

Metadata

  • Workflow: Inference_Engine_Optimization
  • Type: Implementation
  • Last Updated: 2026-02-09 00:00 GMT
