Principle:Deepspeedai DeepSpeed Inference Configuration

Overview

A Pydantic-based configuration system for specifying inference optimization parameters including kernel injection, tensor parallelism, quantization, and CUDA graph capture.

Detailed Description

DeepSpeed inference configuration controls how a pretrained model is optimized for inference. The configuration system uses Pydantic models to validate all parameters at construction time, providing type safety, sensible defaults, and backward-compatible aliases.

Key configuration decisions include:

Kernel Injection (replace_with_kernel_inject): Whether to replace standard PyTorch attention and MLP layers with fused CUDA kernels. These kernels combine multiple operations (e.g., QKV projection, softmax, attention output) into single GPU kernel launches, reducing memory traffic and launch overhead.
Tensor Parallelism (tensor_parallel): The degree of model parallelism across GPUs. A tp_size of N shards each transformer layer's weight matrices across N GPUs, so each GPU holds roughly 1/N of the sharded weights.
Precision (dtype): The inference data type, one of torch.float16, torch.bfloat16, torch.float32, or torch.int8. Lower precision reduces memory use and can increase throughput, at the cost of potential accuracy degradation.
CUDA Graph Capture (enable_cuda_graph): When enabled, the forward pass is recorded as a CUDA graph that is replayed on subsequent calls, removing per-kernel CPU launch overhead for fixed-shape inputs.
Injection Policy (injection_policy): A dictionary mapping model layer classes to their injection policies, enabling DeepSpeed to optimize custom model architectures beyond built-in support.
Quantization (quant): Configuration for weight and activation quantization (int8), controlling number of bits, quantization groups, and symmetric/asymmetric modes.
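
As a rough illustration, the options above might be collected into a plain dictionary like the one below. The key names come from the descriptions above; the values, and the string form "fp16" for the data type, are assumptions chosen for illustration. injection_policy and quant are left out because their nested layout is not spelled out here. The commented-out call shows how such a dictionary is commonly handed to deepspeed.init_inference and should be checked against the installed DeepSpeed release.

```python
# Illustrative values only; key names follow the options described above.
inference_config = {
    "replace_with_kernel_inject": True,  # use fused kernels where supported
    "tensor_parallel": {"tp_size": 2},   # shard weights across 2 GPUs
    "dtype": "fp16",                     # assumed string form of torch.float16
    "enable_cuda_graph": False,          # kept off because tp_size > 1
}

# Assumed usage (needs DeepSpeed, a loaded model, and two GPUs):
# import deepspeed
# engine = deepspeed.init_inference(model, config=inference_config)
```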

The configuration hierarchy uses nested Pydantic models: DeepSpeedTPConfig for tensor parallelism, DeepSpeedMoEConfig for mixture-of-experts, and QuantizationConfig for quantization. Many fields support aliases for backward compatibility (e.g., kernel_inject for replace_with_kernel_inject, tp for tensor_parallel).
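
The alias and nesting behavior can be sketched with standard-library dataclasses. This is a stand-in for the idea, not Pydantic and not DeepSpeed's actual classes: TPSketch, InferenceConfigSketch, ALIASES, and build_config are hypothetical names, and only the two aliases mentioned above are modeled.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins; these are NOT DeepSpeed's classes.
@dataclass
class TPSketch:
    tp_size: int = 1

    def __post_init__(self):
        # Validate at construction time, as a Pydantic model would.
        if not isinstance(self.tp_size, int) or self.tp_size < 1:
            raise ValueError("tp_size must be a positive integer")


@dataclass
class InferenceConfigSketch:
    replace_with_kernel_inject: bool = False
    tensor_parallel: TPSketch = field(default_factory=TPSketch)


# Alias -> canonical field name (the two aliases named in the text).
ALIASES = {"kernel_inject": "replace_with_kernel_inject", "tp": "tensor_parallel"}


def build_config(raw: dict) -> InferenceConfigSketch:
    """Normalize aliases, build nested models, and apply defaults."""
    normalized = {}
    for key, value in raw.items():
        canonical = ALIASES.get(key, key)
        if canonical in normalized:
            raise ValueError(f"{key!r} duplicates field {canonical!r}")
        normalized[canonical] = value
    tp = normalized.get("tensor_parallel")
    if isinstance(tp, dict):
        normalized["tensor_parallel"] = TPSketch(**tp)
    return InferenceConfigSketch(**normalized)


cfg = build_config({"kernel_inject": True, "tp": {"tp_size": 2}})
```

Here the old-style keys kernel_inject and tp end up on the canonical fields, and a missing key falls back to its default, which is the behavior the aliases are meant to preserve for existing callers.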

Theoretical Basis

The inference optimization configuration space represents a set of composable transformations applied to a PyTorch model:

Kernel injection replaces generic PyTorch operations with fused CUDA kernels, reducing memory traffic (fewer intermediate tensor materializations) and kernel launch overhead (fewer individual GPU kernel launches).
Tensor parallelism distributes model parameters across GPUs using column-parallel and row-parallel partitioning of weight matrices, following the Megatron-LM approach. This reduces per-GPU memory proportionally to the parallelism degree.
Quantization reduces numerical precision (for example, fp16 to int8), trading a small accuracy loss for an approximately 2x reduction in weight memory and a potential throughput improvement on hardware with integer compute units.
CUDA graphs capture a static execution graph on the GPU that can be replayed without CPU intervention, eliminating CPU-side overhead (kernel dispatch, memory allocation) for workloads with fixed input shapes.
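
The column-parallel and row-parallel partitioning mentioned above can be checked on a toy two-layer MLP. The sketch below is pure Python with no GPUs involved: each "shard" is a list slice, and the final sum stands in for the all-reduce a real multi-GPU run would perform. All function names are hypothetical, and the ReLU activation is an assumption chosen because it is elementwise.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_columns(m, parts):
    if len(m[0]) % parts:
        raise ValueError("column count must be divisible by parts")
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    if len(m) % parts:
        raise ValueError("row count must be divisible by parts")
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def mlp_reference(x, w1, w2):
    """Unsharded computation: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def mlp_tensor_parallel(x, w1, w2, tp_size):
    w1_shards = split_columns(w1, tp_size)  # column-parallel first layer
    w2_shards = split_rows(w2, tp_size)     # row-parallel second layer
    # Each "rank" computes a partial output from its own pair of shards.
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    out = partials[0]
    for p in partials[1:]:
        out = add(out, p)  # stands in for the all-reduce across ranks
    return out


x = [[1, -2, 3, 0], [0, 1, -1, 2]]
w1 = [[1, 0, -1, 2], [2, 1, 0, -1], [0, -2, 1, 1], [1, 1, 1, -3]]
w2 = [[1, 2], [0, -1], [3, 1], [-2, 0]]
same = mlp_tensor_parallel(x, w1, w2, tp_size=2) == mlp_reference(x, w1, w2)
```

Because the activation is elementwise, splitting the first weight matrix by columns lets each rank apply it locally; only one reduction is needed after the row-parallel second layer, and each rank stores 1/tp_size of both matrices.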

The configuration system ensures these transformations are composed correctly. For example, CUDA graphs are incompatible with tensor parallelism, a combination that is rejected at construction time, and kernel injection requires a compatible model architecture.
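
A minimal sketch of such a composition check is shown below. The function name, error type, and messages are hypothetical; where DeepSpeed performs its own check, and how it words the error, may differ.

```python
def check_composition(enable_cuda_graph: bool, tp_size: int) -> None:
    """Hypothetical check mirroring the incompatibility described above."""
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    if enable_cuda_graph and tp_size > 1:
        raise ValueError(
            "CUDA graph capture cannot be combined with tensor parallelism (tp_size > 1)"
        )


check_composition(enable_cuda_graph=True, tp_size=1)   # accepted: single GPU
check_composition(enable_cuda_graph=False, tp_size=4)  # accepted: no graph capture
```

Failing at construction time means an invalid combination surfaces before any weights are loaded or kernels are built.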

Knowledge Sources

Relationships

Implementation:Deepspeedai_DeepSpeed_DeepSpeedInferenceConfig_Init

Metadata

Workflow: Inference_Engine_Optimization
Type: Principle
Last Updated: 2026-02-09 00:00 GMT
