Principle:Pytorch Serve DeepSpeed Inference
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | DeepSpeed Inference |
| Domains | Distributed_Computing, Model_Serving |
| Knowledge Sources | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Tensor parallelism with DeepSpeed Inference enables serving large transformer models by sharding individual weight matrices across multiple GPUs. Unlike pipeline parallelism, which splits the model along the layer dimension, tensor parallelism splits each layer's parameters across GPUs so that all GPUs work on the same layer simultaneously. DeepSpeed Inference provides optimized tensor-parallel kernels and automatic model sharding for transformer architectures, making it well-suited for serving large language models in TorchServe.
Description
Tensor parallelism operates by partitioning the weight matrices of each transformer layer across multiple GPUs. For a model with hidden dimension H served on K GPUs, each GPU holds a slice of each weight matrix: H/K columns for column-parallel layers or H/K rows for row-parallel layers.
How tensor parallelism works in a transformer layer:
For a self-attention layer with weight matrices of shape (H, H):
- The QKV projection weights are split column-wise into K shards, each of shape (H, H/K), so each GPU computes attention for its own subset of heads.
- The output projection weight is split row-wise into K shards of H/K rows; each GPU computes a partial output Y_k = A_k * W_k from its local attention result A_k.
- An all-reduce sums the partial outputs across GPUs to produce the full result.
This approach requires synchronization (all-reduce) at each layer boundary, but all GPUs are active on every token, providing lower latency per token than pipeline parallelism.
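As a single-process sketch, the sharded pair of matrix multiplications and the final all-reduce sum can be simulated with NumPy, treating each shard as one GPU's slice (shapes and the elementwise sum standing in for the all-reduce are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H, K, T = 8, 4, 3          # hidden dim, "GPUs", tokens

X = rng.standard_normal((T, H))
W1 = rng.standard_normal((H, H))   # column-parallel weight (e.g., QKV projection)
W2 = rng.standard_normal((H, H))   # row-parallel weight (e.g., output projection)

# Column parallelism: each "GPU" k holds H/K columns of W1 and
# computes its own activation slice locally -- no communication yet.
col_shards = np.split(W1, K, axis=1)          # K shards of shape (H, H/K)
local_acts = [X @ Wk for Wk in col_shards]    # each of shape (T, H/K)

# Row parallelism: each "GPU" holds the matching H/K rows of W2.
row_shards = np.split(W2, K, axis=0)          # K shards of shape (H/K, H)
partials = [a @ Wk for a, Wk in zip(local_acts, row_shards)]

# One all-reduce (here: an elementwise sum) yields the full output.
Y = sum(partials)

assert np.allclose(Y, (X @ W1) @ W2)          # matches the unsharded compute
```

Note that the intermediate activations are never gathered: the column-parallel output feeds the row-parallel input shard-for-shard, which is why the pair needs only the single all-reduce at the end.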
DeepSpeed Inference integration in TorchServe:
- The `get_ds_engine()` function reads DeepSpeed configuration from `ctx.model_yaml_config["deepspeed"]`.
- A DeepSpeed config JSON file specifies the data type, tensor parallel size, and whether to use kernel injection.
- `deepspeed.init_inference()` is called to create an inference engine that automatically shards the model.
- The engine replaces standard transformer layers with DeepSpeed-optimized versions when `replace_with_kernel_inject: true` is set.
- TorchServe uses `torchrun` (with `parallelType: "tp"`) to launch one process per GPU.
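A DeepSpeed config file of the shape described in this list might look like the following sketch; the exact schema is defined by DeepSpeed's inference config, and the values here (fp16, 4-way tensor parallelism) are illustrative:

```json
{
  "dtype": "torch.float16",
  "replace_with_kernel_inject": true,
  "tensor_parallel": {
    "tp_size": 4
  }
}
```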
Kernel injection is a key DeepSpeed optimization where standard PyTorch transformer layers are replaced with fused CUDA kernels that combine multiple operations (layer norm, QKV projection, attention, and output projection) into fewer kernel launches, reducing GPU overhead.
Usage
To use DeepSpeed tensor parallelism in TorchServe:
- Create a custom handler that inherits from `BaseDeepSpeedHandler`.
- In `initialize()`, load the model, call `get_ds_engine(model, ctx)`, and assign `ds_engine.module` to `self.model`.
- Create a DeepSpeed config JSON file (e.g., `ds-config.json`) specifying `dtype`, `tensor_parallel.tp_size`, and `replace_with_kernel_inject`.
- Configure `model-config.yaml` with `parallelType: "tp"` and reference the DeepSpeed config file.
- Package and deploy via `torch-model-archiver`.
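An illustrative `model-config.yaml` for the steps above might look like the fragment below; the model name, worker counts, and section contents are placeholders, and only the keys named on this page (`parallelType`, `torchrun`, the `deepspeed` section) are assumed:

```yaml
# Illustrative model-config.yaml (values are placeholders)
minWorkers: 1
maxWorkers: 1
deviceType: "gpu"
parallelType: "tp"
torchrun:
  nproc-per-node: 4        # one process per GPU; keep in sync with tp_size

handler:
  model_name: "facebook/opt-6.7b"   # hypothetical example model

deepspeed:
  config: ds-config.json   # the DeepSpeed config JSON file
```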
Pre-installing DeepSpeed with `DS_BUILD_OPS=1 pip install deepspeed` is recommended to reduce model-loading latency by avoiding JIT compilation of custom CUDA kernels at runtime.
Theoretical Basis
Tensor parallelism is based on the mathematical property that matrix multiplication can be decomposed across one dimension. Given a matrix multiplication Y = X * W:
Column parallelism: W is split column-wise into [W_1, W_2, ..., W_K]. Each GPU k computes Y_k = X * W_k. The results are concatenated: Y = [Y_1, Y_2, ..., Y_K]. This is used for the first linear layer in feed-forward networks and the QKV projections in attention.
Row parallelism: W is split row-wise. The input X must also be partitioned. Each GPU computes a partial result, and an all-reduce sum produces the final output. This is used for the output projection in attention and the second linear layer in feed-forward networks.
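Both decompositions can be checked numerically. The snippet below (NumPy, illustrative shapes) verifies the concatenation identity for column parallelism and the all-reduce-sum identity for row parallelism with a partitioned input:

```python
import numpy as np

rng = np.random.default_rng(1)
H, K = 12, 3
X = rng.standard_normal((5, H))
W = rng.standard_normal((H, H))

# Column parallelism: Y = [X W_1, ..., X W_K], results are concatenated.
col_shards = np.split(W, K, axis=1)               # K shards of shape (H, H/K)
Y_col = np.concatenate([X @ Wk for Wk in col_shards], axis=1)
assert np.allclose(Y_col, X @ W)

# Row parallelism: X is partitioned to match W's row shards; the partial
# products are summed, which is what the all-reduce performs across GPUs.
row_shards = np.split(W, K, axis=0)               # K shards of shape (H/K, H)
X_parts = np.split(X, K, axis=1)                  # matching (5, H/K) input slices
Y_row = sum(Xk @ Wk for Xk, Wk in zip(X_parts, row_shards))
assert np.allclose(Y_row, X @ W)
```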
The combination of column and row parallelism in alternating layers allows a pair of linear layers (common in transformers) to require only a single all-reduce operation, minimizing communication overhead.
The communication cost scales as O(H) per layer per token, where H is the hidden dimension. For K GPUs, each all-reduce communicates 2*(K-1)/K * H elements. This makes tensor parallelism most effective when GPUs are connected by high-bandwidth links (e.g., NVLink at 600 GB/s between GPUs on the same node).
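The 2*(K-1)/K factor comes from the standard ring all-reduce (a reduce-scatter followed by an all-gather, each moving (K-1)/K of the data). A quick check of the formula, with H = 8192 as an assumed hidden dimension:

```python
def ring_allreduce_elements(H, K):
    """Elements each GPU communicates per ring all-reduce of H elements on K GPUs."""
    return 2 * (K - 1) / K * H

H = 8192  # assumed hidden dimension of a large transformer
for K in (2, 4, 8):
    elems = ring_allreduce_elements(H, K)
    # In fp16 this is 2 bytes per element.
    print(f"K={K}: {elems:.0f} elements, {elems * 2 / 1024:.1f} KiB in fp16")
```

The per-GPU volume approaches 2H as K grows, so adding GPUs does not reduce the per-layer communication; it only becomes cheaper with faster interconnects, which is why high-bandwidth links such as NVLink matter.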
Kernel injection further optimizes inference by fusing multiple operations into single CUDA kernels, reducing kernel launch overhead and memory bandwidth consumption. DeepSpeed provides pre-built fused kernels for common transformer architectures.
Related Pages
- Implementation:Pytorch_Serve_BaseDeepSpeedHandler - DeepSpeed handler base class and engine initialization
- Pytorch_Serve_Parallelism_Strategy - Choosing between parallelism strategies
- Pytorch_Serve_Distributed_Configuration - Configuring distributed serving parameters
- Pytorch_Serve_Distributed_Worker - Managing worker processes for distributed inference