
Principle:Pytorch Serve DeepSpeed Inference

From Leeroopedia
Field Value
Page Type Principle
Title DeepSpeed Inference
Domains Distributed_Computing, Model_Serving
Knowledge Sources TorchServe
Last Updated 2026-02-13 00:00 GMT

Overview

Tensor parallelism with DeepSpeed Inference enables serving large transformer models by sharding individual weight matrices across multiple GPUs. Unlike pipeline parallelism, which splits the model along the layer dimension, tensor parallelism splits each layer's parameters across GPUs so that all GPUs work on the same layer simultaneously. DeepSpeed Inference provides optimized tensor-parallel kernels and automatic model sharding for transformer architectures, making it well-suited for serving large language models in TorchServe.

Description

Tensor parallelism operates by partitioning the weight matrices of each transformer layer across multiple GPUs. For a model with hidden dimension H served on K GPUs, each GPU holds a weight shard with H/K columns (for column-parallel layers) or H/K rows (for row-parallel layers).

How tensor parallelism works in a transformer layer:

For a self-attention block with projection weights of shape (H, H):

  1. The Q, K, and V weight matrices are split column-wise into K shards, each of shape (H, H/K), so each GPU holds H/K of the attention heads.
  2. Each GPU computes attention for its own heads with no communication: Y_k = X * W_k.
  3. The output projection is split row-wise; an all-reduce sums the partial results to produce the full output.

This approach requires synchronization (all-reduce) at each layer boundary, but all GPUs are active on every token, providing lower latency per token than pipeline parallelism.
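The column-parallel decomposition described above can be checked with NumPy (a sketch with toy shapes; a real deployment runs one GPU process per shard):

```python
import numpy as np

H, K, T = 8, 4, 3                      # hidden dim, number of GPUs, tokens
rng = np.random.default_rng(0)

X = rng.standard_normal((T, H))        # activations, replicated on every GPU
W = rng.standard_normal((H, H))        # full weight matrix (H, H)

# Column-parallel: GPU k holds shard W_k of shape (H, H/K).
shards = np.split(W, K, axis=1)

# Each GPU computes its slice of the output independently: Y_k = X @ W_k.
partials = [X @ W_k for W_k in shards]

# Concatenating the slices reproduces the full output exactly.
Y = np.concatenate(partials, axis=1)

assert np.allclose(Y, X @ W)
```

Because the shard outputs are exact slices of the full output, the per-GPU computations need no communication until the results must be recombined.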

DeepSpeed Inference integration in TorchServe:

  • The get_ds_engine() function reads DeepSpeed configuration from ctx.model_yaml_config["deepspeed"].
  • A DeepSpeed config JSON file specifies the data type, tensor parallel size, and whether to use kernel injection.
  • deepspeed.init_inference() is called to create an inference engine that automatically shards the model.
  • The engine replaces standard transformer layers with DeepSpeed-optimized versions when replace_with_kernel_inject: true is set.
  • TorchServe uses torchrun (with parallelType: "tp") to launch one process per GPU.

Kernel injection is a key DeepSpeed optimization where standard PyTorch transformer layers are replaced with fused CUDA kernels that combine multiple operations (layer norm, QKV projection, attention, and output projection) into fewer kernel launches, reducing GPU overhead.
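One of the fusions kernel injection performs, combining the separate Q, K, and V projections into a single matrix multiply, can be illustrated at the NumPy level (the real optimization happens inside fused CUDA kernels; this sketch only shows why the fused form is equivalent):

```python
import numpy as np

H, T = 16, 4
rng = np.random.default_rng(1)
X = rng.standard_normal((T, H))
Wq, Wk, Wv = (rng.standard_normal((H, H)) for _ in range(3))

# Unfused: three separate projections (three kernel launches on a GPU).
Q, Kmat, V = X @ Wq, X @ Wk, X @ Wv

# Fused: one projection against the concatenated weight (one launch).
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)      # shape (H, 3H)
Qf, Kf, Vf = np.split(X @ W_qkv, 3, axis=1)

assert np.allclose(Q, Qf) and np.allclose(Kmat, Kf) and np.allclose(V, Vf)
```

The fused version reads the activations X once instead of three times, which is where the memory-bandwidth saving comes from.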

Usage

To use DeepSpeed tensor parallelism in TorchServe:

  1. Create a custom handler that inherits from BaseDeepSpeedHandler.
  2. In initialize(), load the model, call get_ds_engine(model, ctx), and assign ds_engine.module to self.model.
  3. Create a DeepSpeed config JSON file (e.g., ds-config.json) specifying dtype, tensor_parallel.tp_size, and replace_with_kernel_inject.
  4. Configure model-config.yaml with parallelType: "tp" and reference the DeepSpeed config file.
  5. Package and deploy via torch-model-archiver.

Pre-installing DeepSpeed with DS_BUILD_OPS=1 pip install deepspeed is recommended to reduce model loading latency by avoiding JIT compilation of custom CUDA kernels at runtime.
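Steps 3 and 4 might look like the following (filenames and values are illustrative, not prescribed; a real deployment would set tp_size to its actual GPU count):

```json
{
  "dtype": "torch.float16",
  "replace_with_kernel_inject": true,
  "tensor_parallel": {
    "tp_size": 2
  }
}
```

```yaml
parallelType: "tp"
deviceType: "gpu"
torchrun:
  nproc-per-node: 2
deepspeed:
  config: ds-config.json
```

The nproc-per-node value should match tp_size so that torchrun launches exactly one process per tensor-parallel shard.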

Theoretical Basis

Tensor parallelism is based on the mathematical property that a matrix multiplication can be decomposed along either dimension of the weight matrix. Given a matrix multiplication Y = X * W:

Column parallelism: W is split column-wise into [W_1, W_2, ..., W_K]. Each GPU k computes Y_k = X * W_k. The results are concatenated: Y = [Y_1, Y_2, ..., Y_K]. This is used for the first linear layer in feed-forward networks and the QKV projections in attention.

Row parallelism: W is split row-wise. The input X must be partitioned column-wise to match (in practice it already is, as the output of the preceding column-parallel layer). Each GPU computes a partial result, and an all-reduce sum produces the final output. This is used for the output projection in attention and the second linear layer in feed-forward networks.

The combination of column and row parallelism in alternating layers allows a pair of linear layers (common in transformers) to require only a single all-reduce operation, minimizing communication overhead.
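The column/row pairing for a feed-forward block can be verified numerically (a NumPy sketch in which the sum over partial results stands in for the all-reduce):

```python
import numpy as np

H, K, T = 8, 4, 3
F = 4 * H                                  # feed-forward inner dimension
rng = np.random.default_rng(2)

X = rng.standard_normal((T, H))
A = rng.standard_normal((H, F))            # first FFN weight: column-parallel
B = rng.standard_normal((F, H))            # second FFN weight: row-parallel

relu = lambda z: np.maximum(z, 0.0)        # elementwise nonlinearity

A_shards = np.split(A, K, axis=1)          # each shard (H, F/K)
B_shards = np.split(B, K, axis=0)          # each shard (F/K, H)

# Each GPU computes relu(X @ A_k) @ B_k with no communication in between:
# the column split of A lines up exactly with the row split of B, and the
# elementwise nonlinearity applies shard-by-shard.
partials = [relu(X @ A_k) @ B_k for A_k, B_k in zip(A_shards, B_shards)]

# A single all-reduce (here: a plain sum) yields the full FFN output.
Y = sum(partials)

assert np.allclose(Y, relu(X @ A) @ B)
```

The key point is that the intermediate activations stay sharded through the nonlinearity, so the pair of linear layers costs one all-reduce rather than two.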

The communication cost scales as O(H) per layer per token, where H is the hidden dimension. For K GPUs using a ring all-reduce, each GPU sends and receives 2*(K-1)/K * H elements per all-reduce. This makes tensor parallelism most effective when GPUs are connected by high-bandwidth links (e.g., NVLink at 600 GB/s between GPUs on the same node).
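Plugging in representative numbers (hypothetical: H = 8192, K = 8, fp16 activations) makes the per-all-reduce volume concrete:

```python
H = 8192                 # hidden dimension (illustrative)
K = 8                    # GPUs in the tensor-parallel group
bytes_per_elem = 2       # fp16

# Ring all-reduce: each GPU sends and receives 2*(K-1)/K * N elements.
elems = 2 * (K - 1) / K * H
traffic_bytes = elems * bytes_per_elem

print(f"{elems:.0f} elements, {traffic_bytes / 1024:.1f} KiB per token per all-reduce")
# → 14336 elements, 28.0 KiB per token per all-reduce
```

A few tens of KiB per token per all-reduce is negligible over NVLink but becomes a bottleneck over slower inter-node links, which is why tensor parallelism is usually confined to a single node.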

Kernel injection further optimizes inference by fusing multiple operations into single CUDA kernels, reducing kernel launch overhead and memory bandwidth consumption. DeepSpeed provides pre-built fused kernels for common transformer architectures.
