Environment: TorchServe DeepSpeed Environment
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Inference, LLMs |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
DeepSpeed inference environment for distributed large model serving with TorchServe.
Description
This environment provides the DeepSpeed library for serving large models that exceed single-GPU memory via model parallelism. The `BaseDeepSpeedHandler` uses the `LOCAL_RANK` environment variable for device assignment and delegates to DeepSpeed's inference engine for automatic model partitioning. DeepSpeed supports fp16/bf16 inference, kernel fusion, and heterogeneous memory management for models like Bloom, GPT-NeoX, and other large language models.
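For illustration, a minimal sketch of the delegation pattern described above, assuming a HuggingFace causal LM and at least two visible GPUs. The function name, default `tp_size`, and dtype choice are this sketch's assumptions, not the actual `BaseDeepSpeedHandler` code:

```python
# Illustrative sketch (not the real BaseDeepSpeedHandler) of handing a
# HuggingFace model to DeepSpeed's inference engine for partitioning.
import os

def load_sharded_model(model_path: str, tp_size: int = 2):
    """Load a causal LM and let DeepSpeed shard it across tp_size GPUs."""
    # Heavy imports kept local so the module can be imported without GPUs.
    import deepspeed
    import torch
    from transformers import AutoModelForCausalLM

    # Device assignment mirrors the handler: one process per GPU,
    # selected by LOCAL_RANK.
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16
    )
    # init_inference returns an engine; engine.module is the sharded model.
    engine = deepspeed.init_inference(
        model,
        tensor_parallel={"tp_size": tp_size},
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )
    return engine.module
```

Each worker process runs this same code; DeepSpeed coordinates the partitioning across ranks via NCCL.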
Usage
Use this environment when serving models that are too large for a single GPU and require DeepSpeed's tensor parallelism or inference optimizations. Required for the Large Model Inference workflow when the DeepSpeed parallelism strategy is selected.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | DeepSpeed has limited non-Linux support |
| Hardware | Multiple NVIDIA GPUs | Minimum 2 GPUs for parallelism |
| VRAM | 16GB+ per GPU | More for larger models |
| Disk | 50GB+ | Model weights and DeepSpeed cache |
Dependencies
System Packages
- NVIDIA GPU driver >= 450
- CUDA Toolkit >= 11.0
- NCCL (for multi-GPU communication)
Python Packages
- `deepspeed`
- `torch` with CUDA support
- `transformers` >= 4.34.0
- `torchserve`
Environment Variables
The following environment variables must be set for distributed inference:
- `LOCAL_RANK`: Local rank of the process on the node (default: 0). Used by `BaseDeepSpeedHandler` for device assignment.
- `WORLD_SIZE`: Total number of processes across all nodes.
- `RANK`: Global rank of the process.
- `LOCAL_WORLD_SIZE`: Number of processes on the local node.
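The handler derives its CUDA device from these variables. A minimal sketch of that mapping, including the guard against the "invalid device ordinal" failure mode (the function name and device-string format are this sketch's choices):

```python
import os

def assigned_device(num_gpus: int) -> str:
    """Map this worker's LOCAL_RANK to a CUDA device string.

    Raises if LOCAL_RANK points past the GPUs on the node, which
    would otherwise surface later as 'invalid device ordinal'.
    """
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    if local_rank >= num_gpus:
        raise RuntimeError(
            f"LOCAL_RANK={local_rank} but only {num_gpus} GPUs visible"
        )
    return f"cuda:{local_rank}"

os.environ["LOCAL_RANK"] = "1"
print(assigned_device(num_gpus=4))  # -> cuda:1
```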
Quick Install
```shell
# Install DeepSpeed with dependencies
# (quote the version specifier so the shell doesn't treat >= as redirection)
pip install deepspeed "transformers>=4.34.0"

# Install TorchServe
pip install torchserve torch-model-archiver
```
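After installing, a quick sanity check that the required packages are importable, without initializing CUDA (helper name is this sketch's own):

```python
# Check that required packages resolve without actually importing them,
# so no CUDA initialization happens.
import importlib.util

def check_imports(packages):
    """Return {package: importable?} for each name."""
    return {p: importlib.util.find_spec(p) is not None for p in packages}

for pkg, ok in check_imports(["deepspeed", "torch", "transformers"]).items():
    print(f"{pkg}: {'ok' if ok else 'MISSING'}")
```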
Code Evidence
DeepSpeed import from `ts/handler_utils/distributed/deepspeed.py:6`:
```python
import deepspeed
```
Device assignment via LOCAL_RANK from `ts/torch_handler/distributed/base_deepspeed_handler.py:13-14`:
```python
def initialize(self, ctx: Context):
    self.device = int(os.getenv("LOCAL_RANK", 0))
```
Worker environment variables from `ts/model_service_worker.py:23-27`:
```python
BENCHMARK = os.getenv("TS_BENCHMARK") in ["True", "true", "TRUE"]
LOCAL_RANK = int(os.getenv("LOCAL_RANK", 0))
WORLD_SIZE = int(os.getenv("WORLD_SIZE", 0))
WORLD_RANK = int(os.getenv("RANK", 0))
LOCAL_WORLD_SIZE = int(os.getenv("LOCAL_WORLD_SIZE", 0))
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: No module named 'deepspeed'` | DeepSpeed not installed | `pip install deepspeed` |
| `NCCL error: unhandled system error` | NCCL communication failure | Check GPU interconnect; ensure NCCL is installed and GPUs are visible |
| `RuntimeError: CUDA error: invalid device ordinal` | LOCAL_RANK exceeds available GPUs | Ensure LOCAL_RANK < number of GPUs on the node |
| Model loading timeout | Large model takes too long to shard | Increase `startupTimeout` in model config (e.g., 1200s) |
Compatibility Notes
- Model support: Works best with HuggingFace Transformers models (AutoModelForCausalLM, etc.).
- DeepSpeed config: A `ds-config.json` file can specify dtype (fp16/bf16), tensor parallel size, and kernel injection settings.
- Multi-node: Supported via TorchServe's distributed worker spawning with proper RANK/WORLD_SIZE configuration.
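Along the lines of the config note above, a sketch of generating a `ds-config.json` (the key names follow the settings the note mentions, but exact accepted keys depend on the DeepSpeed version, and the values are illustrative, not defaults):

```python
import json

# Illustrative ds-config.json contents: dtype, tensor parallel size,
# and kernel injection, per the settings described above.
ds_config = {
    "dtype": "fp16",
    "tensor_parallel": {"tp_size": 2},
    "replace_with_kernel_inject": True,
}

with open("ds-config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```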