Environment: TorchServe DeepSpeed Environment
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Inference, LLMs |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
DeepSpeed inference environment for distributed large model serving with TorchServe.
Description
This environment provides the DeepSpeed library for serving large models that exceed single-GPU memory via model parallelism. The `BaseDeepSpeedHandler` uses the `LOCAL_RANK` environment variable for device assignment and delegates to DeepSpeed's inference engine for automatic model partitioning. DeepSpeed supports fp16/bf16 inference, kernel fusion, and heterogeneous memory management for models like Bloom, GPT-NeoX, and other large language models.
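For illustration, a minimal sketch of the delegation pattern described above, assuming a HuggingFace causal LM and at least two visible GPUs. The function name, default `tp_size`, and dtype choice are this sketch's assumptions, not the actual `BaseDeepSpeedHandler` code:

```python
# Illustrative sketch (not the real BaseDeepSpeedHandler) of handing a
# HuggingFace model to DeepSpeed's inference engine for partitioning.
import os

def load_sharded_model(model_path: str, tp_size: int = 2):
    """Load a causal LM and let DeepSpeed shard it across tp_size GPUs."""
    # Heavy imports kept local so the module can be imported without GPUs.
    import deepspeed
    import torch
    from transformers import AutoModelForCausalLM

    # Device assignment mirrors the handler: one process per GPU,
    # selected by LOCAL_RANK.
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16
    )
    # init_inference returns an engine; engine.module is the sharded model.
    engine = deepspeed.init_inference(
        model,
        tensor_parallel={"tp_size": tp_size},
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )
    return engine.module
```

Each worker process runs this same code; DeepSpeed coordinates the partitioning across ranks via NCCL.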
Usage
Use this environment when serving models that are too large for a single GPU and require DeepSpeed's tensor parallelism or inference optimizations. Required for the Large Model Inference workflow when the DeepSpeed parallelism strategy is selected.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | DeepSpeed has limited non-Linux support |
| Hardware | Multiple NVIDIA GPUs | Minimum 2 GPUs for parallelism |
| VRAM | 16GB+ per GPU | More for larger models |
| Disk | 50GB+ | Model weights and DeepSpeed cache |
Dependencies
System Packages
- NVIDIA GPU driver >= 450
- CUDA Toolkit >= 11.0
- NCCL (for multi-GPU communication)
Python Packages
- `deepspeed`
- `torch` with CUDA support
- `transformers` >= 4.34.0
- `torchserve`
Environment Variables
The following environment variables must be set for distributed inference:
- `LOCAL_RANK`: Local rank of the process on the node (default: 0). Used by `BaseDeepSpeedHandler` for device assignment.
- `WORLD_SIZE`: Total number of processes across all nodes.
- `RANK`: Global rank of the process.
- `LOCAL_WORLD_SIZE`: Number of processes on the local node.
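The handler derives its CUDA device from these variables. A minimal sketch of that mapping, including the guard against the "invalid device ordinal" failure mode (the function name and device-string format are this sketch's choices):

```python
import os

def assigned_device(num_gpus: int) -> str:
    """Map this worker's LOCAL_RANK to a CUDA device string.

    Raises if LOCAL_RANK points past the GPUs on the node, which
    would otherwise surface later as 'invalid device ordinal'.
    """
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    if local_rank >= num_gpus:
        raise RuntimeError(
            f"LOCAL_RANK={local_rank} but only {num_gpus} GPUs visible"
        )
    return f"cuda:{local_rank}"

os.environ["LOCAL_RANK"] = "1"
print(assigned_device(num_gpus=4))  # -> cuda:1
```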
Quick Install
```shell
# Install DeepSpeed with dependencies
# (quote the version specifier so the shell doesn't treat >= as redirection)
pip install deepspeed "transformers>=4.34.0"

# Install TorchServe
pip install torchserve torch-model-archiver
```
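After installing, a quick sanity check that the required packages are importable, without initializing CUDA (helper name is this sketch's own):

```python
# Check that required packages resolve without actually importing them,
# so no CUDA initialization happens.
import importlib.util

def check_imports(packages):
    """Return {package: importable?} for each name."""
    return {p: importlib.util.find_spec(p) is not None for p in packages}

for pkg, ok in check_imports(["deepspeed", "torch", "transformers"]).items():
    print(f"{pkg}: {'ok' if ok else 'MISSING'}")
```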
Code Evidence
DeepSpeed import from `ts/handler_utils/distributed/deepspeed.py:6`:
```python
import deepspeed
```
Device assignment via LOCAL_RANK from `ts/torch_handler/distributed/base_deepspeed_handler.py:13-14`:
```python
def initialize(self, ctx: Context):
    self.device = int(os.getenv("LOCAL_RANK", 0))
```
Worker environment variables from `ts/model_service_worker.py:23-27`:
```python
BENCHMARK = os.getenv("TS_BENCHMARK") in ["True", "true", "TRUE"]
LOCAL_RANK = int(os.getenv("LOCAL_RANK", 0))
WORLD_SIZE = int(os.getenv("WORLD_SIZE", 0))
WORLD_RANK = int(os.getenv("RANK", 0))
LOCAL_WORLD_SIZE = int(os.getenv("LOCAL_WORLD_SIZE", 0))
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: No module named 'deepspeed'` | DeepSpeed not installed | `pip install deepspeed` |
| `NCCL error: unhandled system error` | NCCL communication failure | Check GPU interconnect; ensure NCCL is installed and GPUs are visible |
| `RuntimeError: CUDA error: invalid device ordinal` | LOCAL_RANK exceeds available GPUs | Ensure LOCAL_RANK < number of GPUs on the node |
| Model loading timeout | Large model takes too long to shard | Increase `startupTimeout` in model config (e.g., 1200s) |
Compatibility Notes
- Model support: Works best with HuggingFace Transformers models (AutoModelForCausalLM, etc.).
- DeepSpeed config: A `ds-config.json` file can specify dtype (fp16/bf16), tensor parallel size, and kernel injection settings.
- Multi-node: Supported via TorchServe's distributed worker spawning with proper RANK/WORLD_SIZE configuration.
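Along the lines of the config note above, a sketch of generating a `ds-config.json` (the key names follow the settings the note mentions, but exact accepted keys depend on the DeepSpeed version, and the values are illustrative, not defaults):

```python
import json

# Illustrative ds-config.json contents: dtype, tensor parallel size,
# and kernel injection, per the settings described above.
ds_config = {
    "dtype": "fp16",
    "tensor_parallel": {"tp_size": 2},
    "replace_with_kernel_inject": True,
}

with open("ds-config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```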