
Environment:PyTorch Serve DeepSpeed Environment

From Leeroopedia
Knowledge Sources
Domains Distributed_Inference, LLMs
Last Updated 2026-02-13 00:00 GMT

Overview

DeepSpeed inference environment for distributed large model serving with TorchServe.

Description

This environment provides the DeepSpeed library for serving large models that exceed single-GPU memory via model parallelism. The `BaseDeepSpeedHandler` uses the `LOCAL_RANK` environment variable for device assignment and delegates to DeepSpeed's inference engine for automatic model partitioning. DeepSpeed supports fp16/bf16 inference, kernel fusion, and heterogeneous memory management for models like Bloom, GPT-NeoX, and other large language models.
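The rank-to-device mapping described above can be sketched as follows. `assign_device` is an illustrative helper, not part of TorchServe; it only mirrors the idea that each worker process reads `LOCAL_RANK` and uses it as its GPU index.

```python
import os

def assign_device(env=None):
    """Sketch of BaseDeepSpeedHandler-style device assignment:
    each worker reads LOCAL_RANK and treats it as its GPU index,
    so rank 0 serves on GPU 0, rank 1 on GPU 1, and so on."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", 0))

# Simulate two workers on one node
assert assign_device({"LOCAL_RANK": "0"}) == 0
assert assign_device({"LOCAL_RANK": "1"}) == 1
```

After device assignment, the real handler hands the model to DeepSpeed's inference engine, which partitions it across the participating ranks.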

Usage

Use this environment when serving models that are too large for a single GPU and require DeepSpeed's tensor parallelism or inference optimizations. Required for the Large Model Inference workflow when the DeepSpeed parallelism strategy is selected.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | DeepSpeed has limited non-Linux support |
| Hardware | Multiple NVIDIA GPUs | Minimum 2 GPUs for parallelism |
| VRAM | 16 GB+ per GPU | More for larger models |
| Disk | 50 GB+ | Model weights and DeepSpeed cache |

Dependencies

System Packages

  • NVIDIA GPU driver >= 450
  • CUDA Toolkit >= 11.0
  • NCCL (for multi-GPU communication)

Python Packages

  • `deepspeed`
  • `torch` with CUDA support
  • `transformers` >= 4.34.0
  • `torchserve`

Environment Variables

The following environment variables configure distributed inference:

  • `LOCAL_RANK`: Local rank of the process on the node (default: 0). Used by `BaseDeepSpeedHandler` for device assignment.
  • `WORLD_SIZE`: Total number of processes across all nodes.
  • `RANK`: Global rank of the process.
  • `LOCAL_WORLD_SIZE`: Number of processes on the local node.
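As a hedged illustration (this helper is not part of TorchServe), the four variables above can be sanity-checked before workers start. Inconsistent values are a common source of GPU mis-assignment:

```python
def check_dist_env(env):
    # Hypothetical sanity check: the global rank must index into WORLD_SIZE,
    # and the local rank into LOCAL_WORLD_SIZE, or workers will mis-assign GPUs.
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", 0))
    local_world_size = int(env.get("LOCAL_WORLD_SIZE", 1))
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK={rank} outside [0, WORLD_SIZE={world_size})")
    if not 0 <= local_rank < local_world_size:
        raise ValueError(
            f"LOCAL_RANK={local_rank} outside [0, LOCAL_WORLD_SIZE={local_world_size})")
    return rank, local_rank

# A valid single-node, two-GPU layout for process 1:
print(check_dist_env({"RANK": "1", "WORLD_SIZE": "2",
                      "LOCAL_RANK": "1", "LOCAL_WORLD_SIZE": "2"}))  # (1, 1)
```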

Quick Install

# Install DeepSpeed with dependencies
pip install deepspeed "transformers>=4.34.0"

# Install TorchServe
pip install torchserve torch-model-archiver

Code Evidence

DeepSpeed import from `ts/handler_utils/distributed/deepspeed.py:6`:

import deepspeed

Device assignment via LOCAL_RANK from `ts/torch_handler/distributed/base_deepspeed_handler.py:13-14`:

def initialize(self, ctx: Context):
    self.device = int(os.getenv("LOCAL_RANK", 0))

Worker environment variables from `ts/model_service_worker.py:23-27`:

BENCHMARK = os.getenv("TS_BENCHMARK") in ["True", "true", "TRUE"]
LOCAL_RANK = int(os.getenv("LOCAL_RANK", 0))
WORLD_SIZE = int(os.getenv("WORLD_SIZE", 0))
WORLD_RANK = int(os.getenv("RANK", 0))
LOCAL_WORLD_SIZE = int(os.getenv("LOCAL_WORLD_SIZE", 0))

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: No module named 'deepspeed'` | DeepSpeed not installed | `pip install deepspeed` |
| `NCCL error: unhandled system error` | NCCL communication failure | Check GPU interconnect; ensure NCCL is installed and GPUs are visible |
| `RuntimeError: CUDA error: invalid device ordinal` | LOCAL_RANK exceeds available GPUs | Ensure LOCAL_RANK < number of GPUs on the node |
| Model loading timeout | Large model takes too long to shard | Increase `startupTimeout` in model config (e.g., 1200s) |
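The "invalid device ordinal" failure can be caught early with a guard like the following. This is a sketch, not TorchServe code; in practice `num_gpus` would come from `torch.cuda.device_count()`:

```python
def validate_local_rank(local_rank, num_gpus):
    # Guard against "CUDA error: invalid device ordinal": LOCAL_RANK must
    # index a GPU that actually exists on this node.
    if not 0 <= local_rank < num_gpus:
        raise ValueError(
            f"LOCAL_RANK={local_rank} but only {num_gpus} GPU(s) visible")
    return f"cuda:{local_rank}"

print(validate_local_rank(1, 2))  # cuda:1
```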

Compatibility Notes

  • Model support: Works best with HuggingFace Transformers models (AutoModelForCausalLM, etc.).
  • DeepSpeed config: A `ds-config.json` file can specify dtype (fp16/bf16), tensor parallel size, and kernel injection settings.
  • Multi-node: Supported via TorchServe's distributed worker spawning with proper RANK/WORLD_SIZE configuration.
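A minimal `ds-config.json` along these lines would request fp16 weights, kernel injection, and a tensor-parallel degree of 2. The values here are illustrative; consult DeepSpeed's inference-config reference for the full schema:

```json
{
  "dtype": "torch.float16",
  "replace_with_kernel_inject": true,
  "tensor_parallel": {
    "tp_size": 2
  }
}
```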
