Environment: TorchServe CUDA GPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
NVIDIA GPU environment with CUDA toolkit for GPU-accelerated model inference on TorchServe.
Description
This environment provides GPU-accelerated inference for TorchServe. It requires an NVIDIA GPU with a compatible CUDA toolkit. The base handler auto-detects CUDA availability and GPU device capabilities at startup. For Ampere-generation GPUs (compute capability >= 8.0, e.g., A10G, A100, H100), TorchServe automatically enables tensor core optimizations. The environment supports CUDA versions from 9.2 through 12.1, with specific PyTorch wheel packages pinned per CUDA version. Additional optional accelerators (TensorRT, OpenVINO, IPEX) extend the GPU inference capabilities.
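The Ampere gate described above is just a tuple comparison on the device's compute capability. A minimal sketch of that check, using a hypothetical helper name (the real logic lives in `ts/torch_handler/base_handler.py`, shown under Code Evidence below):

```python
def supports_tensor_core_optimization(capability):
    """Mirror of the startup check TorchServe applies: Ampere and newer
    GPUs (compute capability >= 8.0) get tensor core optimizations."""
    return capability >= (8, 0)

# An A10G/A100/H100 reports (8, 0) or higher; a T4 reports (7, 5).
print(supports_tensor_core_optimization((8, 0)))  # True
print(supports_tensor_core_optimization((7, 5)))  # False
```

In a live environment the capability tuple comes from `torch.cuda.get_device_capability()`.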
Usage
Use this environment for any TorchServe deployment requiring GPU-accelerated inference. This includes standard model serving on GPU, vLLM-based LLM serving, DeepSpeed distributed inference, tensor parallel inference, and any handler that targets `deviceType: gpu`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | CUDA not supported on macOS; limited Windows support |
| Hardware | NVIDIA GPU | Compute capability >= 3.5 minimum; >= 8.0 for tensor core optimization |
| VRAM | Depends on model | 8GB minimum for typical models; 40-80GB for large LLMs |
| Driver | NVIDIA Driver >= 450 | Must match CUDA toolkit version |
| Disk | 10GB+ | CUDA toolkit and model weights |
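The table's hardware minimums can be encoded as a simple pre-flight check. This is an illustrative helper (not part of TorchServe), assuming the minimums stated above: compute capability >= 3.5, 8 GB VRAM for typical models, and driver >= 450:

```python
def meets_minimum_requirements(compute_capability, vram_gb, driver_version):
    """Check a GPU against the minimums in the System Requirements table.
    compute_capability is a (major, minor) tuple; driver_version is the
    major NVIDIA driver number."""
    return (
        compute_capability >= (3, 5)
        and vram_gb >= 8
        and driver_version >= 450
    )

print(meets_minimum_requirements((8, 0), 24, 525))  # True
print(meets_minimum_requirements((3, 0), 8, 450))   # False: capability too low
```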
Dependencies
System Packages
- NVIDIA GPU driver (>= 450 for CUDA 11.x, >= 525 for CUDA 12.x)
- CUDA Toolkit: one of 9.2, 10.1, 10.2, 11.1, 11.3, 11.6, 11.7, 11.8, 12.1
- cuDNN (matching CUDA version)
Python Packages
- `torch` with CUDA support (e.g., `torch==2.4.0+cu118` for CUDA 11.8)
- `torchvision` with matching CUDA build
- `torchaudio` with matching CUDA build
- `pynvml==11.5.0` (GPU metrics collection)
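PyTorch wheels on `download.pytorch.org` are tagged per CUDA version (`+cu118` for CUDA 11.8, and so on). A hypothetical helper that maps a toolkit version to that suffix, assuming the usual `cuXYZ` naming convention:

```python
def cuda_wheel_tag(cuda_version):
    """Map a CUDA toolkit version string like '11.8' to the PyTorch
    wheel suffix used on download.pytorch.org, e.g. 'cu118'."""
    return "cu" + cuda_version.replace(".", "")

print(cuda_wheel_tag("11.8"))  # cu118
print(cuda_wheel_tag("12.1"))  # cu121
```

The exact wheel versions pinned per CUDA release should be taken from TorchServe's own requirements files rather than computed.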
Optional Accelerators
- `torch_tensorrt` — NVIDIA TensorRT integration for optimized inference
- `openvino.torch` — OpenVINO backend for `torch.compile`
- `intel_extension_for_pytorch` — Intel GPU (XPU) via `TS_IPEX_GPU_ENABLE=true`
Environment Variables
- `TS_IPEX_GPU_ENABLE`: Set to `true` to enable Intel GPU (XPU) support instead of CUDA.
- `TS_IPEX_ENABLE`: Set to `true` to enable Intel Extension for PyTorch CPU optimizations.
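The XPU gate is a strict string comparison, as the Code Evidence below shows: only the exact value `true` enables it. A standalone mirror of that check (the function name is illustrative):

```python
import os

def ipex_gpu_enabled(environ=os.environ):
    """Replicates the gate in base_handler.py: Intel XPU is only
    considered when TS_IPEX_GPU_ENABLE is exactly the string 'true'."""
    return environ.get("TS_IPEX_GPU_ENABLE", "false") == "true"

print(ipex_gpu_enabled({"TS_IPEX_GPU_ENABLE": "true"}))  # True
print(ipex_gpu_enabled({"TS_IPEX_GPU_ENABLE": "True"}))  # False: case-sensitive
print(ipex_gpu_enabled({}))                              # False: defaults off
```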
Quick Install
# Install PyTorch with CUDA 11.8
pip install torch==2.4.0+cu118 torchvision==0.19.0+cu118 torchaudio==2.4.0+cu118 \
--extra-index-url https://download.pytorch.org/whl/cu118
# Install TorchServe
pip install torchserve torch-model-archiver
# Optional: TensorRT support
pip install torch_tensorrt
Code Evidence
CUDA detection and device selection from `ts/torch_handler/base_handler.py:155-177`:
if torch.cuda.is_available() and properties.get("gpu_id") is not None:
    self.map_location = "cuda"
    self.device = torch.device(
        self.map_location + ":" + str(properties.get("gpu_id"))
    )
elif (
    os.environ.get("TS_IPEX_GPU_ENABLE", "false") == "true"
    and properties.get("gpu_id") is not None
    and torch.xpu.is_available()
):
    self.map_location = "xpu"
    self.device = torch.device(
        self.map_location + ":" + str(properties.get("gpu_id"))
    )
elif torch.backends.mps.is_available() and properties.get("gpu_id") is not None:
    self.map_location = "mps"
    self.device = torch.device("mps")
elif XLA_AVAILABLE:
    self.device = xm.xla_device()
else:
    self.map_location = "cpu"
    self.device = torch.device(self.map_location)
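The selection order above (CUDA, then Intel XPU, then Apple MPS, then XLA, then CPU) can be exercised without any GPU libraries by factoring it into a pure function. This is a hypothetical standalone mirror for illustration, with availability passed in as flags rather than probed from `torch`:

```python
def select_device(gpu_id, cuda=False, ipex_gpu=False, xpu=False,
                  mps=False, xla=False):
    """Return the device string the handler would construct, following
    the priority order from base_handler.py. gpu_id is None when no GPU
    slot was assigned to the worker."""
    if cuda and gpu_id is not None:
        return f"cuda:{gpu_id}"
    if ipex_gpu and gpu_id is not None and xpu:
        return f"xpu:{gpu_id}"
    if mps and gpu_id is not None:
        return "mps"
    if xla:
        return "xla"
    return "cpu"

print(select_device(0, cuda=True))               # cuda:0
print(select_device(1, ipex_gpu=True, xpu=True)) # xpu:1
print(select_device(None, cuda=True))            # cpu: no gpu_id assigned
```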
Ampere tensor core auto-enable from `ts/torch_handler/base_handler.py:44-49`:
if torch.cuda.is_available() and torch.version.cuda:
    # If Ampere enable tensor cores which will give better performance
    # Ideally get yourself an A10G or A100 for optimal performance
    if torch.cuda.get_device_capability() >= (8, 0):
        torch.set_float32_matmul_precision("high")
        logger.info("Enabled tensor cores")
Optional accelerator imports from `ts/torch_handler/base_handler.py:61-98`:
try:
    import openvino.torch  # nopycln: import

    logger.info("OpenVINO backend enabled for torch.compile")
except ImportError:
    logger.warning("OpenVINO is not enabled")

try:
    import torch_tensorrt  # nopycln: import

    logger.info("Torch TensorRT enabled")
except ImportError:
    logger.warning("Torch TensorRT not enabled")
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Model or batch size exceeds GPU VRAM | Reduce batch size, use model parallelism, or use a GPU with more VRAM |
| `torch.cuda.is_available() returns False` | CUDA not installed or driver mismatch | Install matching NVIDIA driver and CUDA toolkit |
| `RuntimeError: CUDA error: device-side assert triggered` | Invalid tensor operation on GPU | Check input shapes and data types; often caused by index out of bounds |
| `OpenVINO is not enabled` | OpenVINO not installed | `pip install openvino` if OpenVINO backend is desired |
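One common mitigation for `CUDA out of memory` is to halve the batch and retry. A minimal sketch of that strategy, using a generic `MemoryError` stand-in (in real deployments PyTorch raises `torch.cuda.OutOfMemoryError`, a `RuntimeError` subclass, and `infer` would be your model's inference call):

```python
def run_with_batch_backoff(infer, batch_size, min_batch_size=1):
    """Retry inference with progressively smaller batches until it fits
    in VRAM or the minimum batch size is reached."""
    while batch_size >= min_batch_size:
        try:
            return infer(batch_size)
        except MemoryError:
            batch_size //= 2  # halve and retry
    raise MemoryError("batch does not fit in GPU memory even at minimum size")

# Simulated inference that only fits batches of 4 or fewer:
def fake_infer(batch_size):
    if batch_size > 4:
        raise MemoryError
    return batch_size

print(run_with_batch_backoff(fake_infer, 32))  # 4
```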
Compatibility Notes
- CUDA 9.2: Not supported on Windows. Oldest supported CUDA version.
- CUDA 11.8: Recommended stable version with broad hardware support.
- CUDA 12.1: Latest supported version for newest GPUs.
- ROCm (AMD): Supported via `torch.version.hip` detection. Versions 6.0, 6.1, 6.2 supported. Not available on macOS or Windows.
- Apple MPS: Supported for macOS with Apple Silicon (M1/M2+). Auto-detected via `torch.backends.mps.is_available()`.
- Intel XPU: Requires `TS_IPEX_GPU_ENABLE=true` environment variable and `intel_extension_for_pytorch` package.
- Google TPU/XLA: Supported via `torch_xla` package. Auto-detected at import.
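The driver minimums from the Dependencies section can be checked programmatically. This hypothetical helper only covers the CUDA 11.x/12.x minimums stated in this document; consult NVIDIA's compatibility matrix for exact per-release requirements:

```python
def minimum_driver_version(cuda_version):
    """Minimum NVIDIA driver major version per this document:
    >= 450 for CUDA 11.x, >= 525 for CUDA 12.x."""
    major = int(cuda_version.split(".")[0])
    if major >= 12:
        return 525
    if major == 11:
        return 450
    raise ValueError(f"driver minimum not listed here for CUDA {cuda_version}")

print(minimum_driver_version("12.1"))  # 525
print(minimum_driver_version("11.8"))  # 450
```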