Environment: TorchServe CUDA GPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
NVIDIA GPU environment with CUDA toolkit for GPU-accelerated model inference on TorchServe.
Description
This environment provides GPU-accelerated inference for TorchServe. It requires an NVIDIA GPU with a compatible CUDA toolkit. The base handler auto-detects CUDA availability and GPU device capabilities at startup. For Ampere-generation GPUs (compute capability >= 8.0, e.g., A10G, A100, H100), TorchServe automatically enables tensor core optimizations. The environment supports CUDA versions from 9.2 through 12.1, with specific PyTorch wheel packages pinned per CUDA version. Additional optional accelerators (TensorRT, OpenVINO, IPEX) extend the GPU inference capabilities.
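The Ampere gate described above is just a tuple comparison on the device's compute capability. A minimal sketch of that check, using a hypothetical helper name (the real logic lives in `ts/torch_handler/base_handler.py`, shown under Code Evidence below):

```python
def supports_tensor_core_optimization(capability):
    """Mirror of the startup check TorchServe applies: Ampere and newer
    GPUs (compute capability >= 8.0) get tensor core optimizations."""
    return capability >= (8, 0)

# An A10G/A100/H100 reports (8, 0) or higher; a T4 reports (7, 5).
print(supports_tensor_core_optimization((8, 0)))  # True
print(supports_tensor_core_optimization((7, 5)))  # False
```

In a live environment the capability tuple comes from `torch.cuda.get_device_capability()`.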
Usage
Use this environment for any TorchServe deployment requiring GPU-accelerated inference. This includes standard model serving on GPU, vLLM-based LLM serving, DeepSpeed distributed inference, tensor parallel inference, and any handler that targets `deviceType: gpu`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | CUDA not supported on macOS; limited Windows support |
| Hardware | NVIDIA GPU | Compute capability >= 3.5 minimum; >= 8.0 for tensor core optimization |
| VRAM | Depends on model | 8GB minimum for typical models; 40-80GB for large LLMs |
| Driver | NVIDIA Driver >= 450 | Must match CUDA toolkit version |
| Disk | 10GB+ | CUDA toolkit and model weights |
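The table's hardware minimums can be encoded as a simple pre-flight check. This is an illustrative helper (not part of TorchServe), assuming the minimums stated above: compute capability >= 3.5, 8 GB VRAM for typical models, and driver >= 450:

```python
def meets_minimum_requirements(compute_capability, vram_gb, driver_version):
    """Check a GPU against the minimums in the System Requirements table.
    compute_capability is a (major, minor) tuple; driver_version is the
    major NVIDIA driver number."""
    return (
        compute_capability >= (3, 5)
        and vram_gb >= 8
        and driver_version >= 450
    )

print(meets_minimum_requirements((8, 0), 24, 525))  # True
print(meets_minimum_requirements((3, 0), 8, 450))   # False: capability too low
```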
Dependencies
System Packages
- NVIDIA GPU driver (>= 450 for CUDA 11.x, >= 525 for CUDA 12.x)
- CUDA Toolkit: one of 9.2, 10.1, 10.2, 11.1, 11.3, 11.6, 11.7, 11.8, 12.1
- cuDNN (matching CUDA version)
Python Packages
- `torch` with CUDA support (e.g., `torch==2.4.0+cu118` for CUDA 11.8)
- `torchvision` with matching CUDA build
- `torchaudio` with matching CUDA build
- `pynvml==11.5.0` (GPU metrics collection)
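PyTorch wheels on `download.pytorch.org` are tagged per CUDA version (`+cu118` for CUDA 11.8, and so on). A hypothetical helper that maps a toolkit version to that suffix, assuming the usual `cuXYZ` naming convention:

```python
def cuda_wheel_tag(cuda_version):
    """Map a CUDA toolkit version string like '11.8' to the PyTorch
    wheel suffix used on download.pytorch.org, e.g. 'cu118'."""
    return "cu" + cuda_version.replace(".", "")

print(cuda_wheel_tag("11.8"))  # cu118
print(cuda_wheel_tag("12.1"))  # cu121
```

The exact wheel versions pinned per CUDA release should be taken from TorchServe's own requirements files rather than computed.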
Optional Accelerators
- `torch_tensorrt` — NVIDIA TensorRT integration for optimized inference
- `openvino.torch` — OpenVINO backend for `torch.compile`
- `intel_extension_for_pytorch` — Intel GPU (XPU) via `TS_IPEX_GPU_ENABLE=true`
Environment Variables
- `TS_IPEX_GPU_ENABLE`: Set to `true` to enable Intel GPU (XPU) support instead of CUDA.
- `TS_IPEX_ENABLE`: Set to `true` to enable Intel Extension for PyTorch CPU optimizations.
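The XPU gate is a strict string comparison, as the Code Evidence below shows: only the exact value `true` enables it. A standalone mirror of that check (the function name is illustrative):

```python
import os

def ipex_gpu_enabled(environ=os.environ):
    """Replicates the gate in base_handler.py: Intel XPU is only
    considered when TS_IPEX_GPU_ENABLE is exactly the string 'true'."""
    return environ.get("TS_IPEX_GPU_ENABLE", "false") == "true"

print(ipex_gpu_enabled({"TS_IPEX_GPU_ENABLE": "true"}))  # True
print(ipex_gpu_enabled({"TS_IPEX_GPU_ENABLE": "True"}))  # False: case-sensitive
print(ipex_gpu_enabled({}))                              # False: defaults off
```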
Quick Install
# Install PyTorch with CUDA 11.8
pip install torch==2.4.0+cu118 torchvision==0.19.0+cu118 torchaudio==2.4.0+cu118 \
--extra-index-url https://download.pytorch.org/whl/cu118
# Install TorchServe
pip install torchserve torch-model-archiver
# Optional: TensorRT support
pip install torch_tensorrt
Code Evidence
CUDA detection and device selection from `ts/torch_handler/base_handler.py:155-177`:
if torch.cuda.is_available() and properties.get("gpu_id") is not None:
    self.map_location = "cuda"
    self.device = torch.device(
        self.map_location + ":" + str(properties.get("gpu_id"))
    )
elif (
    os.environ.get("TS_IPEX_GPU_ENABLE", "false") == "true"
    and properties.get("gpu_id") is not None
    and torch.xpu.is_available()
):
    self.map_location = "xpu"
    self.device = torch.device(
        self.map_location + ":" + str(properties.get("gpu_id"))
    )
elif torch.backends.mps.is_available() and properties.get("gpu_id") is not None:
    self.map_location = "mps"
    self.device = torch.device("mps")
elif XLA_AVAILABLE:
    self.device = xm.xla_device()
else:
    self.map_location = "cpu"
    self.device = torch.device(self.map_location)
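The selection order above (CUDA, then Intel XPU, then Apple MPS, then XLA, then CPU) can be exercised without any GPU libraries by factoring it into a pure function. This is a hypothetical standalone mirror for illustration, with availability passed in as flags rather than probed from `torch`:

```python
def select_device(gpu_id, cuda=False, ipex_gpu=False, xpu=False,
                  mps=False, xla=False):
    """Return the device string the handler would construct, following
    the priority order from base_handler.py. gpu_id is None when no GPU
    slot was assigned to the worker."""
    if cuda and gpu_id is not None:
        return f"cuda:{gpu_id}"
    if ipex_gpu and gpu_id is not None and xpu:
        return f"xpu:{gpu_id}"
    if mps and gpu_id is not None:
        return "mps"
    if xla:
        return "xla"
    return "cpu"

print(select_device(0, cuda=True))               # cuda:0
print(select_device(1, ipex_gpu=True, xpu=True)) # xpu:1
print(select_device(None, cuda=True))            # cpu: no gpu_id assigned
```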
Ampere tensor core auto-enable from `ts/torch_handler/base_handler.py:44-49`:
if torch.cuda.is_available() and torch.version.cuda:
    # If Ampere enable tensor cores which will give better performance
    # Ideally get yourself an A10G or A100 for optimal performance
    if torch.cuda.get_device_capability() >= (8, 0):
        torch.set_float32_matmul_precision("high")
        logger.info("Enabled tensor cores")
Optional accelerator imports from `ts/torch_handler/base_handler.py:61-98`:
try:
    import openvino.torch  # nopycln: import

    logger.info("OpenVINO backend enabled for torch.compile")
except ImportError:
    logger.warning("OpenVINO is not enabled")

try:
    import torch_tensorrt  # nopycln: import

    logger.info("Torch TensorRT enabled")
except ImportError:
    logger.warning("Torch TensorRT not enabled")
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Model or batch size exceeds GPU VRAM | Reduce batch size, use model parallelism, or use a GPU with more VRAM |
| `torch.cuda.is_available() returns False` | CUDA not installed or driver mismatch | Install matching NVIDIA driver and CUDA toolkit |
| `RuntimeError: CUDA error: device-side assert triggered` | Invalid tensor operation on GPU | Check input shapes and data types; often caused by index out of bounds |
| `OpenVINO is not enabled` | OpenVINO not installed | `pip install openvino` if OpenVINO backend is desired |
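One common mitigation for `CUDA out of memory` is to halve the batch and retry. A minimal sketch of that strategy, using a generic `MemoryError` stand-in (in real deployments PyTorch raises `torch.cuda.OutOfMemoryError`, a `RuntimeError` subclass, and `infer` would be your model's inference call):

```python
def run_with_batch_backoff(infer, batch_size, min_batch_size=1):
    """Retry inference with progressively smaller batches until it fits
    in VRAM or the minimum batch size is reached."""
    while batch_size >= min_batch_size:
        try:
            return infer(batch_size)
        except MemoryError:
            batch_size //= 2  # halve and retry
    raise MemoryError("batch does not fit in GPU memory even at minimum size")

# Simulated inference that only fits batches of 4 or fewer:
def fake_infer(batch_size):
    if batch_size > 4:
        raise MemoryError
    return batch_size

print(run_with_batch_backoff(fake_infer, 32))  # 4
```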
Compatibility Notes
- CUDA 9.2: Not supported on Windows. Oldest supported CUDA version.
- CUDA 11.8: Recommended stable version with broad hardware support.
- CUDA 12.1: Latest supported version for newest GPUs.
- ROCm (AMD): Supported via `torch.version.hip` detection. Versions 6.0, 6.1, 6.2 supported. Not available on macOS or Windows.
- Apple MPS: Supported for macOS with Apple Silicon (M1/M2+). Auto-detected via `torch.backends.mps.is_available()`.
- Intel XPU: Requires `TS_IPEX_GPU_ENABLE=true` environment variable and `intel_extension_for_pytorch` package.
- Google TPU/XLA: Supported via `torch_xla` package. Auto-detected at import.
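The driver minimums from the Dependencies section can be checked programmatically. This hypothetical helper only covers the CUDA 11.x/12.x minimums stated in this document; consult NVIDIA's compatibility matrix for exact per-release requirements:

```python
def minimum_driver_version(cuda_version):
    """Minimum NVIDIA driver major version per this document:
    >= 450 for CUDA 11.x, >= 525 for CUDA 12.x."""
    major = int(cuda_version.split(".")[0])
    if major >= 12:
        return 525
    if major == 11:
        return 450
    raise ValueError(f"driver minimum not listed here for CUDA {cuda_version}")

print(minimum_driver_version("12.1"))  # 525
print(minimum_driver_version("11.8"))  # 450
```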