
Environment: BentoML NVIDIA GPU Resource

From Leeroopedia
Knowledge Sources
Domains Infrastructure, GPU_Computing
Last Updated 2026-02-13 16:00 GMT

Overview

NVIDIA GPU runtime environment using `pynvml` (nvidia-ml-py) for GPU detection, memory querying, and CUDA device assignment in BentoML runners and services.

Description

BentoML supports NVIDIA GPU acceleration for model serving through its resource management system. GPU detection is performed via the `pynvml` library (Python bindings for NVML), which queries the NVIDIA driver for device count and memory information. The `NvidiaGpuResource` class handles GPU device enumeration, validation, and assignment to workers via `CUDA_VISIBLE_DEVICES`. The resource system integrates with the runner strategy to automatically determine worker counts and GPU assignments. Supported GPU types for BentoCloud deployments include NVIDIA B200, GB200, H200, H100, A100, A10G, L4, T4, V100, P100, K80, P4, and RTX Pro 6000, as well as AMD MI300X, MI325X, MI355X GPUs.
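
For BentoCloud deployments, the GPU type is requested through the service's resources configuration. A minimal sketch, assuming BentoML's deployment resources schema (the `gpu_type` value must be one of the `GpuLiteralType` literals shown under Code Evidence below):

```yaml
# Sketch: request one T4 GPU for a BentoCloud deployment.
# Field names assume BentoML's resources schema; adjust per service.
resources:
  gpu: 1
  gpu_type: "nvidia-tesla-t4"
```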

Usage

Use this environment when running GPU-accelerated model inference through BentoML. Required for any service or runner that declares `"nvidia.com/gpu"` in its `SUPPORTED_RESOURCES` tuple, including frameworks like PyTorch, TensorFlow, ONNX Runtime, Transformers, Diffusers, CatBoost, XGBoost, Detectron2, EasyOCR, and Keras.
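
A runnable advertises GPU support by declaring `"nvidia.com/gpu"` in its `SUPPORTED_RESOURCES` tuple. A self-contained sketch (`RunnableStub` and `supports_gpu` are illustrative stand-ins, not BentoML API; a real runnable would subclass `bentoml.Runnable`):

```python
class RunnableStub:
    """Stand-in for bentoml.Runnable so the example is self-contained."""


class MyGPURunnable(RunnableStub):
    # The runner strategy inspects this tuple to decide whether GPU
    # workers may be scheduled for this runnable.
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = True


def supports_gpu(runnable_class: type) -> bool:
    # Mirrors the membership test the strategy performs.
    return "nvidia.com/gpu" in getattr(runnable_class, "SUPPORTED_RESOURCES", ())
```

If the tuple omits `"nvidia.com/gpu"`, the strategy schedules the runnable on CPU only.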

System Requirements

| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Linux or Windows | GPU detection via NVML; not available on macOS |
| Hardware | NVIDIA GPU | Any GPU supported by the installed NVIDIA driver |
| Driver | NVIDIA GPU Driver | Must be loaded for `pynvml` to initialize |
| Library | `nvidia-ml-py` (`pynvml`) | Included in BentoML core dependencies |

Dependencies

System Packages

  • NVIDIA GPU Driver (installed at OS level)
  • CUDA Toolkit (if required by specific ML frameworks)

Python Packages

  • `nvidia-ml-py` (included in BentoML core deps, provides `pynvml`)
  • Framework-specific packages: `torch`, `tensorflow`, `onnxruntime-gpu`, etc. (per model needs)

Credentials

No specific credentials required. The following environment variable is set by the runner strategy:

  • `CUDA_VISIBLE_DEVICES`: Automatically set by `DefaultStrategy.get_worker_env()` to control which GPU(s) a worker can access. Set to `"-1"` to disable GPU for CPU-only workers.
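
The behavior can be sketched with a small helper (`worker_env` is a hypothetical name mirroring the behavior described above, not BentoML's actual function):

```python
import os
from typing import Dict, List, Optional


def worker_env(assigned_gpus: Optional[List[int]]) -> Dict[str, str]:
    """Build a worker environment with CUDA_VISIBLE_DEVICES set.

    An empty list or None means a CPU-only worker, which gets "-1" so
    CUDA frameworks see no devices at all.
    """
    env = dict(os.environ)
    if assigned_gpus:
        env["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, assigned_gpus))
    else:
        env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env
```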

Quick Install

# nvidia-ml-py is already a core BentoML dependency
pip install bentoml

# Verify GPU detection
python -c "import pynvml; pynvml.nvmlInit(); print(f'GPUs: {pynvml.nvmlDeviceGetCount()}'); pynvml.nvmlShutdown()"

Code Evidence

GPU detection from `resource.py:240-257`:

class NvidiaGpuResource(Resource[t.List[int]], resource_id="nvidia.com/gpu"):
    @classmethod
    @functools.lru_cache(maxsize=1)
    def from_system(cls) -> list[int]:
        import pynvml
        try:
            pynvml.nvmlInit()
            device_count = pynvml.nvmlDeviceGetCount()
            return list(range(device_count))
        except (
            pynvml.NVMLError_LibraryNotFound,
            pynvml.NVMLError_DriverNotLoaded,
            OSError,
        ):
            logger.debug("GPU not detected. Unable to initialize pynvml lib.")
            return []

GPU memory query from `resource.py:274-306`:

def get_gpu_memory(dev: int) -> t.Tuple[float, float]:
    """Return Total Memory and Free Memory in given GPU device. in MiB"""
    import pynvml.nvml
    from pynvml.smi import nvidia_smi
    try:
        inst = nvidia_smi.getInstance()
        query = inst.DeviceQuery(dev)
    except (pynvml.nvml.NVMLError, OSError):
        return 0.0, 0.0
    # ... (remainder of the cited range elided: the success path parses
    # total/free memory from `query`)
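
The excerpt above routes through `pynvml.smi`; an equivalent query can be sketched with the raw NVML bindings (`gpu_memory_mib` is an illustrative name, not BentoML API; it falls back to `(0.0, 0.0)` when no driver or GPU is present, mirroring the excerpt):

```python
from typing import Tuple


def gpu_memory_mib(dev: int) -> Tuple[float, float]:
    """Return (total_mib, free_mib) for GPU `dev`, or (0.0, 0.0) on failure."""
    try:
        import pynvml

        pynvml.nvmlInit()
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(dev)
            info = pynvml.nvmlDeviceGetMemoryInfo(handle)  # values in bytes
            return info.total / 1024**2, info.free / 1024**2
        finally:
            pynvml.nvmlShutdown()
    except Exception:  # ImportError, NVMLError, OSError, ...
        return 0.0, 0.0
```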

CUDA_VISIBLE_DEVICES assignment from `strategy.py:150-162`:

assigned_gpu = nvidia_gpus[
    assigned_resource_per_worker * worker_index :
    assigned_resource_per_worker * (worker_index + 1)
]
dev = ",".join(map(str, assigned_gpu))
environ["CUDA_VISIBLE_DEVICES"] = dev
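
The slicing above can be reproduced standalone: with `workers_per_resource = 0.5`, each worker receives `1 / 0.5 = 2` GPUs (`assign_gpus` is an illustrative helper, not BentoML's function):

```python
from typing import List


def assign_gpus(
    nvidia_gpus: List[int], workers_per_resource: float, worker_index: int
) -> List[int]:
    # A fractional workers_per_resource (< 1.0) means each worker
    # is assigned multiple GPUs, as in the default strategy.
    assigned_resource_per_worker = int(1 / workers_per_resource)
    return nvidia_gpus[
        assigned_resource_per_worker * worker_index :
        assigned_resource_per_worker * (worker_index + 1)
    ]
```

For example, `assign_gpus([0, 1, 2, 3], 0.5, 1)` returns `[2, 3]`: worker 1 gets the second slice of two GPUs.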

Supported GPU types for cloud deployment from `service/config.py:20-38`:

GpuLiteralType = Literal[
    "nvidia-b200", "nvidia-gb200", "nvidia-rtx-pro-6000",
    "nvidia-h200-141gb", "nvidia-tesla-h100", "nvidia-tesla-t4",
    "nvidia-tesla-a100", "nvidia-a100-80gb", "nvidia-h100-80gb",
    "nvidia-a10g", "nvidia-l4", "nvidia-tesla-v100",
    "nvidia-tesla-p100", "nvidia-tesla-k80", "nvidia-tesla-p4",
    "amd-mi300x", "amd-mi325x", "amd-mi355x",
]

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| `NVMLError_LibraryNotFound` | NVIDIA driver not installed or `pynvml` cannot find NVML | Install the NVIDIA GPU driver; ensure `libnvidia-ml.so` is on the library path |
| `NVMLError_DriverNotLoaded` | NVIDIA driver installed but not loaded | Restart the machine or run `sudo modprobe nvidia` |
| `BentoMLConfigException: GPU device index in X is greater than system available` | Requested GPU index exceeds the physical GPU count | Reduce the GPU count in the resource config or check `nvidia-smi` output |
| `IndexError: There aren't enough assigned GPU(s) for given worker id` | `workers_per_resource` < 1.0 requires more GPUs than are available | Reduce `workers_per_resource` or add more GPUs |

Compatibility Notes

  • macOS: NVIDIA GPUs are not supported on macOS. GPU detection will return an empty list.
  • CPU fallback: If no GPU is detected and the runnable supports CPU, BentoML falls back to CPU with a warning: "No known supported resource available for X, falling back to using CPU."
  • Fractional GPU: Setting `workers_per_resource` to a float like 0.5 assigns multiple GPUs per worker. Values > 1 are not supported in the default strategy.
  • CUDA_VISIBLE_DEVICES: Set to `"-1"` for CPU-only workers to ensure GPU is fully disabled.
