Environment: TensorFlow Serving GPU CUDA Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
NVIDIA GPU environment with CUDA 12.2, cuDNN 8.9.4.25, and TensorRT 8.6.1 on Ubuntu 20.04 for GPU-accelerated model serving and batched inference.
Description
This environment provides GPU acceleration for TensorFlow Serving inference. It is built on the NVIDIA CUDA 12.2 base image and includes the full CUDA toolkit, cuDNN 8.9 for deep learning primitives, and optional TensorRT 8.6 for inference optimization. GPU support targets NVIDIA compute capabilities 6.0 through 9.0 (Pascal through Hopper architectures), compiled using Clang 17 as the CUDA compiler. TPU support is also available as a separate build configuration.
Usage
Use this environment when serving models that require GPU acceleration for inference, particularly when batching is enabled. GPU serving is essential for achieving high throughput on compute-intensive models (e.g., large neural networks). The `--config=cuda` or `--config=cuda_clang` build flags activate GPU support at compile time. At runtime, use `--per_process_gpu_memory_fraction` to control GPU memory allocation.
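A typical invocation combining batching with the GPU memory flag might look like the following (a sketch; the model name and path are placeholders, and `tensorflow_model_server` must be on `PATH`, e.g. inside the `tensorflow/serving:latest-gpu` image):

```shell
# Serve one model over REST with batching and 80% of GPU memory pre-allocated.
# Paths and model name are hypothetical -- adjust to your deployment.
tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --enable_batching=true \
  --per_process_gpu_memory_fraction=0.8
```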
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 20.04 LTS | Base image for GPU Docker builds |
| Hardware | NVIDIA GPU with Compute Capability >= 6.0 | Pascal (sm_60) through Hopper (compute_90) |
| CUDA Toolkit | 12.2.0 | Hermetic version enforced in build |
| cuDNN | 8.9.4.25 | Deep learning primitives library |
| TensorRT | 8.6.1 | Optional; set `TF_NEED_TENSORRT=0` to disable |
| NCCL | 2.18.5 | Multi-GPU communication library |
| GPU Driver | Compatible with CUDA 12.2 | NVIDIA driver >= 525.60.13 |
| Compiler | Clang 17 | Required for CUDA compilation |
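Driver versions must be compared numerically, not lexically (`"535.104.05" < "60"` as strings). A small helper for checking the driver requirement from the table, assuming you obtain the installed version yourself (e.g. from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`):

```python
def meets_min_driver(installed: str, minimum: str = "525.60.13") -> bool:
    """Return True if a dotted NVIDIA driver version meets the CUDA 12.2 minimum.

    Compares component-wise as integers, so "535.104.05" > "525.60.13"
    even though it sorts lower as a plain string.
    """
    def to_tuple(version: str):
        return tuple(int(part) for part in version.split("."))
    return to_tuple(installed) >= to_tuple(minimum)
```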
Dependencies
System Packages (CUDA)
- `cuda-command-line-tools-12-2`
- `cuda-cudart-dev-12-2`
- `cuda-nvcc-12-2`
- `cuda-cupti-12-2`
- `libcublas-12-2` (with `-dev`)
- `libcufft-12-2` (with `-dev`)
- `libcurand-12-2` (with `-dev`)
- `libcusolver-12-2` (with `-dev`)
- `libcusparse-12-2` (with `-dev`)
- `libnccl2` and `libnccl-dev`
- `libcudnn8` and `libcudnn8-dev`
Compiler
- `clang-17`
- `llvm-17`
- `lld-17`
Credentials
No credentials are required for GPU access. For TPU builds:
- GCE access: TPU builds (`--config=tpu`) assume running on Google Compute Engine with `LIBTPU_ON_GCE` defined.
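For reference, a TPU build invocation might look like this (a sketch; the target path follows the standard TensorFlow Serving source layout and assumes a full checkout on GCE):

```shell
bazel build --config=tpu -c opt \
  tensorflow_serving/model_servers:tensorflow_model_server
```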
Quick Install
```shell
# Use the pre-built GPU Docker image (recommended)
docker pull tensorflow/serving:latest-gpu

# Run with GPU support
docker run --gpus all -p 8501:8501 \
  --mount type=bind,source=/path/to/model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving:latest-gpu

# Or build from source with CUDA
bazel build --config=cuda_clang -c opt tensorflow_serving/...
```
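Once the server is up, TensorFlow Serving exposes a REST predict endpoint at `/v1/models/<name>:predict`. A minimal stdlib-only client sketch (the model name and the four-element input are placeholders; adapt them to your SavedModel's signature):

```python
import json
import urllib.request

# Hypothetical model name and input shape -- match these to your model.
url = "http://localhost:8501/v1/models/my_model:predict"
payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# The request below requires a running server, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     predictions = json.loads(resp.read())["predictions"]
```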
Code Evidence
CUDA build configuration from `.bazelrc:5-8`:
```
# Options used to build with CUDA.
build:cuda --repo_env TF_NEED_CUDA=1
build:cuda --crosstool_top=@local_config_cuda//crosstool:toolchain
build:cuda --@local_config_cuda//:enable_cuda
```
GPU compute capabilities from `.bazelrc:18`:
```
build:cuda_clang --repo_env=TF_CUDA_COMPUTE_CAPABILITIES="sm_60,sm_70,sm_80,compute_90"
```
Hermetic CUDA/cuDNN versions from `.bazelrc:20-21`:
```
build:cuda_clang --repo_env=HERMETIC_CUDA_VERSION="12.2.0"
build:cuda_clang --repo_env=HERMETIC_CUDNN_VERSION="8.9.4.25"
```
GPU memory fraction control from `main.cc:228-234`:
```
tensorflow::Flag(
    "per_process_gpu_memory_fraction",
    &options.per_process_gpu_memory_fraction,
    "Fraction that each process occupies of the GPU memory space "
    "the value is between 0.0 and 1.0 (with 0.0 as the default) "
    "If 1.0, the server will allocate all the memory when the server "
    "starts, If 0.0, Tensorflow will automatically select a value."),
```
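The flag's three regimes can be summarized in a small validation helper (a sketch, not TF Serving code; the semantics are taken directly from the flag help above):

```python
def gpu_memory_fraction_policy(fraction: float) -> str:
    """Describe what per_process_gpu_memory_fraction implies for the server.

    Per the flag help: 0.0 (the default) lets TensorFlow select a value
    automatically; 1.0 allocates all GPU memory at startup; anything in
    between pre-allocates that fraction of each GPU's memory.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("per_process_gpu_memory_fraction must be in [0.0, 1.0]")
    if fraction == 0.0:
        return "auto"
    return f"preallocate {fraction:.0%} of GPU memory"
```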
TPU support from `.bazelrc:28-30`:
```
# Options used to build with TPU support.
build:tpu --define=with_tpu_support=true --define=framework_shared_object=false
build:tpu --copt=-DLIBTPU_ON_GCE
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Failed to initialize TPU system` | TPU not available or not on GCE | Verify TPU hardware is accessible; TPU builds require GCE environment |
| `CUDA driver version is insufficient for CUDA runtime version` | GPU driver too old | Update NVIDIA driver to >= 525.60.13 for CUDA 12.2 |
| `Could not load dynamic library 'libcudnn.so.8'` | cuDNN not installed | Install `libcudnn8` matching CUDA 12.2 |
| GPU OOM during serving | Model too large for GPU memory | Reduce `--per_process_gpu_memory_fraction` or use a GPU with more VRAM |
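Several of the errors above come down to shared libraries the dynamic linker cannot find. A dependency-free diagnostic sketch (on Linux, `find_library` consults the same `ldconfig` cache the loader uses; on a machine without the CUDA stack, every entry will simply be `False`):

```python
from ctypes.util import find_library

def cuda_libs_present() -> dict:
    """Report which CUDA-stack shared libraries are visible to the linker."""
    names = ["cudart", "cublas", "cudnn", "nccl"]
    return {name: find_library(name) is not None for name in names}
```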
Compatibility Notes
- Compute Capability: Minimum sm_60 (Pascal). Older GPUs (Maxwell, Kepler) are not supported.
- TPU: Requires separate build config (`--config=tpu`). TPU builds disable session run timeout and use `tpu,serve` SavedModel tags by default.
- ARM GPUs: Not supported. ARM builds (`mkl_aarch64`) are CPU-inference only.
- Pre-built images: `tensorflow/serving:latest-gpu` is available on Docker Hub for users who do not need custom GPU builds.