Environment: TensorFlow Serving GPU CUDA Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
NVIDIA GPU environment with CUDA 12.2, cuDNN 8.9.4.25, and TensorRT 8.6.1 on Ubuntu 20.04 for GPU-accelerated model serving and batched inference.
Description
This environment provides GPU acceleration for TensorFlow Serving inference. It is built on the NVIDIA CUDA 12.2 base image and includes the full CUDA toolkit, cuDNN 8.9 for deep learning primitives, and optional TensorRT 8.6 for inference optimization. GPU support targets NVIDIA compute capabilities 6.0 through 9.0 (Pascal through Hopper architectures), compiled using Clang 17 as the CUDA compiler. TPU support is also available as a separate build configuration.
Usage
Use this environment when serving models that require GPU acceleration for inference, particularly when batching is enabled. GPU serving is essential for achieving high throughput on compute-intensive models (e.g., large neural networks). The `--config=cuda` or `--config=cuda_clang` build flags activate GPU support at compile time. At runtime, use `--per_process_gpu_memory_fraction` to control GPU memory allocation.
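A typical invocation combining batching with the GPU memory flag might look like the following (a sketch; the model name and path are placeholders, and `tensorflow_model_server` must be on `PATH`, e.g. inside the `tensorflow/serving:latest-gpu` image):

```shell
# Serve one model over REST with batching and 80% of GPU memory pre-allocated.
# Paths and model name are hypothetical -- adjust to your deployment.
tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --enable_batching=true \
  --per_process_gpu_memory_fraction=0.8
```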
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 20.04 LTS | Base image for GPU Docker builds |
| Hardware | NVIDIA GPU with Compute Capability >= 6.0 | Pascal (sm_60) through Hopper (compute_90) |
| CUDA Toolkit | 12.2.0 | Hermetic version enforced in build |
| cuDNN | 8.9.4.25 | Deep learning primitives library |
| TensorRT | 8.6.1 | Optional; set `TF_NEED_TENSORRT=0` to disable |
| NCCL | 2.18.5 | Multi-GPU communication library |
| GPU Driver | Compatible with CUDA 12.2 | NVIDIA driver >= 525.60.13 |
| Compiler | Clang 17 | Required for CUDA compilation |
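Driver versions must be compared numerically, not lexically (`"535.104.05" < "60"` as strings). A small helper for checking the driver requirement from the table, assuming you obtain the installed version yourself (e.g. from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`):

```python
def meets_min_driver(installed: str, minimum: str = "525.60.13") -> bool:
    """Return True if a dotted NVIDIA driver version meets the CUDA 12.2 minimum.

    Compares component-wise as integers, so "535.104.05" > "525.60.13"
    even though it sorts lower as a plain string.
    """
    def to_tuple(version: str):
        return tuple(int(part) for part in version.split("."))
    return to_tuple(installed) >= to_tuple(minimum)
```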
Dependencies
System Packages (CUDA)
- `cuda-command-line-tools-12-2`
- `cuda-cudart-dev-12-2`
- `cuda-nvcc-12-2`
- `cuda-cupti-12-2`
- `libcublas-12-2` (with `-dev`)
- `libcufft-12-2` (with `-dev`)
- `libcurand-12-2` (with `-dev`)
- `libcusolver-12-2` (with `-dev`)
- `libcusparse-12-2` (with `-dev`)
- `libnccl2` and `libnccl-dev`
- `libcudnn8` and `libcudnn8-dev`
Compiler
- `clang-17`
- `llvm-17`
- `lld-17`
Credentials
No credentials are required for GPU access. For TPU builds:
- GCE access: TPU builds (`--config=tpu`) assume running on Google Compute Engine with `LIBTPU_ON_GCE` defined.
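For reference, a TPU build invocation might look like this (a sketch; the target path follows the standard TensorFlow Serving source layout and assumes a full checkout on GCE):

```shell
bazel build --config=tpu -c opt \
  tensorflow_serving/model_servers:tensorflow_model_server
```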
Quick Install
```shell
# Use the pre-built GPU Docker image (recommended)
docker pull tensorflow/serving:latest-gpu

# Run with GPU support
docker run --gpus all -p 8501:8501 \
  --mount type=bind,source=/path/to/model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving:latest-gpu

# Or build from source with CUDA
bazel build --config=cuda_clang -c opt tensorflow_serving/...
```
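Once the server is up, TensorFlow Serving exposes a REST predict endpoint at `/v1/models/<name>:predict`. A minimal stdlib-only client sketch (the model name and the four-element input are placeholders; adapt them to your SavedModel's signature):

```python
import json
import urllib.request

# Hypothetical model name and input shape -- match these to your model.
url = "http://localhost:8501/v1/models/my_model:predict"
payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# The request below requires a running server, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     predictions = json.loads(resp.read())["predictions"]
```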
Code Evidence
CUDA build configuration from `.bazelrc:5-8`:
```
# Options used to build with CUDA.
build:cuda --repo_env TF_NEED_CUDA=1
build:cuda --crosstool_top=@local_config_cuda//crosstool:toolchain
build:cuda --@local_config_cuda//:enable_cuda
```
GPU compute capabilities from `.bazelrc:18`:
```
build:cuda_clang --repo_env=TF_CUDA_COMPUTE_CAPABILITIES="sm_60,sm_70,sm_80,compute_90"
```
Hermetic CUDA/cuDNN versions from `.bazelrc:20-21`:
```
build:cuda_clang --repo_env=HERMETIC_CUDA_VERSION="12.2.0"
build:cuda_clang --repo_env=HERMETIC_CUDNN_VERSION="8.9.4.25"
```
GPU memory fraction control from `main.cc:228-234`:
```
tensorflow::Flag(
    "per_process_gpu_memory_fraction",
    &options.per_process_gpu_memory_fraction,
    "Fraction that each process occupies of the GPU memory space "
    "the value is between 0.0 and 1.0 (with 0.0 as the default) "
    "If 1.0, the server will allocate all the memory when the server "
    "starts, If 0.0, Tensorflow will automatically select a value."),
```
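The flag's three regimes can be summarized in a small validation helper (a sketch, not TF Serving code; the semantics are taken directly from the flag help above):

```python
def gpu_memory_fraction_policy(fraction: float) -> str:
    """Describe what per_process_gpu_memory_fraction implies for the server.

    Per the flag help: 0.0 (the default) lets TensorFlow select a value
    automatically; 1.0 allocates all GPU memory at startup; anything in
    between pre-allocates that fraction of each GPU's memory.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("per_process_gpu_memory_fraction must be in [0.0, 1.0]")
    if fraction == 0.0:
        return "auto"
    return f"preallocate {fraction:.0%} of GPU memory"
```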
TPU support from `.bazelrc:28-30`:
```
# Options used to build with TPU support.
build:tpu --define=with_tpu_support=true --define=framework_shared_object=false
build:tpu --copt=-DLIBTPU_ON_GCE
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Failed to initialize TPU system` | TPU not available or not on GCE | Verify TPU hardware is accessible; TPU builds require GCE environment |
| `CUDA driver version is insufficient for CUDA runtime version` | GPU driver too old | Update NVIDIA driver to >= 525.60.13 for CUDA 12.2 |
| `Could not load dynamic library 'libcudnn.so.8'` | cuDNN not installed | Install `libcudnn8` matching CUDA 12.2 |
| GPU OOM during serving | Model too large for GPU memory | Reduce `--per_process_gpu_memory_fraction` or use a GPU with more VRAM |
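Several of the errors above come down to shared libraries the dynamic linker cannot find. A dependency-free diagnostic sketch (on Linux, `find_library` consults the same `ldconfig` cache the loader uses; on a machine without the CUDA stack, every entry will simply be `False`):

```python
from ctypes.util import find_library

def cuda_libs_present() -> dict:
    """Report which CUDA-stack shared libraries are visible to the linker."""
    names = ["cudart", "cublas", "cudnn", "nccl"]
    return {name: find_library(name) is not None for name in names}
```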
Compatibility Notes
- Compute Capability: Minimum sm_60 (Pascal). Older GPUs (Maxwell, Kepler) are not supported.
- TPU: Requires separate build config (`--config=tpu`). TPU builds disable session run timeout and use `tpu,serve` SavedModel tags by default.
- ARM GPUs: Not supported. ARM builds (`mkl_aarch64`) are CPU-inference only.
- Pre-built images: `tensorflow/serving:latest-gpu` is available on Docker Hub for users who do not need custom GPU builds.