
Environment:Tensorflow Serving GPU CUDA Environment

From Leeroopedia
Knowledge Sources
Domains Infrastructure, GPU_Computing
Last Updated 2026-02-13 17:00 GMT

Overview

NVIDIA GPU environment with CUDA 12.2, cuDNN 8.9.4.25, and TensorRT 8.6.1 on Ubuntu 20.04 for GPU-accelerated model serving and batched inference.

Description

This environment provides GPU acceleration for TensorFlow Serving inference. It is built on the NVIDIA CUDA 12.2 base image and includes the full CUDA toolkit, cuDNN 8.9 for deep learning primitives, and optional TensorRT 8.6 for inference optimization. GPU support targets NVIDIA compute capabilities 6.0 through 9.0 (Pascal through Hopper architectures), compiled using Clang 17 as the CUDA compiler. TPU support is also available as a separate build configuration.

Usage

Use this environment when serving models that require GPU acceleration for inference, particularly when batching is enabled. GPU serving is essential for achieving high throughput on compute-intensive models (e.g., large neural networks). The `--config=cuda` or `--config=cuda_clang` build flags activate GPU support at compile time. At runtime, use `--per_process_gpu_memory_fraction` to control GPU memory allocation.
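As a concrete sketch of the runtime flag (the model name and path are placeholders, not from the source), the memory fraction can be combined with the standard serving flags:

```shell
# Hypothetical model name/path; caps this process at 50% of GPU memory.
# With 0.0 (the default) TensorFlow selects the allocation automatically;
# with 1.0 the server allocates all GPU memory at startup.
tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --per_process_gpu_memory_fraction=0.5
```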

System Requirements

Category | Requirement | Notes
OS | Ubuntu 20.04 LTS | Base image for GPU Docker builds
Hardware | NVIDIA GPU with Compute Capability >= 6.0 | Pascal (sm_60) through Hopper (compute_90)
CUDA Toolkit | 12.2.0 | Hermetic version enforced in build
cuDNN | 8.9.4.25 | Deep learning primitives library
TensorRT | 8.6.1 | Optional; set `TF_NEED_TENSORRT=0` to disable
NCCL | 2.18.5 | Multi-GPU communication library
GPU Driver | Compatible with CUDA 12.2 | NVIDIA driver >= 525.60.13
Compiler | Clang 17 | Required for CUDA compilation

Dependencies

System Packages (CUDA)

  • `cuda-command-line-tools-12-2`
  • `cuda-cudart-dev-12-2`
  • `cuda-nvcc-12-2`
  • `cuda-cupti-12-2`
  • `libcublas-12-2` (with `-dev`)
  • `libcufft-12-2` (with `-dev`)
  • `libcurand-12-2` (with `-dev`)
  • `libcusolver-12-2` (with `-dev`)
  • `libcusparse-12-2` (with `-dev`)
  • `libnccl2` and `libnccl-dev`
  • `libcudnn8` and `libcudnn8-dev`
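Assuming NVIDIA's CUDA apt repository for Ubuntu 20.04 is already configured (an assumption; the pre-built Docker image bundles these packages already), the list above can be installed in one step:

```shell
# Sketch: install the CUDA 12.2 serving dependencies listed above.
# Requires the NVIDIA CUDA apt repository to be configured first.
apt-get update && apt-get install -y --no-install-recommends \
    cuda-command-line-tools-12-2 \
    cuda-cudart-dev-12-2 \
    cuda-nvcc-12-2 \
    cuda-cupti-12-2 \
    libcublas-12-2 libcublas-dev-12-2 \
    libcufft-12-2 libcufft-dev-12-2 \
    libcurand-12-2 libcurand-dev-12-2 \
    libcusolver-12-2 libcusolver-dev-12-2 \
    libcusparse-12-2 libcusparse-dev-12-2 \
    libnccl2 libnccl-dev \
    libcudnn8 libcudnn8-dev
```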

Compiler

  • `clang-17`
  • `llvm-17`
  • `lld-17`

Credentials

No credentials are required for GPU access. For TPU builds:

  • GCE access: TPU builds (`--config=tpu`) assume running on Google Compute Engine with `LIBTPU_ON_GCE` defined.

Quick Install

# Use the pre-built GPU Docker image (recommended)
docker pull tensorflow/serving:latest-gpu

# Run with GPU support
docker run --gpus all -p 8501:8501 \
  --mount type=bind,source=/path/to/model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving:latest-gpu

# Or build from source with CUDA
bazel build --config=cuda_clang -c opt tensorflow_serving/...
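Once the container is up, a quick REST check confirms the model loaded. This uses TensorFlow Serving's standard model-status endpoint; the model name and port match the run command above:

```shell
# Query model status over the REST API (port 8501 from the run command above).
curl http://localhost:8501/v1/models/my_model
# A healthy server reports each loaded version with state "AVAILABLE".
```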

Code Evidence

CUDA build configuration from `.bazelrc:5-8`:

# Options used to build with CUDA.
build:cuda --repo_env TF_NEED_CUDA=1
build:cuda --crosstool_top=@local_config_cuda//crosstool:toolchain
build:cuda --@local_config_cuda//:enable_cuda

GPU compute capabilities from `.bazelrc:18`:

build:cuda_clang --repo_env=TF_CUDA_COMPUTE_CAPABILITIES="sm_60,sm_70,sm_80,compute_90"

Hermetic CUDA/cuDNN versions from `.bazelrc:20-21`:

build:cuda_clang --repo_env=HERMETIC_CUDA_VERSION="12.2.0"
build:cuda_clang --repo_env=HERMETIC_CUDNN_VERSION="8.9.4.25"

GPU memory fraction control from `main.cc:228-234`:

tensorflow::Flag(
    "per_process_gpu_memory_fraction",
    &options.per_process_gpu_memory_fraction,
    "Fraction that each process occupies of the GPU memory space "
    "the value is between 0.0 and 1.0 (with 0.0 as the default) "
    "If 1.0, the server will allocate all the memory when the server "
    "starts, If 0.0, Tensorflow will automatically select a value."),

TPU support from `.bazelrc:28-30`:

# Options used to build with TPU support.
build:tpu --define=with_tpu_support=true --define=framework_shared_object=false
build:tpu --copt=-DLIBTPU_ON_GCE

Common Errors

Error Message | Cause | Solution
`Failed to initialize TPU system` | TPU not available or not on GCE | Verify TPU hardware is accessible; TPU builds require a GCE environment
`CUDA driver version is insufficient for CUDA runtime version` | GPU driver too old | Update the NVIDIA driver to >= 525.60.13 for CUDA 12.2
`Could not load dynamic library 'libcudnn.so.8'` | cuDNN not installed | Install `libcudnn8` matching CUDA 12.2
GPU OOM during serving | Model too large for GPU memory | Reduce `--per_process_gpu_memory_fraction` or use a GPU with more VRAM
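For the driver-version error above, a quick sanity check is to compare the installed driver against the CUDA 12.2 minimum using a version sort. The current version shown here is a stand-in for the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader`:

```shell
# Check whether the installed NVIDIA driver meets the CUDA 12.2 minimum.
min="525.60.13"
cur="535.104.05"   # stand-in; substitute the nvidia-smi output on your machine
if [ "$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n1)" = "$min" ]; then
  echo "driver OK for CUDA 12.2"
else
  echo "driver too old: upgrade to >= $min"
fi
```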

Compatibility Notes

  • Compute Capability: Minimum sm_60 (Pascal). Older GPUs (Maxwell, Kepler) are not supported.
  • TPU: Requires separate build config (`--config=tpu`). TPU builds disable session run timeout and use `tpu,serve` SavedModel tags by default.
  • ARM GPUs: Not supported. ARM builds (`mkl_aarch64`) are CPU-inference only.
  • Pre-built images: `tensorflow/serving:latest-gpu` is available on Docker Hub for users who do not need custom GPU builds.
