Heuristic: TensorFlow Serving GPU Memory and CPU Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, GPU_Computing, ML_Serving |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
Optimize serving performance through GPU memory fraction control, CPU instruction set targeting, filesystem cache flushing, and auto-configured TensorFlow parallelism.
Description
TensorFlow Serving performance depends on efficient use of both GPU and CPU resources. The key optimization levers are: controlling GPU memory allocation with `per_process_gpu_memory_fraction`, rebuilding with host-native CPU instructions (AVX, SSE4, FMA), flushing filesystem caches to reduce the memory footprint, and relying on TensorFlow's auto-configured intra-/inter-op parallelism. Deployments also benefit from fewer, larger machines rather than many small ones.
Usage
Use this heuristic when optimizing TensorFlow Serving for production performance. Apply these techniques after basic deployment is working but throughput or latency does not meet requirements. The performance guide notes: "tuning its performance is somewhat case-dependent and there are very few universal rules."
The Insight (Rule of Thumb)
- GPU Memory: Set `--per_process_gpu_memory_fraction` to control GPU allocation. Default (0.0) lets TensorFlow auto-select. Setting 1.0 pre-allocates all GPU memory at startup (reduces fragmentation but prevents GPU sharing).
- CPU Instructions: Rebuild with `--copt=-march=native` to target the host CPU's instruction set. If you see "Your CPU supports instructions that this TensorFlow binary was not compiled to use", you are leaving performance on the table.
- Filesystem Caches: Keep `--flush_filesystem_caches=true` (default) to reduce memory consumption. Disable only if model files are accessed post-load.
- Parallelism: Leave `tensorflow_intra_op_parallelism` and `tensorflow_inter_op_parallelism` on auto unless extensive experimentation shows better values for your specific workload.
- gRPC vs REST: Prefer gRPC for slightly better performance. Both are highly tuned but "the gRPC surface is observed to be slightly more performant."
- Deployment Topology: Prefer fewer, larger machines over many small ones for better resource utilization and lower fixed costs.
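As a concrete sketch, the flags above can be combined on a single `tensorflow_model_server` command line. The model name and base path below are illustrative placeholders, not values from the source:

```shell
# Launch TensorFlow Serving with explicit GPU memory and cache settings.
# --per_process_gpu_memory_fraction=1.0 pre-allocates all GPU memory at
# startup (reduces fragmentation, prevents GPU sharing);
# --flush_filesystem_caches=true is already the default, shown for clarity.
# Model name and base path are placeholders.
tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --per_process_gpu_memory_fraction=1.0 \
  --flush_filesystem_caches=true
```

Parallelism flags are deliberately left unset here so TensorFlow auto-configures them.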
Reasoning
GPU memory pre-allocation (1.0) eliminates allocation overhead during inference but prevents multi-process GPU sharing. Auto-selection (0.0) is more flexible but may cause fragmentation. Native CPU compilation generates SIMD instructions that significantly accelerate mathematical operations common in ML inference (matrix multiplications, element-wise operations). The performance guide confirms this yields measurable improvement.
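A minimal rebuild sketch for host-native instructions, assuming a source checkout of the TensorFlow Serving repository (`-march=native` is a standard GCC/Clang flag; the Bazel target path is the model server target from that repository):

```shell
# Rebuild the model server targeting the build host's CPU.
# -march=native enables whatever SIMD extensions the host supports
# (AVX, AVX2, FMA, SSE4, ...); the resulting binary is NOT portable
# to CPUs lacking those extensions.
bazel build -c opt \
  --copt=-march=native \
  //tensorflow_serving/model_servers:tensorflow_model_server
```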
Filesystem cache flushing reduces the process memory footprint by releasing kernel buffer caches for model files that are no longer needed after loading. The trade-off is that if model files are re-accessed (e.g., during re-loading), they must be read from disk again.
TensorFlow's auto-configuration for parallelism is based on runtime analysis of the hardware and model. Manual overrides should only be used after "many experiments" (performance guide's recommendation), as incorrect settings can severely degrade performance.
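If experimentation does justify a manual override, the two flags can be pinned explicitly. The thread counts below are illustrative only, not recommendations; setting a flag back to its default restores auto-configuration:

```shell
# Only after extensive benchmarking: pin op parallelism explicitly.
# 8 and 2 are arbitrary example values for this sketch.
tensorflow_model_server \
  --model_base_path=/models/my_model \
  --tensorflow_intra_op_parallelism=8 \
  --tensorflow_inter_op_parallelism=2
```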
Code Evidence
GPU memory fraction from `main.cc:228-234`:

```cpp
tensorflow::Flag(
    "per_process_gpu_memory_fraction",
    &options.per_process_gpu_memory_fraction,
    "Fraction that each process occupies of the GPU memory space "
    "the value is between 0.0 and 1.0 (with 0.0 as the default) "
    "If 1.0, the server will allocate all the memory when the server "
    "starts, If 0.0, Tensorflow will automatically select a value."),
```
Filesystem cache flushing from `main.cc:182-190`:

```cpp
tensorflow::Flag("flush_filesystem_caches",
                 &options.flush_filesystem_caches,
                 "If true (the default), filesystem caches will be "
                 "flushed after the initial load of all servables, and "
                 "after each subsequent individual servable reload (if "
                 "the number of load threads is 1). This reduces memory "
                 "consumption of the model server, at the potential cost "
                 "of cache misses if model files are accessed after "
                 "servables are loaded."),
```
Auto-configured parallelism from `main.cc:191-216`:

```cpp
tensorflow::Flag("tensorflow_intra_op_parallelism",
                 &options.tensorflow_intra_op_parallelism,
                 "Number of threads to use to parallelize the execution"
                 "of an individual op. Auto-configured by default.");
tensorflow::Flag("tensorflow_inter_op_parallelism",
                 &options.tensorflow_inter_op_parallelism,
                 "Controls the number of operators that can be executed "
                 "simultaneously. Auto-configured by default.");
```
gRPC performance note from `performance.md:79-80`:

> Both API surfaces are highly tuned and add minimal latency but in
> practice, the gRPC surface is observed to be slightly more performant.
Deployment topology from `performance.md:113-117`:

> TensorFlow Serving is more efficient when deployed on fewer, larger (more CPU
> and RAM) machines (i.e. a Deployment with a lower replicas in Kubernetes terms).
> This is due to a better potential for multi-tenant deployment to utilize the
> hardware and lower fixed costs (RPC server, TensorFlow runtime, etc.).
CPU instruction warning from `performance.md:138-145`:

```
Your CPU supports instructions that this TensorFlow binary was not compiled to
use: AVX2 FMA
```

> If you see this log entry (possibly different extensions than the 2 listed) at
> TensorFlow Serving start-up, it means you can rebuild TensorFlow Serving and
> target your particular host's platform and enjoy better performance.