Workflow:Tensorflow Serving Batched Inference Pipeline

Knowledge Sources	TensorFlow Serving, Batching Guide, Performance Guide
Domains	ML_Ops, Model_Serving, Performance
Last Updated	2026-02-13 17:00 GMT

Overview

End-to-end process for configuring and optimizing request batching in TensorFlow Serving to maximize throughput on hardware accelerators while controlling tail latency.

Description

This workflow covers the batching subsystem of TensorFlow Serving, which transparently groups individual inference requests into batches for efficient execution on GPUs and CPUs. The system uses BatchingSession to wrap TensorFlow sessions with automatic request aggregation, controlled by a configurable batch scheduler (BasicBatchScheduler or SharedBatchScheduler). Key parameters govern maximum batch size, timeout, thread pool size, and queue depth, enabling fine-tuned tradeoffs between throughput and latency.

Usage

Execute this workflow when you need to improve inference throughput on hardware accelerators (especially GPUs), handle high query volumes efficiently, or optimize resource utilization for models that benefit from batched execution. This is particularly important for production deployments where the cost per inference must be minimized.

Execution Steps

Step 1: Enable Batching

Activate the batching subsystem by passing the --enable_batching flag when starting TensorFlow Serving. This enables the BatchingSession wrapper around the underlying TensorFlow session, which intercepts individual Session::Run calls and groups them into batches before execution.

Key considerations:

Batching is disabled by default; it must be explicitly enabled
The --batching_parameters_file flag points to a configuration file with scheduling parameters
Batching works transparently; clients send individual requests as normal
The BatchingSession handles splitting batch results back to individual responses
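
As a minimal sketch of this step, the snippet below assembles a server command line with batching enabled. The model name, model path, and config path are placeholder assumptions, and build_server_command is a hypothetical helper; only the flag names come from TensorFlow Serving.

```python
import shlex


def build_server_command(model_name, model_base_path, batching_params_path,
                         port=8500, rest_api_port=8501):
    """Return the argv list for a batching-enabled tensorflow_model_server."""
    return [
        "tensorflow_model_server",
        f"--port={port}",
        f"--rest_api_port={rest_api_port}",
        f"--model_name={model_name}",
        f"--model_base_path={model_base_path}",
        # Batching is off unless this flag is passed.
        "--enable_batching",
        # Points at the scheduling parameters file described in Step 2.
        f"--batching_parameters_file={batching_params_path}",
    ]


# Placeholder values; substitute your own model and config locations.
cmd = build_server_command("my_model", "/models/my_model",
                           "/config/batching.config")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, or the printed string pasted into a shell or container entrypoint.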

Step 2: Configure Batch Scheduling Parameters

Create a batching parameters file that controls how requests are grouped and executed. The four primary parameters are max_batch_size (maximum size of any batch, counted in examples across the combined requests), batch_timeout_micros (maximum wait time before executing an underfull batch), num_batch_threads (number of batches processed in parallel), and max_enqueued_batches (queue depth limit).

Key considerations:

max_batch_size controls the throughput/latency tradeoff and must fit within GPU memory
batch_timeout_micros bounds tail latency; set to 0 for latency-sensitive workloads
num_batch_threads should typically equal the number of CPU cores
max_enqueued_batches prevents unbounded queueing; set equal to num_batch_threads for online serving
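
The parameters file is written in protobuf text format, with each value wrapped in a `{ value: ... }` message. The sketch below renders such a file from Python; render_batching_parameters is a hypothetical helper, and the numbers are illustrative starting points rather than recommendations.

```python
import os


def render_batching_parameters(max_batch_size, batch_timeout_micros,
                               num_batch_threads, max_enqueued_batches):
    """Render the four primary batching parameters as text-format protobuf."""
    fields = [
        ("max_batch_size", max_batch_size),
        ("batch_timeout_micros", batch_timeout_micros),
        ("num_batch_threads", num_batch_threads),
        ("max_enqueued_batches", max_enqueued_batches),
    ]
    for name, value in fields:
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    return "".join(f"{name} {{ value: {value} }}\n" for name, value in fields)


# Online-serving starting point following the considerations above:
# one batch thread per CPU core, queue depth equal to the thread count.
cores = os.cpu_count() or 1
config_text = render_batching_parameters(
    max_batch_size=128,          # illustrative; must fit in accelerator memory
    batch_timeout_micros=0,      # run batches as soon as a thread is free
    num_batch_threads=cores,
    max_enqueued_batches=cores,
)
print(config_text)
```

Writing config_text to the path given by --batching_parameters_file completes the step.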

Step 3: Tune for Target Hardware

Adjust parameters based on whether the model runs on CPU or GPU. For GPU workloads, use larger batch sizes (hundreds to thousands) to maximize GPU utilization and tune batch_timeout_micros to balance throughput with tail latency. For CPU workloads, start with batch_timeout_micros at 0 and experiment with small values in the 1-10ms range.

Key considerations:

GPU models benefit most from large batches due to parallel computation efficiency
CPU models may see diminishing returns from batching; experiment to verify
Use allowed_batch_sizes to restrict batch sizes to specific values if needed
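When allowed_batch_sizes is set, BatchingSession pads each batch up to the next allowed size with dummy data; the entries must be in increasing order and the final entry must equal max_batch_size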
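
To make the cost of padding concrete, the sketch below models it as rounding up to the smallest allowed size that fits the batch. Both functions are hypothetical helpers written for illustration, not part of TensorFlow Serving; the validation rules (increasing order, last entry equal to max_batch_size) follow the TensorFlow Serving batching documentation.

```python
from bisect import bisect_left


def validate_allowed_batch_sizes(allowed, max_batch_size):
    """Check the ordering and upper-bound constraints on allowed_batch_sizes."""
    if not allowed:
        raise ValueError("allowed_batch_sizes must not be empty")
    if any(b <= a for a, b in zip(allowed, allowed[1:])):
        raise ValueError("allowed_batch_sizes must be strictly increasing")
    if allowed[-1] != max_batch_size:
        raise ValueError("the last allowed size must equal max_batch_size")


def padded_batch_size(actual, allowed):
    """Return the smallest allowed size that is >= the actual batch size."""
    if actual < 1 or actual > allowed[-1]:
        raise ValueError(f"batch size {actual} is outside 1..{allowed[-1]}")
    return allowed[bisect_left(allowed, actual)]


allowed = [1, 8, 32, 128]
validate_allowed_batch_sizes(allowed, max_batch_size=128)

actual = 9
padded = padded_batch_size(actual, allowed)
wasted = padded - actual  # slots filled with dummy data
print(f"batch of {actual} runs as {padded} ({wasted} padded slots)")
```

Widely spaced allowed sizes keep the number of distinct input shapes small but waste more compute on padding, so the list is worth tuning against the batch sizes actually observed.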

Step 4: Configure Multi-Model Batching

For servers serving multiple models or model versions, configure SharedBatchScheduler to share a single thread pool across all models. This prevents thread contention and ensures fair interleaving of batch processing across different model queues.

Key considerations:

SharedBatchScheduler maintains separate queues per model but shares execution threads
Each batch contains tasks from only one model/version
The scheduler interleaves batches from different queues for fairness
Queues can be dynamically added and removed as models are loaded/unloaded
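
The queueing behaviour described above can be illustrated with a toy, single-threaded model. This is not the C++ SharedBatchScheduler: the class name, method names, and the strict round-robin order are assumptions chosen to show per-model queues being interleaved by one shared executor.

```python
from collections import OrderedDict, deque


class ToySharedScheduler:
    """Toy model: one queue of batches per model, drained in round-robin order."""

    def __init__(self):
        self._queues = OrderedDict()  # queue name -> deque of batches

    def add_queue(self, name):
        # Called when a model version is loaded.
        self._queues[name] = deque()

    def remove_queue(self, name):
        # Called when a model version is unloaded.
        del self._queues[name]

    def enqueue(self, name, batch):
        # A batch only ever holds tasks for the queue it was enqueued on.
        self._queues[name].append(batch)

    def drain(self):
        """Yield (queue_name, batch) pairs, taking one batch per queue per pass."""
        while any(self._queues.values()):
            for name in list(self._queues):
                queue = self._queues[name]
                if queue:
                    yield name, queue.popleft()


sched = ToySharedScheduler()
sched.add_queue("model_a/1")
sched.add_queue("model_b/1")
for i in range(3):
    sched.enqueue("model_a/1", [f"a{i}"])
sched.enqueue("model_b/1", ["b0"])

order = [name for name, _ in sched.drain()]
print(order)
```

Even though model_a has three batches waiting, model_b's single batch runs second rather than last, which is the fairness property the shared scheduler is meant to provide.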

Step 5: Enable Model Warmup

Configure SavedModel warmup to pre-execute representative inference requests during model loading, so that lazily initialized components (the computation graph, batch scheduler, and thread pools) are set up before the server handles real traffic. This reduces the latency spike on the first requests.

Key considerations:

Enable with the --enable_model_warmup flag
Place warmup request data (PredictionLog records in TFRecord format) in a file named tf_serving_warmup_requests inside the model version's assets.extra/ directory
Warmup requests should be representative of production traffic patterns
Warmup is especially important for GPU models that need to allocate device memory
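
A small pre-deployment check can confirm that a model version actually ships warmup data. The sketch below assumes the standard layout of numbered version directories under the model base path and the tf_serving_warmup_requests file name used by TensorFlow Serving; the helper names and example paths are hypothetical, and the check only tests for a non-empty file without parsing the records.

```python
from pathlib import Path

WARMUP_FILE_NAME = "tf_serving_warmup_requests"


def warmup_file_path(model_base_path, version):
    """Return where TensorFlow Serving looks for warmup records for a version."""
    return Path(model_base_path) / str(version) / "assets.extra" / WARMUP_FILE_NAME


def has_warmup_data(model_base_path, version):
    """True if the warmup file exists and is non-empty (contents are not parsed)."""
    path = warmup_file_path(model_base_path, version)
    return path.is_file() and path.stat().st_size > 0


# Placeholder model location and version number.
path = warmup_file_path("/models/my_model", 3)
print(path.as_posix(), has_warmup_data("/models/my_model", 3))
```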

Step 6: Validate and Monitor Performance

Send concurrent test requests to verify batching is working correctly and measure throughput and latency. Use Prometheus metrics (via --monitoring_config_file) to monitor batch sizes, queue depths, and processing times in production.

Key considerations:

Send concurrent requests (e.g., --concurrency=10 in test clients) to trigger batch formation
Monitor :tensorflow:serving:batching_session:batch_size histogram for actual batch utilization
Compare throughput with and without batching to quantify improvement
Use TensorBoard profiling to identify bottlenecks in the inference pipeline
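
The validation step can be sketched as a small concurrent load generator. The function below is a hypothetical harness: send_request stands in for whatever client call you use (gRPC or REST), and fake_send merely sleeps so the example runs without a server. Running it once against a server with batching enabled and once without gives the throughput comparison described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send_request, num_requests, concurrency):
    """Issue requests from `concurrency` threads and summarize latency."""
    if num_requests < 1 or concurrency < 1:
        raise ValueError("num_requests and concurrency must be at least 1")

    def timed_call(i):
        start = time.perf_counter()
        send_request(i)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(num_requests)))
    wall = time.perf_counter() - wall_start

    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "throughput_rps": num_requests / wall,
        "p50_s": latencies[len(latencies) // 2],
        "p99_s": latencies[p99_index],
        "max_s": latencies[-1],
    }


def fake_send(i):
    # Stand-in for a real inference call; replace with your client code.
    time.sleep(0.001)


stats = run_load_test(fake_send, num_requests=50, concurrency=10)
print(stats)
```

Concurrency matters here: a client that sends one request at a time never gives the scheduler more than one task to batch.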
