Workflow:Tensorflow Serving Batched Inference Pipeline

From Leeroopedia
Knowledge Sources
Domains: ML_Ops, Model_Serving, Performance
Last Updated: 2026-02-13 17:00 GMT

Overview

End-to-end process for configuring and optimizing request batching in TensorFlow Serving to maximize throughput on hardware accelerators while controlling tail latency.

Description

This workflow covers the batching subsystem of TensorFlow Serving, which transparently groups individual inference requests into batches for efficient execution on GPUs and CPUs. The system uses BatchingSession to wrap TensorFlow sessions with automatic request aggregation, controlled by a configurable batch scheduler (BasicBatchScheduler or SharedBatchScheduler). Key parameters govern maximum batch size, timeout, thread pool size, and queue depth, enabling fine-tuned tradeoffs between throughput and latency.

Usage

Execute this workflow when you need to improve inference throughput on hardware accelerators (especially GPUs), handle high query volumes efficiently, or optimize resource utilization for models that benefit from batched execution. This is particularly important for production deployments where the cost per inference must be minimized.

Execution Steps

Step 1: Enable Batching

Activate the batching subsystem by passing the --enable_batching flag when starting TensorFlow Serving. This enables the BatchingSession wrapper around the underlying TensorFlow session, which intercepts individual Session::Run calls and groups them into batches before execution.

Key considerations:

  • Batching is disabled by default; it must be explicitly enabled
  • The --batching_parameters_file flag points to a configuration file with scheduling parameters
  • Batching works transparently; clients send individual requests as normal
  • The BatchingSession handles splitting batch results back to individual responses
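
As a rough illustration of the flags above, the following Python sketch assembles a command line for a batching-enabled server. The model name, paths, and helper function are hypothetical placeholders; only the flag names come from TensorFlow Serving, and the sketch prints the command rather than running it.

```python
import shlex


def build_server_command(model_name, model_base_path, batching_params_path,
                         rest_api_port=8501):
    """Return the argument list for a batching-enabled tensorflow_model_server.

    This helper is illustrative; it only assembles flags and starts nothing.
    """
    return [
        "tensorflow_model_server",
        f"--rest_api_port={rest_api_port}",
        f"--model_name={model_name}",
        f"--model_base_path={model_base_path}",
        # Batching is off unless this flag is passed explicitly.
        "--enable_batching=true",
        # Scheduling parameters are read from this text file.
        f"--batching_parameters_file={batching_params_path}",
    ]


command = build_server_command(
    model_name="my_model",                                   # hypothetical name
    model_base_path="/models/my_model",                      # hypothetical path
    batching_params_path="/models/batching_parameters.txt",  # hypothetical path
)
print(shlex.join(command))
```

The resulting list can be passed to a process launcher of your choice once the paths point at a real model and parameters file.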

Step 2: Configure Batch Scheduling Parameters

Create a batching parameters file that controls how requests are grouped and executed. The four primary parameters are max_batch_size (the maximum size of any batch), batch_timeout_micros (the maximum time, in microseconds, to wait before executing a batch that has not yet reached max_batch_size), num_batch_threads (the number of batches that can be processed in parallel), and max_enqueued_batches (the queue depth limit).

Key considerations:

  • max_batch_size controls the throughput/latency tradeoff and must fit within GPU memory
  • batch_timeout_micros bounds tail latency; set to 0 for latency-sensitive workloads
  • num_batch_threads should typically equal the number of CPU cores
  • max_enqueued_batches prevents unbounded queueing; set equal to num_batch_threads for online serving
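
The parameters file is a text-format protocol buffer in which each scalar parameter is wrapped in a `{ value: ... }` message. The sketch below renders such a file from Python; the helper name and the numeric values are illustrative assumptions, not recommendations, and should be tuned per model and hardware.

```python
def render_batching_parameters(max_batch_size, batch_timeout_micros,
                               num_batch_threads, max_enqueued_batches):
    """Render the four primary batching parameters as text-format protobuf."""
    if max_batch_size < 1 or num_batch_threads < 1:
        raise ValueError("max_batch_size and num_batch_threads must be >= 1")
    if batch_timeout_micros < 0 or max_enqueued_batches < 0:
        raise ValueError("timeout and queue depth must be non-negative")
    fields = {
        "max_batch_size": max_batch_size,
        "batch_timeout_micros": batch_timeout_micros,
        "num_batch_threads": num_batch_threads,
        "max_enqueued_batches": max_enqueued_batches,
    }
    # Each scalar parameter is written as: name { value: N }
    return "".join(f"{name} {{ value: {value} }}\n"
                   for name, value in fields.items())


# Illustrative values only: an 8-core host used for online serving, so
# max_enqueued_batches is set equal to num_batch_threads.
text = render_batching_parameters(
    max_batch_size=128,
    batch_timeout_micros=1000,
    num_batch_threads=8,
    max_enqueued_batches=8,
)
print(text)
```

Writing `text` to a file and passing its path via --batching_parameters_file would apply these settings.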

Step 3: Tune for Target Hardware

Adjust parameters based on whether the model runs on CPU or GPU. For GPU workloads, use larger batch sizes (hundreds to thousands) to maximize GPU utilization and tune batch_timeout_micros to balance throughput with tail latency. For CPU workloads, start with batch_timeout_micros at 0 and then experiment with small values in the 1-10 ms range (1000-10000, since the parameter is expressed in microseconds).

Key considerations:

  • GPU models benefit most from large batches due to parallel computation efficiency
  • CPU models may see diminishing returns from batching; experiment to verify
  • Use allowed_batch_sizes to restrict batch sizes to specific values if needed; entries must be in increasing order, and the final entry must equal max_batch_size
  • Batches are automatically padded with dummy data up to the next larger allowed size
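
To make the padding behavior concrete, here is a minimal Python sketch that computes which allowed size a batch would be padded to and how much of the padded batch is dummy data. It assumes padding rounds up to the smallest allowed size that fits the batch; the function name and the example sizes are hypothetical.

```python
from bisect import bisect_left


def padded_batch_size(batch_size, allowed_batch_sizes):
    """Return the smallest allowed size that can hold batch_size items."""
    allowed = list(allowed_batch_sizes)
    if not allowed or allowed != sorted(set(allowed)):
        raise ValueError("allowed sizes must be non-empty and strictly increasing")
    index = bisect_left(allowed, batch_size)
    if index == len(allowed):
        raise ValueError("batch exceeds the largest allowed size")
    return allowed[index]


def padding_fraction(batch_size, allowed_batch_sizes):
    """Fraction of the padded batch that is dummy data."""
    padded = padded_batch_size(batch_size, allowed_batch_sizes)
    return (padded - batch_size) / padded


allowed = [2, 4, 8, 16]  # illustrative; the last entry would equal max_batch_size
print(padded_batch_size(5, allowed))   # 8
print(padding_fraction(5, allowed))    # 0.375
```

A coarse list of allowed sizes limits the number of distinct input shapes the model sees, at the cost of more wasted computation on padding.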

Step 4: Configure Multi-Model Batching

For servers serving multiple models or model versions, configure SharedBatchScheduler to share a single thread pool across all models. This prevents thread contention and ensures fair interleaving of batch processing across different model queues.

Key considerations:

  • SharedBatchScheduler maintains separate queues per model but shares execution threads
  • Each batch contains tasks from only one model/version
  • The scheduler interleaves batches from different queues for fairness
  • Queues can be dynamically added and removed as models are loaded/unloaded
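
The interleaving idea can be illustrated with a toy model in Python. This is not the SharedBatchScheduler implementation (which is written in C++ and uses its own scheduling policy); it is a simplified round-robin sketch, with hypothetical model names, showing per-model queues drained by a shared executor so that no single model monopolizes the threads.

```python
from collections import deque


def interleave_batches(queues):
    """Drain per-model batch queues in round-robin order.

    queues maps a model name to its list of pending batches. Each batch
    belongs to exactly one model; batches are never mixed across models.
    Returns the execution order as (model_name, batch) pairs.
    """
    pending = {name: deque(batches) for name, batches in queues.items()}
    order = []
    # Keep cycling over the queues until every one of them is empty.
    while any(pending.values()):
        for name, batches in pending.items():
            if batches:
                order.append((name, batches.popleft()))
    return order


order = interleave_batches({
    "model_a": ["a1", "a2", "a3"],  # hypothetical busy model
    "model_b": ["b1"],              # hypothetical quiet model
})
print(order)
```

In this toy run, the single batch for model_b is executed second rather than waiting behind all three batches of model_a.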

Step 5: Enable Model Warmup

Configure SavedModel warmup to pre-execute representative inference requests during model loading, ensuring the computation graph, batch scheduler, and thread pools are initialized before serving real traffic. This eliminates the latency spike on first requests.

Key considerations:

  • Enable with --enable_model_warmup flag
  • Place warmup request data (PredictionLog records in TFRecord format) in a file named tf_serving_warmup_requests inside the model version's assets.extra/ directory
  • Warmup requests should be representative of production traffic patterns
  • Warmup is especially important for GPU models that need to allocate device memory
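
The small Python sketch below shows where the warmup file is expected to live relative to a model version directory. The base path and version number are hypothetical; the sketch only computes the location and does not generate the PredictionLog records themselves.

```python
from pathlib import PurePosixPath

# File name that TensorFlow Serving looks for inside assets.extra/.
WARMUP_FILE_NAME = "tf_serving_warmup_requests"


def warmup_file_path(model_base_path, version):
    """Return the expected warmup file location for one model version."""
    return (PurePosixPath(model_base_path) / str(version)
            / "assets.extra" / WARMUP_FILE_NAME)


path = warmup_file_path("/models/my_model", 1)  # hypothetical model and version
print(path)
```

Because the file sits under the version directory, each new model version needs its own warmup file.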

Step 6: Validate and Monitor Performance

Send concurrent test requests to verify batching is working correctly and measure throughput and latency. Use Prometheus metrics (via --monitoring_config_file) to monitor batch sizes, queue depths, and processing times in production.

Key considerations:

  • Send concurrent requests (e.g., --concurrency=10 in test clients) to trigger batch formation
  • Monitor :tensorflow:serving:batching_session:batch_size histogram for actual batch utilization
  • Compare throughput with and without batching to quantify improvement
  • Use TensorBoard profiling to identify bottlenecks in the inference pipeline
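
A minimal load-test sketch along these lines is shown below, using only the Python standard library. It assumes the server exposes the REST predict endpoint; the URL, model name, and input payload in the commented usage are hypothetical, and the helper names are illustrative. The sender is injected so the timing logic can be exercised without a running server.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def send_predict(url, instances, timeout=10.0):
    """POST one predict request to the REST API and return the parsed reply."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def run_load_test(send, num_requests, concurrency):
    """Call send() num_requests times from concurrency worker threads."""
    if num_requests < 1 or concurrency < 1:
        raise ValueError("num_requests and concurrency must be >= 1")

    def timed_call(_):
        start = time.perf_counter()
        send()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(num_requests)))
    wall_seconds = time.perf_counter() - wall_start

    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "requests": len(latencies),
        "throughput_rps": len(latencies) / wall_seconds,
        "p50_s": latencies[len(latencies) // 2],
        "p99_s": latencies[p99_index],
    }


# Example usage against a live server (hypothetical URL and payload):
# url = "http://localhost:8501/v1/models/my_model:predict"
# stats = run_load_test(lambda: send_predict(url, [[1.0, 2.0]]), 1000, 10)
# print(stats)
```

Running the same test with batching enabled and disabled, at the same concurrency, gives a direct throughput and tail-latency comparison.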