Principle:Kserve Kserve Batch Inference
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Performance_Optimization, Model_Serving |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A throughput optimization pattern that groups multiple individual inference requests into a single batch before forwarding them to the model server, amortizing per-request overhead.
Description
Batch Inference improves model serving throughput by collecting individual prediction requests until a maximum batch size is reached or a configurable time window expires, whichever comes first, then forwarding the aggregated batch as a single request to the model predictor. This technique exploits the parallel computation capabilities of GPUs and vectorized CPU operations: fixed costs such as kernel launches and memory allocation are paid once per batch rather than once per request, so processing N inputs together is typically much faster than processing the same N inputs one at a time.
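The amortization argument can be made concrete with a toy cost model. The sketch below is an illustration, not a measurement: it assumes a fixed overhead per predictor call plus a linear cost per input, and the function names and millisecond figures are hypothetical. Real models often deviate from this linear shape.

```python
def sequential_time(n: int, overhead: float, per_item: float) -> float:
    """Total time when each of n inputs pays the fixed overhead on its own."""
    return n * (overhead + per_item)


def batched_time(n: int, overhead: float, per_item: float) -> float:
    """Total time when n inputs share a single fixed overhead."""
    return overhead + n * per_item


# Hypothetical numbers: 5 ms fixed overhead, 0.5 ms of compute per input.
n, overhead_ms, per_item_ms = 32, 5.0, 0.5
print(sequential_time(n, overhead_ms, per_item_ms))  # 176.0
print(batched_time(n, overhead_ms, per_item_ms))     # 21.0
```

Under these assumed numbers the batch finishes in roughly an eighth of the sequential time. The gain shrinks as the fixed overhead becomes small relative to the per-input cost.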
KServe implements batching via an annotation-driven sidecar that intercepts requests before they reach the predictor container. The batcher is configured through InferenceService annotations that control the maximum batch size, the maximum latency (the time window for collecting requests), and the input format expected by the model. When the predictor responds, the batcher splits the batch response and returns each result to the client that sent the corresponding request.
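The merge and split bookkeeping described above can be sketched as follows. This is a minimal illustration, not KServe's batcher code: it assumes request bodies with an `instances` list and responses with a `predictions` list of the same length, and the helper names `merge_requests` and `split_response` are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def merge_requests(requests: List[Dict[str, List[Any]]]) -> Tuple[Dict[str, List[Any]], List[int]]:
    """Concatenate the "instances" of several requests and record each request's size."""
    merged: List[Any] = []
    sizes: List[int] = []
    for request in requests:
        instances = request["instances"]
        merged.extend(instances)
        sizes.append(len(instances))
    return {"instances": merged}, sizes


def split_response(response: Dict[str, List[Any]], sizes: List[int]) -> List[Dict[str, List[Any]]]:
    """Slice the batched "predictions" back into one response per original request."""
    predictions = response["predictions"]
    if len(predictions) != sum(sizes):
        raise ValueError("predictor returned a different number of predictions than inputs")
    results, start = [], 0
    for size in sizes:
        results.append({"predictions": predictions[start:start + size]})
        start += size
    return results


# Two clients: one sends two inputs, the other sends one.
batch, sizes = merge_requests([{"instances": [1, 2]}, {"instances": [3]}])
fake_response = {"predictions": [x * 10 for x in batch["instances"]]}  # stand-in predictor
print(split_response(fake_response, sizes))
# [{'predictions': [10, 20]}, {'predictions': [30]}]
```

Recording only the per-request sizes works because the predictor is assumed to preserve input order. A predictor that reorders or drops outputs would need explicit per-input identifiers instead.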
Usage
Use this principle when:
- Model inference has significant per-request overhead (GPU kernel launch, memory allocation)
- High throughput is more important than minimal per-request latency
- Multiple clients send requests concurrently to the same model
- GPU utilization is low due to small individual request sizes
Theoretical Basis
# Batching pattern (NOT implementation code)
Batcher configuration via annotations:
    serving.kserve.io/batcherMaxBatchSize: "32"
    serving.kserve.io/batcherMaxLatency: "500"    (milliseconds)
Batching algorithm:
    1. Batcher receives individual request R_i
    2. Add R_i to pending batch B
    3. Check flush conditions:
        a. |B| >= maxBatchSize → flush immediately
        b. time_since_first_request >= maxLatency → flush on timeout
    4. On flush:
        a. Concatenate all inputs in B into a single batch tensor
        b. Send batch request to predictor
        c. Receive batch response
        d. Split response into individual results
        e. Return each result to the corresponding client
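The two flush conditions in the algorithm above can be sketched in Python. This is a single-threaded illustration under stated assumptions, not the KServe sidecar: the `Batcher` class, its `tick` method, and the `predict` callable are hypothetical, and time is passed in explicitly so the behavior is deterministic. A real batcher would use timers and concurrent request handling.

```python
from typing import Any, Callable, List, Optional


class Batcher:
    """Collects requests and flushes on size or on age of the oldest pending request."""

    def __init__(self, max_batch_size: int, max_latency_ms: float,
                 predict: Callable[[List[Any]], List[Any]]) -> None:
        self.max_batch_size = max_batch_size
        self.max_latency_ms = max_latency_ms
        self.predict = predict              # hypothetical stand-in for the predictor call
        self.pending: List[Any] = []
        self.first_arrival_ms: Optional[float] = None
        self.flushed: List[List[Any]] = []  # one list of results per flush

    def add(self, request: Any, now_ms: float) -> None:
        """Steps 1 to 3a: enqueue the request, flush at once if the batch is full."""
        if not self.pending:
            self.first_arrival_ms = now_ms
        self.pending.append(request)
        if len(self.pending) >= self.max_batch_size:
            self._flush()

    def tick(self, now_ms: float) -> None:
        """Step 3b: flush if the oldest pending request has waited max_latency_ms."""
        if self.pending and now_ms - self.first_arrival_ms >= self.max_latency_ms:
            self._flush()

    def _flush(self) -> None:
        """Step 4: send the whole batch, keep the results in request order."""
        batch, self.pending, self.first_arrival_ms = self.pending, [], None
        self.flushed.append(self.predict(batch))


batcher = Batcher(max_batch_size=3, max_latency_ms=500, predict=lambda xs: [x * 2 for x in xs])
for t, x in [(0, 1), (10, 2), (20, 3)]:  # third request fills the batch
    batcher.add(x, now_ms=t)
batcher.add(4, now_ms=30)                # starts a new batch
batcher.tick(now_ms=400)                 # 370 ms waited: no flush yet
batcher.tick(now_ms=530)                 # 500 ms waited: timeout flush
print(batcher.flushed)                   # [[2, 4, 6], [8]]
```

The timeout is measured from the first request in the pending batch, so no request waits longer than `max_latency_ms` in the queue, apart from the predictor's own compute time.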
Tradeoff:
    Small maxBatchSize + low maxLatency → lower throughput, faster individual response
    Large maxBatchSize + high maxLatency → higher throughput, slower individual response
    Optimal settings depend on request arrival rate and model characteristics
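How the two settings interact with the arrival rate can be estimated with a small calculation. The sketch below assumes an idealized stream with evenly spaced arrivals and ignores predictor compute time. Both function names are hypothetical, and a request arriving exactly at the timeout is counted as part of the batch.

```python
def first_request_wait_ms(max_batch_size: int, max_latency_ms: float, interval_ms: float) -> float:
    """Queueing delay of the first request in a batch under evenly spaced arrivals.

    The batch fills after (max_batch_size - 1) further arrivals, unless the
    timeout fires first.
    """
    time_to_fill_ms = (max_batch_size - 1) * interval_ms
    return min(time_to_fill_ms, max_latency_ms)


def batch_size_at_flush(max_batch_size: int, max_latency_ms: float, interval_ms: float) -> int:
    """Number of requests in the batch at the moment it is flushed."""
    arrived_before_timeout = int(max_latency_ms // interval_ms) + 1
    return min(max_batch_size, arrived_before_timeout)


# (maxBatchSize, maxLatency in ms, gap between arrivals in ms)
for settings in [(32, 500, 50), (32, 500, 5), (4, 500, 5)]:
    print(settings, first_request_wait_ms(*settings), batch_size_at_flush(*settings))
# (32, 500, 50) 500 11
# (32, 500, 5) 155 32
# (4, 500, 5) 15 4
```

With sparse traffic (one request every 50 ms) the timeout fires before the batch fills, so the first request waits the full 500 ms for a batch of only 11. With dense traffic the same settings fill all 32 slots in 155 ms. Shrinking the batch to 4 cuts the wait to 15 ms at the cost of eight times as many predictor calls.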