Principle: TensorFlow Serving Multi-Model Batching
| Knowledge Sources | |
|---|---|
| Domains | Performance, Scheduling |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
A scheduling mechanism that coordinates batch formation across multiple models or signatures, enabling efficient resource sharing in multi-model serving deployments.
Description
When serving multiple models, each model may have its own latency requirements and optimal batch size. The StreamingBatchScheduler provides a low-latency batch scheduling approach with per-task completion callbacks and configurable thread pools that can be shared across models or isolated per model.
Unlike BasicBatchScheduler, which holds tasks until a batch fills or its timeout expires before processing, StreamingBatchScheduler processes tasks as batches form, with completion callbacks notifying callers asynchronously. This makes it better suited for mixed-latency workloads.
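The callback-driven contract described above can be sketched as a toy scheduler. This is a minimal illustration of the asynchronous completion pattern, not TensorFlow Serving's actual API; the class and method names (`TinyStreamingScheduler`, `schedule`, `on_done`) are hypothetical.

```python
import queue
import threading


class TinyStreamingScheduler:
    """Illustrative sketch: tasks are accepted without blocking and
    completed asynchronously via per-task callbacks."""

    def __init__(self, process_batch, max_batch_size=4):
        self._process_batch = process_batch  # callable: list[task] -> list[result]
        self._max_batch_size = max_batch_size
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def schedule(self, task, on_done):
        # Returns immediately; the caller is notified later via on_done.
        self._queue.put((task, on_done))

    def _run(self):
        while True:
            # Block for the first task, then drain whatever else is
            # already waiting, up to the batch-size cap.
            batch = [self._queue.get()]
            while len(batch) < self._max_batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            tasks = [task for task, _ in batch]
            results = self._process_batch(tasks)
            for (_, on_done), result in zip(batch, results):
                on_done(result)
```

A caller would pass a callback rather than waiting on a result, e.g. `sched.schedule(request, lambda r: responses.append(r))`, which is what lets latency-sensitive and throughput-oriented traffic share one scheduler without head-of-line blocking at the call site.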
The BatchSchedulerRetrier wrapper adds automatic retry logic for tasks that a scheduler rejects while it is fully loaded (all threads busy).
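The retry idea can be shown with a small wrapper. This is a hedged sketch of the pattern, not BatchSchedulerRetrier's real interface; `wrapped_schedule`, `max_attempts`, and `retry_delay_s` are illustrative names, and the assumption that the wrapped scheduler signals overload by returning `False` is mine.

```python
import time


class RetryingScheduler:
    """Illustrative retry wrapper: if the underlying scheduler rejects a
    task because it is fully loaded, wait briefly and try again."""

    def __init__(self, wrapped_schedule, max_attempts=3, retry_delay_s=0.01):
        self._wrapped_schedule = wrapped_schedule  # callable: task -> bool (accepted?)
        self._max_attempts = max_attempts
        self._retry_delay_s = retry_delay_s

    def schedule(self, task):
        for _ in range(self._max_attempts):
            if self._wrapped_schedule(task):
                return True  # accepted by the underlying scheduler
            time.sleep(self._retry_delay_s)  # back off while threads are busy
        return False  # still overloaded after all attempts
```

Bounding the attempts and delay keeps worst-case added latency predictable, which matters when the wrapped scheduler serves latency-sensitive models.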
Usage
Use StreamingBatchScheduler for multi-model deployments or when latency-sensitive models need to share resources with throughput-oriented models. Use BasicBatchScheduler (the default) for single-model deployments.
Theoretical Basis
# Abstract streaming scheduler (NOT a real implementation)
def streaming_schedule(task):
    if current_batch.has_room():
        current_batch.add(task)
    else:
        start_new_batch()
        current_batch.add(task)
    if current_batch.is_full() or timeout_expired():
        dispatch_batch_to_thread_pool(current_batch)
    # Tasks are notified via callbacks when the batch completes
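The size-or-timeout dispatch rule in the pseudocode above can be made concrete with a single-threaded batch former. This is a minimal sketch under stated assumptions: `BatchFormer` and its parameters are hypothetical names, and the timeout is checked only when a task arrives (real schedulers also close stale batches from a background timer thread).

```python
import time


class BatchFormer:
    """Illustrative size-or-timeout batch formation."""

    def __init__(self, max_batch_size, batch_timeout_s):
        self._max_batch_size = max_batch_size
        self._batch_timeout_s = batch_timeout_s
        self._batch = []
        self._oldest_arrival = None  # arrival time of the batch's first task

    def add(self, task, now=None):
        """Add a task; return the closed batch if one dispatched, else None."""
        now = time.monotonic() if now is None else now
        if not self._batch:
            self._oldest_arrival = now
        self._batch.append(task)
        # Dispatch when the batch is full or its oldest task has waited
        # longer than the timeout.
        if (len(self._batch) >= self._max_batch_size
                or now - self._oldest_arrival >= self._batch_timeout_s):
            dispatched, self._batch = self._batch, []
            return dispatched
        return None
```

For example, with `max_batch_size=3` and `batch_timeout_s=1.0`, three quick arrivals dispatch on size, while two arrivals spaced more than a second apart dispatch on timeout: the tension between these two triggers is exactly the latency/throughput trade-off the Description section discusses.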