Heuristic: BentoML Adaptive Batching Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, ML_Serving |
| Last Updated | 2026-02-13 16:00 GMT |
Overview
Tuning guidance for BentoML's adaptive batching system, covering the CORK algorithm's optimizer parameters, `max_batch_size`, and `max_latency_ms`, to balance throughput against latency.
Description
BentoML implements an adaptive batching system using the CORK algorithm via the `CorkDispatcher` class. The system dynamically groups incoming requests into batches based on real-time traffic patterns. An `Optimizer` component uses linear regression on historical batch data to learn the relationship `duration = o_a * n + o_b` (where `n` is batch size). The optimizer skips the first 2 samples as inaccurate, keeps 50 samples for regression, and refreshes parameters every 5 seconds. During startup, it performs a training phase with progressively increasing batch sizes (1, 2, 3) to warm up the model. The controller polls at 1ms intervals (`TICKET_INTERVAL = 0.001`) and applies a 0.95 decay rate on wait time estimates.
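The regression step described above can be sketched in plain Python. This is an illustrative approximation, not BentoML's actual `Optimizer` implementation; the class and method names below are invented, while the window-size constants mirror the ones quoted later in this document:

```python
# Sketch: fit duration = o_a * n + o_b by least squares over a rolling
# window of (batch_size, duration) samples, skipping early noisy samples.
import collections

class LinearOptimizer:
    N_KEPT_SAMPLE = 50     # rolling window of samples used for the fit
    N_SKIPPED_SAMPLE = 2   # ignore the first samples (warm-up noise)

    def __init__(self):
        self.samples = collections.deque(maxlen=self.N_KEPT_SAMPLE)
        self.skipped = 0
        self.o_a = 2.0  # initial slope guess (ms per item)
        self.o_b = 1.0  # initial intercept guess (ms fixed overhead)

    def log_outbound(self, n, duration):
        """Record one completed batch of size n that took `duration` ms."""
        if self.skipped < self.N_SKIPPED_SAMPLE:
            self.skipped += 1
            return
        self.samples.append((n, duration))
        if len(self.samples) >= 2:
            self._refit()

    def _refit(self):
        xs = [n for n, _ in self.samples]
        ys = [d for _, d in self.samples]
        mx = sum(xs) / len(xs)
        my = sum(ys) / len(ys)
        var = sum((x - mx) ** 2 for x in xs)
        if var > 0:  # need at least two distinct batch sizes
            self.o_a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
            self.o_b = my - self.o_a * mx
```

Feeding it batches whose durations follow `3 * n + 5` ms converges `o_a` to 3 and `o_b` to 5, which is the shape of signal the real optimizer extracts from production traffic.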
Usage
Use this heuristic when tuning BentoML service performance: when encountering HTTP 503 errors (requests exceeding `max_latency_ms`), when throughput is lower than expected, or when you need to balance latency against throughput for a specific deployment. Critical for batchable API endpoints decorated with `@bentoml.api(batchable=True)`.
The Insight (Rule of Thumb)
- Action: Set `max_batch_size` based on available GPU memory and model requirements. Set `max_latency_ms` based on acceptable end-to-end response time.
- Default values (v1 config): `max_batch_size=100`, `max_latency_ms=60000` (60 seconds).
- Increase `max_latency_ms`: if you see 503 errors, `max_latency_ms` is likely too low for the model's processing time. The system logs the warning: "a service has a max latency that is likely too low for serving."
- Sync-to-batch caveat: When a synchronous endpoint calls a batchable endpoint in another service, it sends only one request at a time (concurrency=1). Set `threads=N` in `@bentoml.service` to enable concurrent requests and allow batching.
- Batchable API constraints: Must accept a list/array type and only one parameter (plus optional `bentoml.Context`). Use Pydantic models to group multiple parameters.
- Trade-off: Higher `max_batch_size` increases throughput but uses more memory. Higher `max_latency_ms` allows larger batches but increases individual request latency.
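The trade-off in the last bullet can be made concrete with the learned linear model `duration = o_a * n + o_b`; the `o_a`/`o_b` values below are made-up illustrative numbers, not measurements:

```python
# Toy illustration of the batching trade-off using the linear cost model.
o_a, o_b = 2.0, 10.0  # hypothetical: 2 ms per item, 10 ms fixed overhead

def batch_duration_ms(n: int) -> float:
    """Predicted processing time for a batch of n requests."""
    return o_a * n + o_b

def throughput_per_sec(n: int) -> float:
    """Requests served per second when always running batches of size n."""
    return n / (batch_duration_ms(n) / 1000.0)

# Larger batches amortize the fixed overhead o_b, raising throughput...
assert throughput_per_sec(32) > throughput_per_sec(1)
# ...but every request in the batch waits for the whole batch to finish.
assert batch_duration_ms(32) > batch_duration_ms(1)
```

With these numbers a batch of 1 serves roughly 83 requests/s while a batch of 32 serves roughly 432 requests/s, at the cost of each request's latency rising from 12 ms to 74 ms.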
Reasoning
The CORK algorithm optimizes the tension between latency and throughput. It models batch processing time as linear (`o_a * n + o_b`) and uses this to predict whether waiting for more requests will improve or worsen total latency. During low traffic, it processes smaller batches quickly. During high traffic, it groups more requests together. The optimizer's training phase ensures it has calibrated parameters before making batching decisions in production.
The warning at `dispatcher.py:278-281` is triggered when `o_a + o_b >= max_latency`, meaning a single request already takes longer than the maximum allowed latency. This is a strong signal that `max_latency_ms` needs to be increased.
Code evidence from `dispatcher.py:278-281`:

```python
if self.optimizer.o_a + self.optimizer.o_b >= self.max_latency:
    logger.warning(
        "BentoML has detected that a service has a max latency that is likely "
        "too low for serving. If many 503 errors are encountered, try raising "
        "the 'runner.max_latency' in your BentoML configuration YAML file."
    )
```
Optimizer constants from `dispatcher.py:43-45`:

```python
class Optimizer:
    N_KEPT_SAMPLE = 50  # amount of outbound info kept for inferring params
    N_SKIPPED_SAMPLE = 2  # amount of outbound info skipped after init
    INTERVAL_REFRESH_PARAMS = 5  # seconds between each params refreshing
```
Default batching config from `v1/default_configuration.yaml`:

```yaml
runners:
  batching:
    enabled: true
    max_batch_size: 100
    max_latency_ms: 60000
```
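To tighten these defaults for a specific deployment, the same keys can be overridden in your BentoML configuration YAML. The values below are illustrative, not recommendations; size `max_batch_size` to your GPU memory and `max_latency_ms` to your latency budget, as described in the rules of thumb above:

```yaml
runners:
  batching:
    enabled: true
    max_batch_size: 32     # illustrative: sized to fit the model in GPU memory
    max_latency_ms: 2000   # illustrative: acceptable end-to-end latency budget
```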