Principle:Bentoml BentoML Async Execution Patterns
Overview
Async Execution Patterns address the challenge of maximizing throughput in multi-model pipelines by executing independent service calls concurrently. Using Python's async/await syntax and asyncio.gather(), BentoML lets independent dependency calls run concurrently without blocking the event loop.
Detailed Explanation
In multi-model inference pipelines, not all operations are sequential. When two or more model calls are independent of each other -- meaning neither requires the output of the other -- they can be executed in parallel. This is a fundamental optimization for latency-sensitive ML serving.
The Problem with Synchronous Pipelines
In a synchronous pipeline with two independent models, total latency is the sum of both model latencies:
Synchronous: |-- Model A (100ms) --|-- Model B (150ms) --| = 250ms total
With async parallel execution, total latency is the maximum of the individual latencies:
Asynchronous: |-- Model A (100ms) --|
              |-- Model B (150ms) -----| = 150ms total
This represents a 40% latency reduction in this example: (250ms - 150ms) / 250ms = 40%.
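The latency comparison above can be reproduced with a minimal sketch that uses only the standard library. The `call_model` coroutine is a hypothetical stand-in for a remote model call, with `asyncio.sleep()` simulating the wait. It is not BentoML code, and measured timings will vary slightly between runs.

```python
import asyncio
import time


async def call_model(name: str, latency_s: float) -> str:
    """Hypothetical stand-in for a model call; sleeping simulates I/O wait."""
    await asyncio.sleep(latency_s)
    return f"{name} done"


async def run_sequential() -> float:
    """Await each call in turn: total time is roughly the sum of latencies."""
    start = time.perf_counter()
    await call_model("A", 0.10)
    await call_model("B", 0.15)
    return time.perf_counter() - start


async def run_concurrent() -> float:
    """Await both calls together: total time is roughly the longer latency."""
    start = time.perf_counter()
    await asyncio.gather(call_model("A", 0.10), call_model("B", 0.15))
    return time.perf_counter() - start


if __name__ == "__main__":
    sequential = asyncio.run(run_sequential())
    concurrent = asyncio.run(run_concurrent())
    print(f"sequential: {sequential:.3f}s, concurrent: {concurrent:.3f}s")
```

The saving only appears because the simulated calls spend their time waiting. Two coroutines that never yield to the event loop would still run one after the other.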
Key Concepts
- .to_async Property: BentoML's Dependency class provides a .to_async property that converts synchronous dependency method calls into async coroutines. This is the bridge between BentoML's dependency proxy and Python's async ecosystem.
- asyncio.gather(): Python's built-in mechanism for running multiple coroutines concurrently. It schedules all coroutines to run on the event loop and waits for all of them to complete.
- Event Loop Integration: BentoML's serving infrastructure runs on an async event loop. By using async dependency calls, the server can process other requests while waiting for model inference to complete, improving overall throughput.
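To make the call shape of these concepts concrete, the sketch below imitates a dependency proxy with a `to_async` property and gathers two independent calls. Every class here (`SyncModel`, `AsyncView`, `DependencyStandIn`) is a hypothetical stand-in written for illustration. It is not BentoML's implementation, and it uses worker threads via `asyncio.to_thread()` (Python 3.9+) purely so the example runs without BentoML installed.

```python
import asyncio
import time


class SyncModel:
    """Hypothetical blocking service; not a BentoML class."""

    def __init__(self, name: str) -> None:
        self.name = name

    def predict(self, text: str) -> str:
        time.sleep(0.01)  # simulate blocking inference work
        return f"{self.name}:{text}"


class AsyncView:
    """Exposes each sync method of the wrapped object as a coroutine function."""

    def __init__(self, target: object) -> None:
        self._target = target

    def __getattr__(self, method_name: str):
        method = getattr(self._target, method_name)

        async def call(*args, **kwargs):
            # Run the blocking method in a worker thread so the loop stays free.
            return await asyncio.to_thread(method, *args, **kwargs)

        return call


class DependencyStandIn:
    """Illustrative stand-in for a dependency proxy with a to_async property."""

    def __init__(self, target: object) -> None:
        self._target = target

    @property
    def to_async(self) -> AsyncView:
        return AsyncView(self._target)


async def pipeline(text: str) -> list:
    model_a = DependencyStandIn(SyncModel("model_a"))
    model_b = DependencyStandIn(SyncModel("model_b"))
    # Both calls are independent, so they can be awaited together.
    return await asyncio.gather(
        model_a.to_async.predict(text),
        model_b.to_async.predict(text),
    )


if __name__ == "__main__":
    print(asyncio.run(pipeline("hello")))
```

In a real service the two dependencies would be declared on the service class and the gathered calls would sit inside an async API method. The stand-in keeps only the part that matters here: `dependency.to_async.method(...)` yields an awaitable that `asyncio.gather()` can schedule.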
When to Use Async vs. Sync
| Pattern | Use Async | Use Sync |
|---|---|---|
| Independent parallel calls | Yes -- use asyncio.gather() | No -- wastes time waiting sequentially |
| Sequential pipeline | Optional -- await each step | Yes -- simpler code, same performance |
| CPU-bound preprocessing | No -- blocks the event loop | Yes -- or use run_in_executor |
| I/O-bound inference | Yes -- maximizes concurrency | No -- blocks event loop |
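For the CPU-bound row, a minimal sketch of the run_in_executor option is shown here. The `tokenize` function is a hypothetical preprocessing step chosen for illustration. Passing `None` as the executor selects the event loop's default thread pool.

```python
import asyncio


def tokenize(text: str) -> list:
    """Hypothetical CPU-bound preprocessing step (plain synchronous function)."""
    return text.lower().split()


async def preprocess(text: str) -> list:
    loop = asyncio.get_running_loop()
    # Hand the synchronous work to the default executor so the event loop
    # can keep serving other coroutines while it runs.
    return await loop.run_in_executor(None, tokenize, text)


if __name__ == "__main__":
    print(asyncio.run(preprocess("Hello Async World")))
```

A thread pool keeps the loop responsive, but pure-Python CPU work in a thread still competes for the GIL. For heavy computation, passing a `concurrent.futures.ProcessPoolExecutor` as the first argument may be the better fit.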
Concurrency vs. Parallelism
It is important to distinguish between concurrency and parallelism in this context:
- Concurrency (what async provides): Multiple tasks make progress by interleaving execution. While one task waits for I/O (e.g., a model inference call to another process), another task can proceed.
- Parallelism (what multiple workers provide): Multiple tasks execute simultaneously on different CPU cores or GPU devices.
The .to_async pattern provides concurrency. True parallelism comes from BentoML's multi-worker architecture where each dependent service runs in its own process.
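The interleaving that defines concurrency can be observed directly. In this sketch, which uses only the standard library and hypothetical `worker` coroutines, two tasks take turns on a single thread each time one of them yields with `await asyncio.sleep(0)`.

```python
import asyncio
import threading


async def worker(name: str, steps: int, log: list) -> None:
    for step in range(steps):
        # Record which task ran and on which OS thread.
        log.append((name, step, threading.get_ident()))
        await asyncio.sleep(0)  # yield control back to the event loop


async def interleave() -> list:
    log: list = []
    await asyncio.gather(worker("A", 3, log), worker("B", 3, log))
    return log


if __name__ == "__main__":
    for name, step, thread_id in asyncio.run(interleave()):
        print(name, step, thread_id)
```

The log alternates between A and B while the thread identifier stays the same, which is concurrency without parallelism: progress is shared, not simultaneous.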
Error Handling in Async Pipelines
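When using asyncio.gather(), if one coroutine raises an exception, the default behavior (return_exceptions=False) is to propagate the first exception immediately to the caller awaiting gather(). The other awaitables are not cancelled and continue to run in the background, but their results are no longer returned to that caller. Passing return_exceptions=True changes this: gather() waits for every awaitable and returns all results, with exceptions included as ordinary values in the result list so they can be handled individually.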
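A minimal sketch of the return_exceptions=True pattern follows. The `ok` and `fail` coroutines are hypothetical stand-ins for a successful and a failing model call.

```python
import asyncio


async def ok(value: str, delay_s: float = 0.01) -> str:
    """Hypothetical model call that succeeds after a short wait."""
    await asyncio.sleep(delay_s)
    return value


async def fail(message: str) -> str:
    """Hypothetical model call that raises."""
    await asyncio.sleep(0)
    raise ValueError(message)


async def collect_all() -> tuple:
    # Exceptions are returned in the result list instead of being raised.
    results = await asyncio.gather(
        ok("a"),
        fail("model B failed"),
        return_exceptions=True,
    )
    successes = [r for r in results if not isinstance(r, BaseException)]
    failures = [r for r in results if isinstance(r, BaseException)]
    return successes, failures


if __name__ == "__main__":
    successes, failures = asyncio.run(collect_all())
    print("successes:", successes)
    print("failures:", failures)
```

Results keep the order of the arguments passed to gather(), so a failure can be matched back to the call that produced it by position.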
Relationship to Implementation
This principle is implemented through the .to_async property on BentoML dependency proxies combined with standard asyncio.gather() usage patterns.
Implementation:Bentoml_BentoML_Async_Dependency_Execution