Principle:Bentoml BentoML Async Execution Patterns

Overview

Async execution patterns reduce latency and raise throughput in multi-model pipelines by running independent service calls concurrently. With Python's async/await syntax and asyncio.gather(), a BentoML service can issue several dependency calls at once without blocking the event loop.

Detailed Explanation

In multi-model inference pipelines, not all operations are sequential. When two or more model calls are independent of each other -- meaning neither requires the output of the other -- they can be executed in parallel. This is a fundamental optimization for latency-sensitive ML serving.

The Problem with Synchronous Pipelines

In a synchronous pipeline with two independent models, total latency is the sum of both model latencies:

Synchronous: |-- Model A (100ms) --|-- Model B (150ms) --| = 250ms total

With concurrent async execution, total latency approaches the maximum of the individual latencies:

Asynchronous: |-- Model A (100ms) --|
              |-- Model B (150ms) -----| = 150ms total

In this example, latency drops from 250 ms to 150 ms, a 40% reduction.
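The sum-versus-maximum arithmetic above can be reproduced with plain asyncio. This is a minimal sketch: `fake_model` is a hypothetical stand-in that sleeps instead of running inference, and the latencies are the illustrative 100 ms and 150 ms figures from the diagram.

```python
import asyncio
import time


async def fake_model(name: str, latency_s: float) -> str:
    # Stand-in for an I/O-bound inference call; the sleep yields to the event loop.
    await asyncio.sleep(latency_s)
    return f"{name} done"


async def run_sequential() -> float:
    # Each call is awaited before the next starts, so latencies add up.
    start = time.perf_counter()
    await fake_model("A", 0.10)
    await fake_model("B", 0.15)
    return time.perf_counter() - start


async def run_concurrent() -> float:
    # Both calls are in flight at once, so the slower one dominates.
    start = time.perf_counter()
    await asyncio.gather(fake_model("A", 0.10), fake_model("B", 0.15))
    return time.perf_counter() - start


sequential_s = asyncio.run(run_sequential())
concurrent_s = asyncio.run(run_concurrent())
print(f"sequential: {sequential_s:.3f}s, concurrent: {concurrent_s:.3f}s")
```

Exact timings vary by machine, but the sequential run should land near 0.25 s and the concurrent run near 0.15 s.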

Key Concepts

.to_async Property: BentoML's Dependency class provides a .to_async property that converts synchronous dependency method calls into async coroutines. This is the bridge between BentoML's dependency proxy and Python's async ecosystem.

asyncio.gather(): Python's built-in mechanism for running multiple awaitables concurrently. It schedules the coroutines it receives as tasks on the event loop, waits for all of them to complete, and returns their results in the order the awaitables were passed in.

Event Loop Integration: BentoML's serving infrastructure runs on an async event loop. By using async dependency calls, the server can process other requests while waiting for model inference to complete, improving overall throughput.
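The concepts above combine into one recurring shape: turn each blocking dependency call into a coroutine, then await them together. The following is a stdlib-only sketch of that shape, not BentoML code. `FakeDependency` and `predict_async` are hypothetical names, and the thread-based bridge is an assumption made for illustration; it does not describe how BentoML implements .to_async.

```python
import asyncio
import time


class FakeDependency:
    """Hypothetical stand-in for a dependency proxy with a blocking method."""

    def __init__(self, name: str, latency_s: float) -> None:
        self.name = name
        self.latency_s = latency_s

    def predict(self, text: str) -> str:
        # Blocking call, similar in spirit to a synchronous request to another service.
        time.sleep(self.latency_s)
        return f"{self.name}({text})"

    async def predict_async(self, text: str) -> str:
        # Bridge the blocking call into a coroutine via a worker thread.
        # In a BentoML service this role is played by the dependency's
        # .to_async property rather than by this helper.
        return await asyncio.to_thread(self.predict, text)


async def handle_request(text: str) -> dict:
    model_a = FakeDependency("model_a", 0.05)
    model_b = FakeDependency("model_b", 0.08)
    # gather returns results in the same order as its arguments.
    result_a, result_b = await asyncio.gather(
        model_a.predict_async(text),
        model_b.predict_async(text),
    )
    return {"a": result_a, "b": result_b}


response = asyncio.run(handle_request("hello"))
print(response)
```

Because the results come back in argument order, they can be unpacked directly even when the second call finishes first.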

When to Use Async vs. Sync

Pattern	Use Async	Use Sync
Independent parallel calls	Yes -- use `asyncio.gather()`	No -- wastes time waiting sequentially
Sequential pipeline	Optional -- `await` each step	Yes -- simpler code, same performance
CPU-bound preprocessing	No -- blocks the event loop	Yes -- or use `run_in_executor`
I/O-bound inference	Yes -- maximizes concurrency	No -- a blocking call inside an async handler stalls the event loop
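For the CPU-bound row, a hedged sketch of the `run_in_executor` option using only the standard library. `tokenize` and `fake_inference` are hypothetical stand-ins for a preprocessing step and a model call.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def tokenize(text: str) -> list[str]:
    # Stand-in for synchronous, CPU-bound preprocessing.
    return [token.lower() for token in text.split()]


async def fake_inference(tokens: list[str]) -> int:
    # Stand-in for an I/O-bound model call.
    await asyncio.sleep(0.01)
    return len(tokens)


async def handle(text: str, pool: ThreadPoolExecutor) -> int:
    loop = asyncio.get_running_loop()
    # Offload the synchronous work so the event loop can keep serving other tasks.
    tokens = await loop.run_in_executor(pool, tokenize, text)
    return await fake_inference(tokens)


async def main() -> list[int]:
    with ThreadPoolExecutor(max_workers=2) as pool:
        return await asyncio.gather(
            handle("Async Execution Patterns", pool),
            handle("BentoML", pool),
        )


counts = asyncio.run(main())
print(counts)
```

A thread pool keeps the event loop responsive, but pure-Python CPU-heavy work still contends for the GIL; a ProcessPoolExecutor may be a better fit in that case.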

Concurrency vs. Parallelism

It is important to distinguish between concurrency and parallelism in this context:

Concurrency (what async provides): Multiple tasks make progress by interleaving execution. While one task waits for I/O (e.g., a model inference call to another process), another task can proceed.
Parallelism (what multiple workers provide): Multiple tasks execute simultaneously on different CPU cores or GPU devices.

The .to_async pattern provides concurrency. True parallelism comes from BentoML's multi-worker architecture where each dependent service runs in its own process.
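The interleaving that concurrency provides can be observed directly. In this minimal sketch (plain asyncio, hypothetical task names), two tasks record their start and end along with the thread they ran on.

```python
import asyncio
import threading

events: list[tuple[str, str, int]] = []


async def task(name: str, wait_s: float) -> None:
    events.append((name, "start", threading.get_ident()))
    # While this task waits, control returns to the event loop.
    await asyncio.sleep(wait_s)
    events.append((name, "end", threading.get_ident()))


async def main() -> None:
    await asyncio.gather(task("A", 0.02), task("B", 0.01))


asyncio.run(main())
order = [(name, phase) for name, phase, _ in events]
thread_ids = {thread_id for _, _, thread_id in events}
print(order)
print(f"distinct threads: {len(thread_ids)}")
```

Task B starts before task A finishes, yet every event is recorded on the same thread: the tasks overlap in time without executing simultaneously.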

Error Handling in Async Pipelines

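When using asyncio.gather() with the default return_exceptions=False, the first exception raised by any awaitable is propagated immediately to the caller awaiting gather(). The other awaitables are not cancelled; they continue to run, and their results are lost unless they are tracked separately. Passing return_exceptions=True instead collects every outcome (results and exception objects alike) in the returned list, so each one can be handled individually.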
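A minimal sketch of both modes using plain asyncio. `ok` and `fail` are hypothetical stand-ins for a succeeding and a failing model call. Note that in the default mode the remaining task keeps running after the exception has been propagated.

```python
import asyncio

finished: list[str] = []


async def ok(name: str, wait_s: float) -> str:
    await asyncio.sleep(wait_s)
    finished.append(name)
    return name


async def fail() -> str:
    await asyncio.sleep(0.01)
    raise ValueError("model B failed")


async def collect_all() -> list:
    # Exceptions are returned in place of results, in argument order.
    return await asyncio.gather(ok("A", 0.02), fail(), return_exceptions=True)


async def default_behavior() -> str:
    slow = asyncio.create_task(ok("slow", 0.03))
    try:
        await asyncio.gather(slow, fail())
    except ValueError as exc:
        message = str(exc)
    # The sibling task was not cancelled by the failure; it can still be awaited.
    await slow
    return message


results = asyncio.run(collect_all())
message = asyncio.run(default_behavior())
print(results, message, finished)
```

With return_exceptions=True, a typical follow-up is to check each entry with isinstance(entry, Exception) and fall back or log per model.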

Relationship to Implementation

This principle is implemented through the .to_async property on BentoML dependency proxies combined with standard asyncio.gather() usage patterns.

Implementation:Bentoml_BentoML_Async_Dependency_Execution

Metadata

Knowledge Sources

2026-02-13 15:00 GMT
