Principle: BentoML API Endpoint Definition
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
A design pattern for declaring HTTP API endpoints on a BentoML service using method decoration. The `@bentoml.api` decorator converts a Python method into an HTTP endpoint with automatic input/output serialization, type validation, and optional adaptive batching.
Description
API endpoint definition in BentoML transforms regular Python methods on a service class into network-accessible endpoints. The `@bentoml.api` decorator wraps each method in an `APIMethod` object that handles:
- Route registration -- each decorated method becomes an HTTP POST endpoint. The route defaults to the method name (e.g., `/predict`) but can be overridden via the `route` parameter.
- Serialization -- function arguments and return values are automatically serialized/deserialized using type hints. BentoML supports Python primitives, Pydantic models, NumPy arrays, Pandas DataFrames, and file uploads.
- Input/output validation -- when `input_spec` or `output_spec` Pydantic models are provided, BentoML enforces schema validation and generates OpenAPI documentation.
- Adaptive batching -- when `batchable=True`, BentoML collects individual requests and groups them into batches before calling the handler, improving throughput for GPU-bound models.
Both synchronous and asynchronous (`async def`) methods are supported. Async methods are preferred for I/O-bound workloads as they allow the event loop to handle concurrent requests efficiently.
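The benefit of async handlers can be seen without BentoML at all. In this plain `asyncio` sketch (the `fetch` delay is an illustrative stand-in for an I/O-bound model or database call), two waits overlap on one event loop instead of running back to back:

```python
import asyncio
import time

async def fetch(delay: float) -> float:
    # stand-in for an I/O-bound call (remote model, database, ...)
    await asyncio.sleep(delay)
    return delay

async def main() -> tuple[list[float], float]:
    start = time.perf_counter()
    # two concurrent "requests": the event loop overlaps their waits
    results = await asyncio.gather(fetch(0.2), fetch(0.2))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results)  # [0.2, 0.2]
# elapsed is roughly 0.2 s, not the 0.4 s a sequential handler would need
```

A synchronous handler would block the worker for the full duration of each call; an async handler yields control back to the event loop at every `await`.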
Usage
Use the `@bentoml.api` decorator when you need to:
- Expose a model inference function as an HTTP endpoint.
- Enable adaptive batching to maximize GPU utilization.
- Define explicit input/output schemas using Pydantic for auto-generated API documentation.
- Customize the HTTP route path for an endpoint.
A typical usage:

```python
import bentoml
import numpy as np
from pydantic import BaseModel

class InputData(BaseModel):
    features: list[float]

class OutputData(BaseModel):
    prediction: float
    confidence: float

@bentoml.service
class PredictionService:
    def __init__(self) -> None:
        self.model = ...  # load the trained model here (omitted for brevity)

    @bentoml.api(
        input_spec=InputData,
        output_spec=OutputData,
        route="/v1/predict",
    )
    def predict(self, features: list[float]) -> OutputData:
        # with input_spec, the spec's fields are passed as named arguments
        result = self.model.predict(np.array(features))
        return OutputData(prediction=result[0], confidence=0.95)
```
Theoretical Basis
The API endpoint definition pattern applies the method interception paradigm: a decorator intercepts method calls to inject cross-cutting concerns (serialization, validation, batching) without modifying the business logic.
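The interception idea can be sketched with a toy decorator; the names (`api`, `endpoint`, the JSON-string transport) are illustrative, not BentoML's actual machinery. The decorator registers a route and wraps the handler with deserialization, validation, and serialization, leaving the business logic untouched:

```python
import functools
import json

def api(route=None):
    """Toy endpoint decorator: intercepts calls to inject cross-cutting
    concerns (deserialize, validate, serialize) around the handler."""
    def wrap(fn):
        @functools.wraps(fn)
        def endpoint(raw_body: str) -> str:
            payload = json.loads(raw_body)          # DESERIALIZE
            if not isinstance(payload, dict):       # VALIDATE
                raise TypeError("expected a JSON object")
            result = fn(**payload)                  # business logic untouched
            return json.dumps(result)               # SERIALIZE
        endpoint.route = route or f"/{fn.__name__}"  # ROUTE REGISTRATION
        return endpoint
    return wrap

@api(route="/v1/add")
def add(a, b):
    return {"sum": a + b}

print(add.route)                # /v1/add
print(add('{"a": 2, "b": 3}'))  # {"sum": 5}
```

The decorated function now speaks "wire format" (JSON strings) on the outside while its body still works with plain Python values, which is the essence of the pattern.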
The abstract pattern is as follows:
```
API_ENDPOINT(method, config):
  ROUTE REGISTRATION:
    HTTP POST /{route or method.__name__}
    -> maps to method on the service instance

  REQUEST LIFECYCLE:
    1. RECEIVE raw HTTP request
    2. DESERIALIZE body -> typed Python arguments
       (using type hints, input_spec, or IODescriptor)
    3. VALIDATE inputs against schema
    4. IF batchable:
         COLLECT requests until max_batch_size or max_latency_ms
         CALL method(batched_inputs)
         SPLIT outputs back to individual responses
       ELSE:
         CALL method(inputs)
    5. SERIALIZE return value -> HTTP response body
    6. SEND HTTP response

  BATCHING (optional):
    batch_dim      : axis along which to concatenate inputs
    max_batch_size : upper limit on batch size
    max_latency_ms : maximum wait time before dispatching a partial batch
```
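The batching branch of step 4 can be simulated in a few lines. This is a single-threaded toy that uses explicit arrival timestamps instead of a real clock, and every name (`batch_dispatch`, `flush`, etc.) is illustrative rather than BentoML's implementation:

```python
def batch_dispatch(requests, call, max_batch_size=4, max_latency_ms=50):
    """Toy adaptive batching: collect requests into a batch until either
    max_batch_size is reached or max_latency_ms has elapsed, then call the
    handler once and split its outputs back into per-request responses.
    `requests` is a list of (arrival_ms, payload) pairs in arrival order."""
    responses = []
    batch, deadline = [], None

    def flush():
        nonlocal batch, deadline
        if batch:
            outputs = call(batch)      # one handler call for the whole batch
            responses.extend(outputs)  # SPLIT back to individual responses
            batch, deadline = [], None

    for arrival_ms, item in requests:
        if deadline is not None and arrival_ms > deadline:
            flush()  # latency budget exceeded: dispatch the partial batch
        if not batch:
            deadline = arrival_ms + max_latency_ms
        batch.append(item)
        if len(batch) >= max_batch_size:
            flush()  # size limit reached: dispatch a full batch
    flush()
    return responses

double = lambda xs: [2 * x for x in xs]
# first four requests fill a batch; the fifth arrives late and goes alone
print(batch_dispatch([(0, 1), (10, 2), (20, 3), (30, 4), (200, 5)], double))
# -> [2, 4, 6, 8, 10]
```

Note that the handler is invoked twice here (once per batch) even though five responses come back, which is exactly the throughput win batching aims for on GPU-bound models.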
Key theoretical properties:
- Transparency -- the method signature defines the API contract; the decorator adds infrastructure without changing the method body.
- Adaptive batching -- dynamically adjusts batch sizes based on incoming request rates and latency budgets, optimizing throughput without manual tuning.
- Schema-driven documentation -- Pydantic-based specs generate OpenAPI/Swagger documentation automatically.