Implementation: BentoML IO Descriptor Pandas
| Knowledge Sources | |
|---|---|
| Domains | IO Descriptors, Pandas, API Specification |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
The Pandas IO descriptor module provides PandasDataFrame and PandasSeries descriptors for BentoML services that accept or return pandas data structures, supporting JSON, Parquet, and CSV serialization formats with dtype/shape validation.
Description
This module defines two IO descriptor classes and supporting infrastructure for pandas-based data handling in BentoML services:
PandasDataFrame (IODescriptor["ext.PdDataFrame"]):
- Multi-format serialization: Supports JSON (application/json), Parquet (application/vnd.apache.parquet), and CSV (text/csv) via the SerializationFormat enum. The format is inferred from the request Content-Type header, falling back to the configured default_format (see the sketch after this list).
- Orient configuration: Supports the pandas JSON orient values "split", "records", "index", "columns", and "values". Defaults to "records".
- Column management: Supports optional column name specification and application via the columns and apply_column_names parameters.
- Validation: The validate_dataframe method checks column count when applying column names, and validates shape with optional enforcement.
- gRPC protobuf support: Converts between pb.DataFrame messages and pandas DataFrames. Currently restricted to the "columns" orient for protobuf. Uses a ThreadPoolExecutor with 10 workers to process column contents in parallel.
- Arrow and Spark integration: Provides from_arrow/to_arrow for RecordBatch conversion and spark_schema for PySpark StructType generation.
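The format-inference step can be pictured as follows. This is a sketch, not the module's actual code: infer_format is a hypothetical helper name, and only the MIME-type values come from the source.
from __future__ import annotations
from enum import Enum

class SerializationFormat(Enum):
    JSON = "application/json"
    PARQUET = "application/vnd.apache.parquet"
    CSV = "text/csv"

def infer_format(
    content_type: str | None, default_format: SerializationFormat
) -> SerializationFormat:
    # Hypothetical helper: match the MIME type from the Content-Type header,
    # ignoring parameters such as "; charset=utf-8", and fall back to the
    # configured default when the header is missing or unrecognized.
    if not content_type:
        return default_format
    mime = content_type.split(";")[0].strip().lower()
    for fmt in SerializationFormat:
        if fmt.value == mime:
            return fmt
    return default_format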
PandasSeries (IODescriptor["ext.PdSeries"]):
- Similar validation pattern: Validates dtype (with safe casting via np.can_cast, sketched below) and shape with optional enforcement.
- gRPC protobuf support: Maps between pb.Series messages and pandas Series using the numpy dtype-to-field maps from the numpy IO descriptor module. Rejects mixed-dtype Series for protobuf serialization.
- Arrow and Spark integration: Converts between RecordBatch and Series in both directions, and generates Spark schemas.
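A minimal sketch of that dtype check; check_series_dtype is an illustrative name, and the real descriptor raises BadInput rather than ValueError.
import numpy as np
import pandas as pd

def check_series_dtype(series: pd.Series, expected_dtype: str, enforce: bool) -> pd.Series:
    # Hypothetical validator: np.can_cast with the default "safe" casting rule
    # decides whether the incoming dtype may be converted without data loss.
    if np.can_cast(series.dtype, np.dtype(expected_dtype)):
        return series.astype(expected_dtype)
    if enforce:
        # The real descriptor raises BadInput here.
        raise ValueError(f"expected dtype {expected_dtype!r}, got {series.dtype!r}")
    return series
Under this sketch, check_series_dtype(pd.Series([1, 2, 3]), "float64", enforce=True) casts safely, while casting a float Series to int64 with enforce=True raises, since numpy does not consider that cast safe.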
Supporting infrastructure:
- SerializationFormat enum maps format names to MIME types.
- get_parquet_engine() detects the available parquet engine (pyarrow or fastparquet); see the sketch below.
- _openapi_types() converts pandas dtypes to OpenAPI type strings.
- _dataframe_openapi_schema() and _series_openapi_schema() generate orient-aware OpenAPI schemas.
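The engine detection can be approximated as below, assuming it simply probes for importable backends; the module's exact implementation may differ.
def get_parquet_engine() -> str:
    # Probe for a usable parquet backend, preferring pyarrow.
    try:
        import pyarrow  # noqa: F401
        return "pyarrow"
    except ImportError:
        pass
    try:
        import fastparquet  # noqa: F401
        return "fastparquet"
    except ImportError:
        raise ImportError("parquet serialization requires either pyarrow or fastparquet")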
Usage
Use PandasDataFrame for tabular data inputs/outputs and PandasSeries for single-column data. These are specified as the input or output parameter in the @svc.api decorator.
Code Reference
Source Location
- Repository: Bentoml_BentoML
- File: src/bentoml/_internal/io_descriptors/pandas.py
- Lines: 1-1230
Signature
class PandasDataFrame(
IODescriptor["ext.PdDataFrame"],
descriptor_id="bentoml.io.PandasDataFrame",
proto_fields=("dataframe",),
):
def __init__(
self,
orient: ext.DataFrameOrient = "records",
columns: list[str] | None = None,
apply_column_names: bool = False,
dtype: bool | ext.PdDTypeArg | None = None,
enforce_dtype: bool = False,
shape: tuple[int, ...] | None = None,
enforce_shape: bool = False,
default_format: t.Literal["json", "parquet", "csv"] = "json",
): ...
class PandasSeries(
IODescriptor["ext.PdSeries"],
descriptor_id="bentoml.io.PandasSeries",
proto_fields=("series",),
):
def __init__(
self,
orient: ext.SeriesOrient = "records",
dtype: ext.PdDTypeArg | None = None,
enforce_dtype: bool = False,
shape: tuple[int, ...] | None = None,
enforce_shape: bool = False,
): ...
class SerializationFormat(Enum):
JSON = "application/json"
PARQUET = "application/vnd.apache.parquet"
CSV = "text/csv"
Import
from bentoml.io import PandasDataFrame
from bentoml.io import PandasSeries
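A construction sketch exercising the validation parameters documented under I/O Contract below. The column names and dtypes are purely illustrative, and the use of -1 as a wildcard dimension is an assumption carried over from the NumPy descriptor's shape convention.
input_spec = PandasDataFrame(
    orient="records",
    columns=["sepal_len", "sepal_width", "petal_len", "petal_width"],
    apply_column_names=True,  # rename incoming columns to the list above
    dtype={
        "sepal_len": "float64",
        "sepal_width": "float64",
        "petal_len": "float64",
        "petal_width": "float64",
    },
    enforce_dtype=True,  # reject requests that cannot be safely cast
    shape=(-1, 4),       # assumed: any number of rows, exactly four columns
    enforce_shape=True,  # raise BadInput on shape mismatch
)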
I/O Contract
Inputs (PandasDataFrame)
| Name | Type | Required | Description |
|---|---|---|---|
| orient | str | No | JSON orient format: "split", "records", "index", "columns", "values". Defaults to "records". |
| columns | list[str] or None | No | Column names to apply to the incoming DataFrame. |
| apply_column_names | bool | No | Whether to rename columns on incoming data. Requires columns to be set. |
| dtype | bool, dict, or None | No | Data type specification. If bool, pandas infers. If dict, maps column names to dtypes. |
| enforce_dtype | bool | No | If True, enforces the specified dtype. Defaults to False. |
| shape | tuple[int, ...] or None | No | Expected shape for validation. |
| enforce_shape | bool | No | If True, raises BadInput on shape mismatch. Defaults to False. |
| default_format | "json", "parquet", or "csv" | No | Default serialization format. Defaults to "json". |
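On the wire, the serialization format is chosen via the Content-Type header. A client-side sketch, assuming the predict endpoint from the usage examples below is served locally on BentoML's default port 3000:
import requests

# JSON body in the default "records" orient (integer column labels).
requests.post(
    "http://127.0.0.1:3000/predict",
    headers={"Content-Type": "application/json"},
    data='[{"0": 5, "1": 4, "2": 3, "3": 2}]',
)

# The same row as CSV; the header row carries the column labels.
requests.post(
    "http://127.0.0.1:3000/predict",
    headers={"Content-Type": "text/csv"},
    data="0,1,2,3\n5,4,3,2\n",
)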
Inputs (PandasSeries)
| Name | Type | Required | Description |
|---|---|---|---|
| orient | str | No | JSON orient format. Defaults to "records". |
| dtype | PdDTypeArg or None | No | Data type specification for the series. |
| enforce_dtype | bool | No | If True, enforces the specified dtype. Defaults to False. |
| shape | tuple[int, ...] or None | No | Expected shape for validation. |
| enforce_shape | bool | No | If True, raises BadInput on shape mismatch. Defaults to False. |
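For PandasSeries with the default "records" orient, the JSON body is simply an array of values. A sketch against the predict_series endpoint from the usage examples, again assuming a local service on port 3000:
import requests

# orient="records" serializes a Series as a plain JSON array.
requests.post(
    "http://127.0.0.1:3000/predict_series",
    headers={"Content-Type": "application/json"},
    data="[5, 4, 3, 2]",
)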
Outputs
| Name | Type | Description |
|---|---|---|
| PdDataFrame | pandas.DataFrame | Validated DataFrame from request or serialized for response. |
| PdSeries | pandas.Series | Validated Series from request or serialized for response. |
Usage Examples
from __future__ import annotations

import bentoml
import numpy as np
import pandas as pd
from bentoml.io import PandasDataFrame, PandasSeries

runner = bentoml.sklearn.get("sklearn_model_clf").to_runner()
svc = bentoml.legacy.Service("iris-classifier", runners=[runner])

# PandasDataFrame with from_sample
input_spec = PandasDataFrame.from_sample(pd.DataFrame(np.array([[5, 4, 3, 2]])))

@svc.api(input=input_spec, output=PandasDataFrame())
def predict(input_arr: pd.DataFrame) -> pd.DataFrame:
    res = runner.run(input_arr)
    return pd.DataFrame(res)

# PandasSeries usage
@svc.api(input=PandasSeries(), output=PandasSeries())
def predict_series(input_series: pd.Series) -> pd.Series:
    res = runner.run(input_series)
    return pd.Series(res)

# Using the parquet format as the default serialization
@svc.api(
    input=PandasDataFrame(default_format="parquet"),
    output=PandasDataFrame(default_format="parquet"),
)
def predict_parquet(input_df: pd.DataFrame) -> pd.DataFrame:
    return runner.run(input_df)
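To call the parquet endpoint above, a client can serialize the DataFrame to parquet bytes in memory. This sketch assumes pyarrow is installed and the service is running locally on the default port:
import io

import pandas as pd
import requests

df = pd.DataFrame([[5, 4, 3, 2]])
buf = io.BytesIO()
df.to_parquet(buf, engine="pyarrow")  # fastparquet works here as well

requests.post(
    "http://127.0.0.1:3000/predict_parquet",
    headers={"Content-Type": "application/vnd.apache.parquet"},
    data=buf.getvalue(),
)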