Implementation: BentoML IO Descriptor Pandas
| Knowledge Sources | |
|---|---|
| Domains | IO Descriptors, Pandas, API Specification |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
The Pandas IO descriptor module provides PandasDataFrame and PandasSeries descriptors for BentoML services that accept or return pandas data structures, supporting JSON, Parquet, and CSV serialization formats with dtype/shape validation.
Description
This module defines two IO descriptor classes and supporting infrastructure for pandas-based data handling in BentoML services:
PandasDataFrame (IODescriptor["ext.PdDataFrame"]):
- Multi-format serialization: Supports JSON (application/json), Parquet (application/vnd.apache.parquet), and CSV (text/csv) via the SerializationFormat enum. The format is inferred from the request Content-Type header, falling back to the configured default_format (see the sketch after this list).
- Orient configuration: Supports the pandas JSON orient values "split", "records", "index", "columns", and "values". Defaults to "records".
- Column management: Supports optional column name specification and application via the columns and apply_column_names parameters.
- Validation: The validate_dataframe method checks column count when applying column names, and validates shape with optional enforcement.
- gRPC protobuf support: Converts between pb.DataFrame messages and pandas DataFrames. Currently restricted to the "columns" orient for protobuf. Uses a ThreadPoolExecutor with 10 workers to process column contents in parallel.
- Arrow and Spark integration: Provides from_arrow/to_arrow for RecordBatch conversion and spark_schema for PySpark StructType generation.
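The format-inference step can be pictured as follows. This is a sketch, not the module's actual code: infer_format is a hypothetical helper name, and only the MIME-type values come from the source.
from __future__ import annotations
from enum import Enum

class SerializationFormat(Enum):
    JSON = "application/json"
    PARQUET = "application/vnd.apache.parquet"
    CSV = "text/csv"

def infer_format(
    content_type: str | None, default_format: SerializationFormat
) -> SerializationFormat:
    # Hypothetical helper: match the MIME type from the Content-Type header,
    # ignoring parameters such as "; charset=utf-8", and fall back to the
    # configured default when the header is missing or unrecognized.
    if not content_type:
        return default_format
    mime = content_type.split(";")[0].strip().lower()
    for fmt in SerializationFormat:
        if fmt.value == mime:
            return fmt
    return default_format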
PandasSeries (IODescriptor["ext.PdSeries"]):
- Similar validation pattern: Validates dtype (with safe casting via np.can_cast, sketched below) and shape with optional enforcement.
- gRPC protobuf support: Maps between pb.Series messages and pandas Series using the numpy dtype-to-field maps from the numpy IO descriptor module. Rejects mixed-dtype Series for protobuf serialization.
- Arrow and Spark integration: Converts between RecordBatch and Series in both directions, and generates Spark schemas.
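A minimal sketch of that dtype check; check_series_dtype is an illustrative name, and the real descriptor raises BadInput rather than ValueError.
import numpy as np
import pandas as pd

def check_series_dtype(series: pd.Series, expected_dtype: str, enforce: bool) -> pd.Series:
    # Hypothetical validator: np.can_cast with the default "safe" casting rule
    # decides whether the incoming dtype may be converted without data loss.
    if np.can_cast(series.dtype, np.dtype(expected_dtype)):
        return series.astype(expected_dtype)
    if enforce:
        # The real descriptor raises BadInput here.
        raise ValueError(f"expected dtype {expected_dtype!r}, got {series.dtype!r}")
    return series
Under this sketch, check_series_dtype(pd.Series([1, 2, 3]), "float64", enforce=True) casts safely, while casting a float Series to int64 with enforce=True raises, since numpy does not consider that cast safe.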
Supporting infrastructure:
- SerializationFormat enum maps format names to MIME types.
- get_parquet_engine() detects the available parquet engine (pyarrow or fastparquet); see the sketch below.
- _openapi_types() converts pandas dtypes to OpenAPI type strings.
- _dataframe_openapi_schema() and _series_openapi_schema() generate orient-aware OpenAPI schemas.
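The engine detection can be approximated as below, assuming it simply probes for importable backends; the module's exact implementation may differ.
def get_parquet_engine() -> str:
    # Probe for a usable parquet backend, preferring pyarrow.
    try:
        import pyarrow  # noqa: F401
        return "pyarrow"
    except ImportError:
        pass
    try:
        import fastparquet  # noqa: F401
        return "fastparquet"
    except ImportError:
        raise ImportError("parquet serialization requires either pyarrow or fastparquet")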
Usage
Use PandasDataFrame for tabular data inputs/outputs and PandasSeries for single-column data. These are specified as the input or output parameter in the @svc.api decorator.
Code Reference
Source Location
- Repository: Bentoml_BentoML
- File: src/bentoml/_internal/io_descriptors/pandas.py
- Lines: 1-1230
Signature
class PandasDataFrame(
IODescriptor["ext.PdDataFrame"],
descriptor_id="bentoml.io.PandasDataFrame",
proto_fields=("dataframe",),
):
def __init__(
self,
orient: ext.DataFrameOrient = "records",
columns: list[str] | None = None,
apply_column_names: bool = False,
dtype: bool | ext.PdDTypeArg | None = None,
enforce_dtype: bool = False,
shape: tuple[int, ...] | None = None,
enforce_shape: bool = False,
default_format: t.Literal["json", "parquet", "csv"] = "json",
): ...
class PandasSeries(
IODescriptor["ext.PdSeries"],
descriptor_id="bentoml.io.PandasSeries",
proto_fields=("series",),
):
def __init__(
self,
orient: ext.SeriesOrient = "records",
dtype: ext.PdDTypeArg | None = None,
enforce_dtype: bool = False,
shape: tuple[int, ...] | None = None,
enforce_shape: bool = False,
): ...
class SerializationFormat(Enum):
JSON = "application/json"
PARQUET = "application/vnd.apache.parquet"
CSV = "text/csv"
Import
from bentoml.io import PandasDataFrame
from bentoml.io import PandasSeries
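A construction sketch exercising the validation parameters documented under I/O Contract below. The column names and dtypes are purely illustrative, and the use of -1 as a wildcard dimension is an assumption carried over from the NumPy descriptor's shape convention.
input_spec = PandasDataFrame(
    orient="records",
    columns=["sepal_len", "sepal_width", "petal_len", "petal_width"],
    apply_column_names=True,  # rename incoming columns to the list above
    dtype={
        "sepal_len": "float64",
        "sepal_width": "float64",
        "petal_len": "float64",
        "petal_width": "float64",
    },
    enforce_dtype=True,  # reject requests that cannot be safely cast
    shape=(-1, 4),       # assumed: any number of rows, exactly four columns
    enforce_shape=True,  # raise BadInput on shape mismatch
)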
I/O Contract
Inputs (PandasDataFrame)
| Name | Type | Required | Description |
|---|---|---|---|
| orient | str | No | JSON orient format: "split", "records", "index", "columns", "values". Defaults to "records". |
| columns | list[str] or None | No | Column names to apply to the incoming DataFrame. |
| apply_column_names | bool | No | Whether to rename columns on incoming data. Requires columns to be set. |
| dtype | bool, dict, or None | No | Data type specification. If bool, pandas infers. If dict, maps column names to dtypes. |
| enforce_dtype | bool | No | If True, enforces the specified dtype. Defaults to False. |
| shape | tuple[int, ...] or None | No | Expected shape for validation. |
| enforce_shape | bool | No | If True, raises BadInput on shape mismatch. Defaults to False. |
| default_format | "json", "parquet", or "csv" | No | Default serialization format. Defaults to "json". |
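On the wire, the serialization format is chosen via the Content-Type header. A client-side sketch, assuming the predict endpoint from the usage examples below is served locally on BentoML's default port 3000:
import requests

# JSON body in the default "records" orient (integer column labels).
requests.post(
    "http://127.0.0.1:3000/predict",
    headers={"Content-Type": "application/json"},
    data='[{"0": 5, "1": 4, "2": 3, "3": 2}]',
)

# The same row as CSV; the header row carries the column labels.
requests.post(
    "http://127.0.0.1:3000/predict",
    headers={"Content-Type": "text/csv"},
    data="0,1,2,3\n5,4,3,2\n",
)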
Inputs (PandasSeries)
| Name | Type | Required | Description |
|---|---|---|---|
| orient | str | No | JSON orient format. Defaults to "records". |
| dtype | PdDTypeArg or None | No | Data type specification for the series. |
| enforce_dtype | bool | No | If True, enforces the specified dtype. Defaults to False. |
| shape | tuple[int, ...] or None | No | Expected shape for validation. |
| enforce_shape | bool | No | If True, raises BadInput on shape mismatch. Defaults to False. |
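For PandasSeries with the default "records" orient, the JSON body is simply an array of values. A sketch against the predict_series endpoint from the usage examples, again assuming a local service on port 3000:
import requests

# orient="records" serializes a Series as a plain JSON array.
requests.post(
    "http://127.0.0.1:3000/predict_series",
    headers={"Content-Type": "application/json"},
    data="[5, 4, 3, 2]",
)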
Outputs
| Name | Type | Description |
|---|---|---|
| PdDataFrame | pandas.DataFrame | Validated DataFrame from request or serialized for response. |
| PdSeries | pandas.Series | Validated Series from request or serialized for response. |
Usage Examples
from __future__ import annotations

import bentoml
import numpy as np
import pandas as pd
from bentoml.io import PandasDataFrame, PandasSeries

runner = bentoml.sklearn.get("sklearn_model_clf").to_runner()
svc = bentoml.legacy.Service("iris-classifier", runners=[runner])

# PandasDataFrame with from_sample
input_spec = PandasDataFrame.from_sample(pd.DataFrame(np.array([[5, 4, 3, 2]])))

@svc.api(input=input_spec, output=PandasDataFrame())
def predict(input_arr: pd.DataFrame) -> pd.DataFrame:
    res = runner.run(input_arr)
    return pd.DataFrame(res)

# PandasSeries usage
@svc.api(input=PandasSeries(), output=PandasSeries())
def predict_series(input_series: pd.Series) -> pd.Series:
    res = runner.run(input_series)
    return pd.Series(res)

# Using the parquet format as the default serialization
@svc.api(
    input=PandasDataFrame(default_format="parquet"),
    output=PandasDataFrame(default_format="parquet"),
)
def predict_parquet(input_df: pd.DataFrame) -> pd.DataFrame:
    return runner.run(input_df)
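To call the parquet endpoint above, a client can serialize the DataFrame to parquet bytes in memory. This sketch assumes pyarrow is installed and the service is running locally on the default port:
import io

import pandas as pd
import requests

df = pd.DataFrame([[5, 4, 3, 2]])
buf = io.BytesIO()
df.to_parquet(buf, engine="pyarrow")  # fastparquet works here as well

requests.post(
    "http://127.0.0.1:3000/predict_parquet",
    headers={"Content-Type": "application/vnd.apache.parquet"},
    data=buf.getvalue(),
)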