
Implementation:BentoML IO Descriptor Pandas

From Leeroopedia
Knowledge Sources
Domains: IO Descriptors, Pandas, API Specification
Last Updated: 2026-02-13 15:00 GMT

Overview

The Pandas IO descriptor module provides PandasDataFrame and PandasSeries descriptors for BentoML services that accept or return pandas data structures, supporting JSON, Parquet, and CSV serialization formats with dtype/shape validation.

Description

This module defines two IO descriptor classes and supporting infrastructure for pandas-based data handling in BentoML services:

PandasDataFrame (IODescriptor["ext.PdDataFrame"]):

  • Multi-format serialization: Supports JSON (application/json), Parquet (application/vnd.apache.parquet), and CSV (text/csv) via the SerializationFormat enum. The format is inferred from the request Content-Type header, falling back to the configured default_format.
  • Orient configuration: Supports pandas JSON orient values: "split", "records", "index", "columns", and "values". Defaults to "records".
  • Column management: Supports optional column name specification and application via columns and apply_column_names parameters.
  • Validation: The validate_dataframe method checks column count when applying column names, and validates shape with optional enforcement.
  • gRPC protobuf support: Converts between pb.DataFrame messages and pandas DataFrames. Currently restricted to "columns" orient for protobuf. Uses a ThreadPoolExecutor with 10 workers to process column contents in parallel.
  • Arrow and Spark integration: Provides from_arrow/to_arrow for RecordBatch conversion and spark_schema for PySpark StructType generation.
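The format-inference step described above can be sketched with the standard library alone. This is a hypothetical helper, not BentoML's actual code: it matches the request's Content-Type header against the enum's MIME values and falls back to the configured default.

```python
from __future__ import annotations

from enum import Enum


class SerializationFormat(Enum):
    JSON = "application/json"
    PARQUET = "application/vnd.apache.parquet"
    CSV = "text/csv"


def infer_format(content_type: str | None, default: str = "json") -> SerializationFormat:
    # Hypothetical helper mirroring the header-based inference described
    # above: match the bare MIME type (parameters such as charset stripped),
    # else fall back to the default format.
    if content_type:
        mime = content_type.split(";")[0].strip().lower()
        for fmt in SerializationFormat:
            if fmt.value == mime:
                return fmt
    return SerializationFormat[default.upper()]
```

A request with `Content-Type: text/csv` would thus select CSV decoding even when the descriptor's `default_format` is `"json"`.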

PandasSeries (IODescriptor["ext.PdSeries"]):

  • Similar validation pattern: Validates dtype (with safe casting via np.can_cast) and shape with optional enforcement.
  • gRPC protobuf support: Maps between pb.Series messages and pandas Series using the numpy dtype-to-field maps from the numpy IO descriptor module. Rejects mixed-dtype Series for protobuf serialization.
  • Arrow and Spark integration: Converts between RecordBatch/Series and generates Spark schemas.
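The dtype/shape validation pattern shared by both descriptors can be sketched as follows. `validate_series` is a hypothetical stand-in for the method on `PandasSeries` (which raises BentoML's `BadInput` rather than `ValueError`); the safe-cast check via `np.can_cast` matches the behavior described above.

```python
import numpy as np
import pandas as pd


def validate_series(series, dtype=None, enforce_dtype=False, shape=None, enforce_shape=False):
    # Hypothetical sketch of the validation pattern described above.
    if dtype is not None:
        target = np.dtype(dtype)
        if series.dtype != target:
            # With enforcement on, only allow casts that np.can_cast
            # considers safe; otherwise cast best-effort.
            if enforce_dtype and not np.can_cast(series.dtype, target):
                raise ValueError(f"cannot safely cast {series.dtype} to {target}")
            series = series.astype(target)
    if shape is not None and enforce_shape and tuple(series.shape) != tuple(shape):
        raise ValueError(f"expected shape {shape}, got {series.shape}")
    return series
```

For example, an `int64` Series passes a `float64` spec (the cast is safe), while a `float64` Series fails an enforced `int64` spec.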

Supporting infrastructure:

  • SerializationFormat enum maps format names to MIME types.
  • get_parquet_engine() detects available parquet engine (pyarrow or fastparquet).
  • _openapi_types() converts pandas dtypes to OpenAPI type strings.
  • _dataframe_openapi_schema() and _series_openapi_schema() generate orient-aware OpenAPI schemas.
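The dtype-to-OpenAPI conversion can be illustrated with a small sketch. This mapping is written in the spirit of `_openapi_types()` using pandas' dtype `kind` codes; BentoML's exact lookup table may differ.

```python
import pandas as pd


def openapi_type(dtype) -> str:
    # Hypothetical mapping in the spirit of _openapi_types():
    # signed/unsigned integers -> "integer", floats -> "number",
    # booleans -> "boolean", everything else -> "string".
    kind = pd.api.types.pandas_dtype(dtype).kind
    if kind in "iu":
        return "integer"
    if kind == "f":
        return "number"
    if kind == "b":
        return "boolean"
    return "string"
```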

Usage

Use PandasDataFrame for tabular inputs and outputs, and PandasSeries for one-dimensional (single-column) data. Pass a configured instance as the input or output argument of the @svc.api decorator.

Code Reference

Source Location

Signature

class PandasDataFrame(
    IODescriptor["ext.PdDataFrame"],
    descriptor_id="bentoml.io.PandasDataFrame",
    proto_fields=("dataframe",),
):
    def __init__(
        self,
        orient: ext.DataFrameOrient = "records",
        columns: list[str] | None = None,
        apply_column_names: bool = False,
        dtype: bool | ext.PdDTypeArg | None = None,
        enforce_dtype: bool = False,
        shape: tuple[int, ...] | None = None,
        enforce_shape: bool = False,
        default_format: t.Literal["json", "parquet", "csv"] = "json",
    ): ...

class PandasSeries(
    IODescriptor["ext.PdSeries"],
    descriptor_id="bentoml.io.PandasSeries",
    proto_fields=("series",),
):
    def __init__(
        self,
        orient: ext.SeriesOrient = "records",
        dtype: ext.PdDTypeArg | None = None,
        enforce_dtype: bool = False,
        shape: tuple[int, ...] | None = None,
        enforce_shape: bool = False,
    ): ...

class SerializationFormat(Enum):
    JSON = "application/json"
    PARQUET = "application/vnd.apache.parquet"
    CSV = "text/csv"

Import

from bentoml.io import PandasDataFrame
from bentoml.io import PandasSeries

I/O Contract

Inputs (PandasDataFrame)

Name Type Required Description
orient str No JSON orient format: "split", "records", "index", "columns", "values". Defaults to "records".
columns list[str] or None No Column names to apply to the incoming DataFrame.
apply_column_names bool No Whether to rename columns on incoming data. Requires columns to be set.
dtype bool, dict, or None No Data type specification. If bool, pandas infers. If dict, maps column names to dtypes.
enforce_dtype bool No If True, enforces the specified dtype. Defaults to False.
shape tuple[int, ...] or None No Expected shape for validation.
enforce_shape bool No If True, raises BadInput on shape mismatch. Defaults to False.
default_format "json", "parquet", or "csv" No Default serialization format. Defaults to "json".
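The interaction between the columns and apply_column_names parameters above can be sketched in a few lines. `apply_columns` is a hypothetical helper; the real check lives in `PandasDataFrame.validate_dataframe` and raises `BadInput` on a column-count mismatch.

```python
import pandas as pd


def apply_columns(df, columns=None, apply_column_names=False):
    # Hypothetical sketch: renaming only happens when apply_column_names
    # is True, and the supplied names must match the column count.
    if apply_column_names:
        if columns is None or len(columns) != df.shape[1]:
            raise ValueError("length of columns must match the DataFrame's column count")
        df = df.set_axis(columns, axis=1)
    return df
```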

Inputs (PandasSeries)

Name Type Required Description
orient str No JSON orient format. Defaults to "records".
dtype PdDTypeArg or None No Data type specification for the series.
enforce_dtype bool No If True, enforces the specified dtype. Defaults to False.
shape tuple[int, ...] or None No Expected shape for validation.
enforce_shape bool No If True, raises BadInput on shape mismatch. Defaults to False.

Outputs

Name Type Description
PdDataFrame pandas.DataFrame Validated DataFrame from request or serialized for response.
PdSeries pandas.Series Validated Series from request or serialized for response.

Usage Examples

from __future__ import annotations

import bentoml
import pandas as pd
import numpy as np
from bentoml.io import PandasDataFrame, PandasSeries

runner = bentoml.sklearn.get("sklearn_model_clf").to_runner()
svc = bentoml.legacy.Service("iris-classifier", runners=[runner])

# PandasDataFrame with from_sample
input_spec = PandasDataFrame.from_sample(pd.DataFrame(np.array([[5, 4, 3, 2]])))

@svc.api(input=input_spec, output=PandasDataFrame())
def predict(input_arr: pd.DataFrame) -> pd.DataFrame:
    res = runner.run(input_arr)
    return pd.DataFrame(res)


# PandasSeries usage
@svc.api(input=PandasSeries(), output=PandasSeries())
def predict_series(input_series: pd.Series) -> pd.Series:
    res = runner.run(input_series)
    return pd.Series(res)


# Using parquet format
@svc.api(
    input=PandasDataFrame(default_format="parquet"),
    output=PandasDataFrame(default_format="parquet"),
)
def predict_parquet(input_df: pd.DataFrame) -> pd.DataFrame:
    return runner.run(input_df)
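The JSON bodies these endpoints accept follow pandas' own orient conventions, so a local round-trip shows what a "records" payload looks like on the wire (no running service required):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Serialize the way a client would build a "records"-oriented request body.
payload = df.to_json(orient="records")
# payload == '[{"a":1,"b":3},{"a":2,"b":4}]'

# Deserialize the way the descriptor decodes it on the server side.
restored = pd.read_json(io.StringIO(payload), orient="records")
```

A client would POST that payload with `Content-Type: application/json`; per the format-inference behavior above, a Parquet or CSV body would instead be sent with the corresponding MIME type.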
