Implementation:Huggingface Datasets NumpyFormatter

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

NumpyFormatter is the concrete class in the Hugging Face Datasets library that converts Arrow table data to NumPy arrays.

Description

NumpyFormatter is a formatting class that extends TensorFormatter and converts Arrow data to np.ndarray objects. It provides three formatting methods: format_row (a single example), format_column (a single column), and format_batch (a batch of examples). Each method first extracts NumPy data from the Arrow table, then recursively tensorizes the resulting structure with np.asarray(), applying default dtypes (int64 for integers, float32 for floats) unless np_array_kwargs overrides them. Lists of same-shaped arrays are consolidated with np.stack(); variable-length lists are placed in an object-dtype array. String, bytes, None, and character values are passed through unchanged. PyTorch tensors are detached and converted to NumPy arrays before tensorization.
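The tensorization rules above can be illustrated with a simplified, NumPy-only sketch. This is not the library's source: the function names tensorize, consolidate, and recursive_tensorize are hypothetical stand-ins for the private methods, and feature decoding and PyTorch handling are left out.

```python
import numpy as np


def tensorize(value, **np_array_kwargs):
    """Convert one leaf value, applying the default dtypes described above."""
    # Strings, bytes, and None are passed through unchanged.
    if value is None or isinstance(value, (str, bytes)):
        return value
    arr = np.asarray(value)
    # Character (string/bytes) arrays are passed through as well.
    if np.issubdtype(arr.dtype, np.character):
        return arr
    if np.issubdtype(arr.dtype, np.integer):
        defaults = {"dtype": np.int64}
    elif np.issubdtype(arr.dtype, np.floating):
        defaults = {"dtype": np.float32}
    else:
        defaults = {}
    # Caller-supplied keyword arguments take precedence over the defaults.
    return np.asarray(arr, **{**defaults, **np_array_kwargs})


def consolidate(column):
    """Stack same-shaped arrays; otherwise fall back to an object-dtype array."""
    if not isinstance(column, list):
        return column
    first = column[0] if column else None
    if column and all(
        isinstance(x, np.ndarray) and x.shape == first.shape and x.dtype == first.dtype
        for x in column
    ):
        return np.stack(column)
    out = np.empty(len(column), dtype=object)
    for i, item in enumerate(column):
        out[i] = item  # element-wise assignment keeps ragged items intact
    return out


def recursive_tensorize(data):
    """Walk dicts and lists, tensorizing leaves and consolidating lists."""
    if isinstance(data, dict):
        return {key: recursive_tensorize(val) for key, val in data.items()}
    if isinstance(data, (list, tuple)):
        return consolidate([recursive_tensorize(item) for item in data])
    return tensorize(data)


nested = recursive_tensorize(
    {"a": [[1, 2], [3, 4]], "b": [[1.0], [2.0, 3.0]], "c": "text"}
)
print(nested["a"].shape, nested["a"].dtype)  # (2, 2) int64
print(nested["b"].dtype)                     # object
```

The regular list "a" becomes one stacked int64 array, while the ragged list "b" becomes an object-dtype array whose elements are float32 arrays.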

Usage

NumpyFormatter is typically not instantiated directly by users. It is automatically selected when Dataset.with_format("numpy") or Dataset.set_format("numpy") is called. It is also used internally by the to_tf_dataset pipeline.

Code Reference

Source Location

Repository: datasets
File: src/datasets/formatting/np_formatter.py
Lines: L26-L117

Signature

class NumpyFormatter(TensorFormatter[Mapping, np.ndarray, Mapping]):
    def __init__(self, features=None, token_per_repo_id=None, **np_array_kwargs): ...

    def _consolidate(self, column): ...
    def _tensorize(self, value): ...
    def _recursive_tensorize(self, data_struct): ...
    def recursive_tensorize(self, data_struct: dict): ...
    def format_row(self, pa_table: pa.Table) -> Mapping: ...
    def format_column(self, pa_table: pa.Table) -> np.ndarray: ...
    def format_batch(self, pa_table: pa.Table) -> Mapping: ...

Import

from datasets.formatting.np_formatter import NumpyFormatter

I/O Contract

Inputs

Name	Type	Required	Description
features	`Optional[Features]`	No	Dataset features for decoding special types (e.g., Image, Audio).
token_per_repo_id	`Optional[dict]`	No	Authentication tokens for accessing private repositories.
**np_array_kwargs		No	Additional keyword arguments forwarded to np.asarray() (e.g., dtype).
pa_table	`pa.Table`	Yes (for format methods)	The Arrow table to convert. Passed to format_row, format_column, or format_batch.

Outputs

Name	Type	Description
row	`Mapping`	A dict mapping column names to NumPy array values (from format_row).
column	`np.ndarray`	A single NumPy array for the column (from format_column).
batch	`Mapping`	A dict mapping column names to batched NumPy array values (from format_batch).

Usage Examples

Basic Usage

from datasets import load_dataset

# NumpyFormatter is used automatically when format is "numpy"
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
ds = ds.with_format("numpy")

# Accessing a row returns NumPy values; a scalar column such as "label" yields a NumPy scalar
row = ds[0]
print(type(row["label"]))  # <class 'numpy.int64'>

# Accessing a batch returns stacked NumPy arrays
batch = ds[:8]
print(batch["label"].shape)  # (8,)
print(batch["label"].dtype)  # int64

Related Pages

Implements Principle

Principle:Huggingface_Datasets_NumPy_Formatting
