Principle:Huggingface Datasets NumPy Formatting

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

NumPy Formatting is the principle of converting Arrow table data to NumPy arrays for use in numerical computation and as an intermediate step for other framework conversions.

Description

When a dataset's format is set to "numpy", the NumPy Formatting principle governs how Arrow columns are converted to np.ndarray objects. The conversion extracts arrays from Arrow, applies dtype defaults (int64 for integers, float32 for floats), and uses np.asarray() or np.array() to produce the result. String, bytes, and None values are returned as-is. PIL images are converted via np.asarray(). Lists of same-shape arrays are consolidated via np.stack(); variable-length lists are placed into an object-dtype array. NumPy formatting is also used internally as the foundation for the to_tf_dataset pipeline.
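The conversion rules above can be sketched in plain NumPy. This is an illustration only, not the library's code: the helper names tensorize and consolidate are hypothetical, and PIL image handling is left out.

```python
import numpy as np


def tensorize(value):
    """Hypothetical helper that applies the dtype defaults described above."""
    # Strings, bytes, and None are passed through unchanged.
    if isinstance(value, (str, bytes, type(None))):
        return value
    arr = np.asarray(value)
    if np.issubdtype(arr.dtype, np.integer):
        return arr.astype(np.int64, copy=False)    # integer default
    if np.issubdtype(arr.dtype, np.floating):
        return arr.astype(np.float32, copy=False)  # float default
    return arr


def consolidate(items):
    """Hypothetical helper: stack same-shape arrays, else use an object array."""
    arrays = [tensorize(item) for item in items]
    if arrays and all(isinstance(a, np.ndarray) for a in arrays):
        first = arrays[0]
        if all(a.shape == first.shape and a.dtype == first.dtype for a in arrays):
            return np.stack(arrays)
    # Variable-length (or mixed) entries: one Python object per slot.
    out = np.empty(len(arrays), dtype=object)
    for i, a in enumerate(arrays):
        out[i] = a
    return out


stacked = consolidate([[1, 2], [3, 4]])          # same shape, so stacked
ragged = consolidate([[1.0, 2.0, 3.0], [4.0]])   # ragged, so object dtype
print(stacked.shape, stacked.dtype)
print(ragged.dtype, [a.dtype for a in ragged])
```

The fallback fills the object array element by element so that entries of any shape can sit side by side.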

Usage

Use NumPy Formatting when you need raw NumPy arrays for scientific computing, custom preprocessing, or interoperation with libraries that expect NumPy inputs (scikit-learn, SciPy, matplotlib, etc.). It is also the format automatically applied when creating a tf.data.Dataset via to_tf_dataset.

Theoretical Basis

NumPy arrays are the standard in-memory representation for numerical data in the Python scientific computing ecosystem. Converting from Arrow to NumPy is efficient for numeric columns because Arrow stores data as contiguous typed buffers, which can be viewed as NumPy arrays without copying when the column is a primitive type with no nulls. A copy is still needed whenever values must be cast, for example from float64 to the float32 default. The formatter handles nested data by consolidating compatible sub-arrays where possible and falling back to object-dtype arrays for variable-length data.

Related Pages

Implemented By

Implementation:Huggingface_Datasets_NumpyFormatter
