Principle:Huggingface Datasets NumPy Formatting
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
NumPy Formatting is the principle of converting Arrow table data to NumPy arrays for use in numerical computation and as an intermediate step for other framework conversions.
Description
When a dataset's format is set to "numpy" (for example with with_format("numpy") or set_format("numpy")), the NumPy Formatting principle governs how Arrow columns are converted to np.ndarray objects. The conversion extracts arrays from Arrow, applies dtype defaults (int64 for integers, float32 for floats), and uses np.asarray() or np.array() to produce the result. String, bytes, and None values are returned as-is. PIL images are converted via np.asarray(). Lists of same-shape arrays are consolidated into a single array via np.stack(); variable-length lists are placed into a one-dimensional object-dtype array whose elements are the individual arrays. NumPy formatting is also used internally as the foundation for the to_tf_dataset pipeline.
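The rules above can be illustrated with a small NumPy-only sketch. The helper names tensorize_sketch and consolidate_sketch are hypothetical and are not the library's own functions; they only mirror the behaviour described in this section (dtype defaults, pass-through of strings and None, stacking versus object-dtype fallback) and assume a non-empty list of arrays.

```python
import numpy as np


def tensorize_sketch(value):
    """Hypothetical helper: apply the dtype defaults described above."""
    # Strings, bytes and None are returned unchanged.
    if isinstance(value, (str, bytes, type(None))):
        return value
    arr = np.asarray(value)
    # Integers default to int64, floats to float32.
    if np.issubdtype(arr.dtype, np.integer):
        return arr.astype(np.int64, copy=False)
    if np.issubdtype(arr.dtype, np.floating):
        return arr.astype(np.float32, copy=False)
    return arr


def consolidate_sketch(arrays):
    """Hypothetical helper: stack compatible arrays, else use object dtype."""
    first = arrays[0]  # assumes a non-empty list
    if all(a.shape == first.shape and a.dtype == first.dtype for a in arrays):
        return np.stack(arrays)
    # Variable-length data: one object-dtype slot per sub-array.
    out = np.empty(len(arrays), dtype=object)
    for i, a in enumerate(arrays):
        out[i] = a
    return out


ints = tensorize_sketch([1, 2, 3])
floats = tensorize_sketch([0.5, 1.5])
stacked = consolidate_sketch([np.zeros(3), np.ones(3)])
ragged = consolidate_sketch([np.zeros(2), np.zeros(3)])
print(ints.dtype, floats.dtype, stacked.shape, ragged.dtype)
```

The object-dtype fallback keeps ragged data usable, but element-wise NumPy operations will not vectorize over it, so padding or truncating to a common shape beforehand is usually preferable for numerical work.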
Usage
Use NumPy Formatting when you need raw NumPy arrays for scientific computing, custom preprocessing, or interoperation with libraries that expect NumPy inputs (scikit-learn, SciPy, matplotlib, etc.). It is also the format automatically applied when creating a tf.data.Dataset via to_tf_dataset.
Theoretical Basis
NumPy arrays are the standard in-memory representation for numerical data in the Python scientific computing ecosystem. Converting from Arrow to NumPy is efficient for numeric columns because Arrow stores data as contiguous typed buffers. For primitive types without nulls, such a buffer can be exposed to NumPy as a zero-copy view; a copy is still required when the column contains nulls or when a dtype default changes the element type (for example float64 to float32). The formatter handles edge cases such as object-dtype arrays (used for variable-length nested data) by consolidating compatible sub-arrays where possible and falling back to object arrays otherwise.