Principle:Huggingface Datasets Framework Tensor Conversion
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Framework Tensor Conversion is the principle of configuring a HuggingFace Dataset so that its __getitem__ method returns data as ML framework tensors (PyTorch, TensorFlow, JAX, or NumPy) instead of plain Python objects.
Description
By default, accessing elements of a HuggingFace Dataset returns Python dicts with native Python types. The Framework Tensor Conversion principle introduces a format layer that intercepts __getitem__ calls and converts the underlying Arrow data into tensors of the chosen framework. The with_format method returns a new Dataset with the format applied on-the-fly, without copying or modifying the underlying data. Supported format types include "numpy", "torch", "tensorflow", "jax", "arrow", "pandas", and "polars". Column selection and output-all-columns options provide fine-grained control over which columns are converted.
Usage
Use Framework Tensor Conversion when you need to feed dataset rows or batches directly into model forward passes, loss computations, or other tensor operations. Setting the format avoids manual conversion code and ensures consistent dtype handling (e.g., integers default to int64, floats to float32).
Theoretical Basis
The conversion bridges the gap between Arrow's columnar binary representation and the tensor abstractions expected by ML frameworks. Under the hood, the format layer first extracts NumPy arrays from Arrow columns (a near-zero-copy operation for numeric types), then converts those arrays to the target framework's tensor type. Each framework formatter handles edge cases such as ragged arrays, string columns, PIL images, and video readers. The conversion is lazy: it happens only when data is accessed, keeping the underlying Arrow data immutable and shared.