Implementation:Huggingface Datasets Dataset Set Format
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Preprocessing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for setting the output format of a dataset in-place.
Description
The set_format method configures the dataset's __getitem__ return format in-place. After calling this method, accessing dataset elements returns data in the specified framework format (e.g., NumPy arrays, PyTorch tensors, TensorFlow tensors, JAX arrays, Arrow tables, pandas DataFrames, or Polars DataFrames). The formatting is applied on-the-fly during data access and does not modify the underlying Arrow data. You can optionally restrict which columns are included in the formatted output. The format can be reset to Python objects by calling set_format() with no arguments or using reset_format().
Usage
Use Dataset.set_format when you want to configure a dataset for a training loop that expects a specific tensor format, and you want to set the format once on the existing dataset object without creating a copy.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L2574-L2651
Signature
@fingerprint_transform(inplace=True)
def set_format(
self,
type: Optional[str] = None,
columns: Optional[list] = None,
output_all_columns: bool = False,
**format_kwargs,
):
Import
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds.set_format(type="numpy", columns=["text", "label"])
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| type | Optional[str] | No | Output format type: None (Python objects), "numpy", "torch", "tensorflow", "jax", "arrow", "pandas", or "polars". Defaults to None. |
| columns | Optional[list] | No | Columns to include in formatted output. None means all columns. Defaults to None. |
| output_all_columns | bool | No | Keep un-formatted columns in output as Python objects. Defaults to False. |
| **format_kwargs | keyword arguments | No | Additional arguments passed to the conversion function (e.g., np.array, torch.tensor). |
Outputs
| Name | Type | Description |
|---|---|---|
| return | None | This method modifies the dataset in-place and returns nothing. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
# Set format to NumPy for specific columns
ds.set_format(type="numpy", columns=["text", "label"])
print(ds.format)
# {'type': 'numpy', 'format_kwargs': {}, 'columns': ['text', 'label'], 'output_all_columns': False}
# Reset to default Python objects
ds.reset_format()
print(ds.format["type"])
# None