Implementation:Huggingface Datasets Dataset Flatten
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Preprocessing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for flattening nested struct columns into top-level columns.
Description
The flatten method returns a copy of the dataset in which each struct-typed column is expanded into one column per struct field, using dot-separated names (e.g., "answers.text", "answers.answer_start"). Non-struct columns are left unchanged. Flattening is applied recursively up to a configurable maximum depth (max_depth, default 16), so structs nested within that limit are fully expanded, while anything nested more deeply remains a struct column. Internally, the operation repeatedly calls Apache Arrow's flatten on the underlying table until no struct columns remain or the maximum depth is reached.
Usage
Use Dataset.flatten when you have a dataset with nested struct columns and need to access individual sub-fields as independent top-level columns for operations like renaming, removing, or formatting.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L2034-L2079
Signature
@fingerprint_transform(inplace=False)
def flatten(self, new_fingerprint: Optional[str] = None, max_depth=16) -> "Dataset":
Import
from datasets import load_dataset
ds = load_dataset("rajpurkar/squad", split="train")
ds = ds.flatten()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| new_fingerprint | Optional[str] | No | The new fingerprint of the dataset after transform. If None, computed automatically. |
| max_depth | int | No | Maximum depth to which nested structs are flattened. Defaults to 16. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A copy of the dataset with flattened columns. Struct fields become top-level columns with dot-separated names. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("rajpurkar/squad", split="train")
print(ds.features)
# {'id': Value('string'), 'title': Value('string'), 'context': Value('string'),
# 'question': Value('string'),
# 'answers': {'text': Sequence(Value('string')), 'answer_start': Sequence(Value('int32'))}}
ds = ds.flatten()
print(ds.column_names)
# ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start']