Implementation:Huggingface Datasets Dataset Flatten

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, ML_Preprocessing
Last Updated	2026-02-14 18:00 GMT

Overview

Dataset.flatten is the concrete tool in the HuggingFace Datasets library for flattening nested struct columns into top-level columns.

Description

The flatten method returns a copy of the dataset in which each struct-typed column is expanded into one column per struct field, named with a dot-separated path (e.g., "answers.text", "answers.answer_start"). Non-struct columns are left unchanged. Flattening is applied level by level, bounded by a configurable max_depth (default 16), so nested structs are expanded fully unless they are nested deeper than that limit, in which case the remaining levels stay as struct columns. Internally, the method repeatedly calls Apache Arrow's flatten on the underlying table until no struct columns remain or the maximum depth is reached.

Usage

Use Dataset.flatten when you have a dataset with nested struct columns and need to access individual sub-fields as independent top-level columns for operations like renaming, removing, or formatting.

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: L2034-L2079

Signature

@fingerprint_transform(inplace=False)
def flatten(self, new_fingerprint: Optional[str] = None, max_depth=16) -> "Dataset":

Import

from datasets import load_dataset

# flatten is a method on Dataset, so it needs no import of its own
ds = load_dataset("rajpurkar/squad", split="train")
ds = ds.flatten()

I/O Contract

Inputs

Name	Type	Required	Description
new_fingerprint	`Optional[str]`	No	The new fingerprint of the dataset after transform. If `None`, computed automatically.
max_depth	`int`	No	Maximum depth to which nested structs are flattened. Defaults to 16.

Outputs

Name	Type	Description
return	`Dataset`	A copy of the dataset with flattened columns. Struct fields become top-level columns with dot-separated names.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("rajpurkar/squad", split="train")
print(ds.features)
# {'id': Value('string'), 'title': Value('string'), 'context': Value('string'),
#  'question': Value('string'),
#  'answers': {'text': Sequence(Value('string')), 'answer_start': Sequence(Value('int32'))}}

ds = ds.flatten()
print(ds.column_names)
# ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start']

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Struct_Flattening
