Implementation:Huggingface Datasets Dataset Flatten

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, ML_Preprocessing
Last Updated 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the Hugging Face Datasets library, for flattening nested struct columns into top-level columns.

Description

The flatten method returns a copy of the dataset in which each struct-typed column is expanded into one column per struct field, named with dot-separated paths (e.g., "answers.text", "answers.answer_start"). Non-struct columns are left unchanged. Flattening is applied repeatedly, bounded by the max_depth parameter (default 16), so structs nested within that bound are fully expanded. Internally, the method calls Apache Arrow's Table.flatten on the underlying table until no struct columns remain or the depth limit is reached.

Usage

Use Dataset.flatten when you have a dataset with nested struct columns and need to access individual sub-fields as independent top-level columns for operations like renaming, removing, or formatting.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L2034-L2079

Signature

@fingerprint_transform(inplace=False)
def flatten(self, new_fingerprint: Optional[str] = None, max_depth=16) -> "Dataset":

Import

from datasets import load_dataset

ds = load_dataset("rajpurkar/squad", split="train")
ds = ds.flatten()

I/O Contract

Inputs

Name | Type | Required | Description
new_fingerprint | Optional[str] | No | The new fingerprint of the dataset after transform. If None, computed automatically.
max_depth | int | No | Maximum depth to which nested structs are flattened. Defaults to 16.

Outputs

Name | Type | Description
return | Dataset | A copy of the dataset with flattened columns. Struct fields become top-level columns with dot-separated names.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("rajpurkar/squad", split="train")
print(ds.features)
# {'id': Value('string'), 'title': Value('string'), 'context': Value('string'),
#  'question': Value('string'),
#  'answers': {'text': Sequence(Value('string')), 'answer_start': Sequence(Value('int32'))}}

ds = ds.flatten()
print(ds.column_names)
# ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start']

Related Pages

Implements Principle