Implementation:Huggingface Datasets DatasetDict

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for organizing multiple dataset splits into a single dictionary container.

Description

DatasetDict is a dictionary subclass (dict[str | NamedSplit, Dataset]) that holds multiple dataset splits and provides dataset transformation methods (map, filter, sort, rename_column, etc.) that apply the same operation to every split. It validates that all contained values are Dataset instances and that all splits share the same Features schema. The class supports the context manager protocol for resource cleanup, exposes per-split properties (features, column_names, num_rows, etc.) as dictionaries keyed by split name, and includes methods for serialization (save_to_disk, push_to_hub) and format conversion (set_format, with_format).

Usage

Use DatasetDict as the container for multi-split datasets. It is the default return type of load_dataset when no split is specified (in non-streaming mode), and is the standard way to organize train/test/validation splits for Hub publishing.

Code Reference

Source Location

Repository: datasets
File: src/datasets/dataset_dict.py
Lines: 57-1983

Signature

class DatasetDict(dict[Union[str, NamedSplit], "Dataset"]):
    """A dictionary (dict of str: datasets.Dataset) with dataset transforms methods (map, filter, etc.)"""

Import

from datasets import DatasetDict

I/O Contract

Inputs

Name	Type	Required	Description
*args / **kwargs	`dict[str, Dataset]`	Yes	Mapping of split names to Dataset objects, same as a regular dict constructor.

Outputs

Name	Type	Description
instance	`DatasetDict`	A DatasetDict containing multiple named splits with shared transformation methods.

Usage Examples

Basic Usage

from datasets import Dataset, DatasetDict

train_ds = Dataset.from_dict({"text": ["Hello", "World"], "label": [1, 0]})
test_ds = Dataset.from_dict({"text": ["Test"], "label": [1]})

dataset_dict = DatasetDict({
    "train": train_ds,
    "test": test_ds,
})

print(dataset_dict)
# DatasetDict({
#     train: Dataset({ features: ['text', 'label'], num_rows: 2 })
#     test: Dataset({ features: ['text', 'label'], num_rows: 1 })
# })

# Apply transformations to all splits
dataset_dict = dataset_dict.rename_column("text", "sentence")

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Multi_Split_Organization
