Implementation:Huggingface Datasets DatasetDict
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for organizing multiple dataset splits into a single dictionary container.
Description
DatasetDict is a dictionary subclass (dict[str | NamedSplit, Dataset]) that holds multiple dataset splits and provides dataset transformation methods (map, filter, sort, rename_column, etc.) that apply the same operation to every split. It validates that all contained values are Dataset instances and that all splits share the same Features schema. The class supports the context manager protocol for resource cleanup, exposes per-split properties (features, column_names, num_rows, etc.) as dictionaries keyed by split name, and includes methods for serialization (save_to_disk, push_to_hub) and format conversion (set_format, with_format).
Usage
Use DatasetDict as the container for multi-split datasets. It is the default return type of load_dataset when no split is specified (in non-streaming mode), and it is the standard way to organize train/test/validation splits for Hub publishing.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/dataset_dict.py
- Lines: 57-1983
Signature
class DatasetDict(dict[Union[str, NamedSplit], "Dataset"]):
    """A dictionary (dict of str: datasets.Dataset) with dataset transforms methods (map, filter, etc.)"""
Import
from datasets import DatasetDict
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| *args / **kwargs | dict[str, Dataset] | Yes | Mapping of split names to Dataset objects, same as a regular dict constructor. |
Outputs
| Name | Type | Description |
|---|---|---|
| instance | DatasetDict | A DatasetDict containing multiple named splits with shared transformation methods. |
Usage Examples
Basic Usage
from datasets import Dataset, DatasetDict
train_ds = Dataset.from_dict({"text": ["Hello", "World"], "label": [1, 0]})
test_ds = Dataset.from_dict({"text": ["Test"], "label": [1]})
dataset_dict = DatasetDict({
    "train": train_ds,
    "test": test_ds,
})
print(dataset_dict)
# DatasetDict({
# train: Dataset({ features: ['text', 'label'], num_rows: 2 })
# test: Dataset({ features: ['text', 'label'], num_rows: 1 })
# })
# Apply transformations to all splits
dataset_dict = dataset_dict.rename_column("text", "sentence")