
Implementation:Huggingface Datasets DatasetDict

From Leeroopedia
Revision as of 12:58, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datasets_DatasetDict.md)
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

DatasetDict is the concrete tool in the Hugging Face Datasets library for organizing multiple dataset splits in a single dictionary container.

Description

DatasetDict is a dictionary subclass (dict[str | NamedSplit, Dataset]) that holds multiple dataset splits and provides dataset transformation methods (map, filter, sort, rename_column, etc.) that apply the same operation to every split and return a new DatasetDict. Its methods check that all contained values are Dataset instances, and operations that need a common schema check that all splits share the same Features. The class supports the context manager protocol for resource cleanup, exposes per-split properties (features, column_names, num_rows, etc.) as dictionaries keyed by split name, and includes methods for serialization (save_to_disk, push_to_hub) and format conversion (set_format, with_format).

Usage

Use DatasetDict as the container for multi-split datasets. It is the default return type of load_dataset when no split is specified (in non-streaming mode), and it is the standard way to organize train/test/validation splits for Hub publishing.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/dataset_dict.py
  • Lines: 57-1983

Signature

class DatasetDict(dict[Union[str, NamedSplit], "Dataset"]):
    """A dictionary (dict of str: datasets.Dataset) with dataset transforms methods (map, filter, etc.)"""

Import

from datasets import DatasetDict

I/O Contract

Inputs

Name | Type | Required | Description
*args / **kwargs | dict[str, Dataset] | Yes | Mapping of split names to Dataset objects, same as a regular dict constructor.

Outputs

Name | Type | Description
instance | DatasetDict | A DatasetDict containing multiple named splits with shared transformation methods.

Usage Examples

Basic Usage

from datasets import Dataset, DatasetDict

train_ds = Dataset.from_dict({"text": ["Hello", "World"], "label": [1, 0]})
test_ds = Dataset.from_dict({"text": ["Test"], "label": [1]})

dataset_dict = DatasetDict({
    "train": train_ds,
    "test": test_ds,
})

print(dataset_dict)
# Output (condensed here to one line per split):
# DatasetDict({
#     train: Dataset({ features: ['text', 'label'], num_rows: 2 })
#     test: Dataset({ features: ['text', 'label'], num_rows: 1 })
# })

# Apply transformations to all splits
dataset_dict = dataset_dict.rename_column("text", "sentence")

Related Pages

Implements Principle
