Implementation:Huggingface Datasets DatasetDict Push To Hub

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

DatasetDict.push_to_hub is the Hugging Face Datasets library's method for publishing a multi-split dataset to the Hugging Face Hub.

Description

DatasetDict.push_to_hub uploads every split of a DatasetDict to the Hub as Parquet files in a single atomic commit. Each split is serialized into Parquet shards independently, and the shard additions, metadata updates, and deletions of stale shards are then combined into one commit. The method checks that features are consistent across splits, computes aggregate size statistics, generates or updates the dataset card YAML with configuration and split metadata, and accepts a per-split shard count through the num_shards dictionary. The keys of the DatasetDict are preserved as the split names on the Hub.
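The consistency check and per-split sharding described above can be illustrated with a small, library-free sketch. The helper names `check_same_features` and `plan_shard_paths` are hypothetical, and the `<split>-<index>-of-<total>.parquet` naming under a `data` directory is an assumption based on how pushed repositories are commonly laid out, not a statement about the library's internals.

```python
from typing import Dict, List


def check_same_features(split_features: Dict[str, Dict[str, str]]) -> None:
    """Raise ValueError unless every split declares the same columns and types."""
    reference_name, reference = None, None
    for name, features in split_features.items():
        if reference is None:
            # The first split seen becomes the schema all others must match.
            reference_name, reference = name, features
        elif features != reference:
            raise ValueError(
                f"Split {name!r} has features {features}, "
                f"but split {reference_name!r} has {reference}"
            )


def plan_shard_paths(num_shards: Dict[str, int], data_dir: str = "data") -> List[str]:
    """List the shard file paths a push would add, one group per split.

    The naming pattern is an assumption made for illustration.
    """
    paths = []
    for split, count in num_shards.items():
        if count < 1:
            raise ValueError(f"Split {split!r} needs at least one shard")
        for index in range(count):
            paths.append(f"{data_dir}/{split}-{index:05d}-of-{count:05d}.parquet")
    return paths


# Both splits share one schema, so the check passes silently.
check_same_features({
    "train": {"text": "string", "label": "int64"},
    "test": {"text": "string", "label": "int64"},
})

# All planned files would then go into one commit.
paths = plan_shard_paths({"train": 2, "test": 1})
```

Collecting every path before committing is what lets additions for all splits land together rather than split by split.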

Usage

Use DatasetDict.push_to_hub to publish a complete multi-split dataset to the Hub in one operation. This is the standard method for sharing datasets that have train/test/validation splits.

Code Reference

Source Location

Repository: datasets
File: src/datasets/dataset_dict.py
Lines: 1616-1983

Signature

def push_to_hub(
    self,
    repo_id,
    config_name: str = "default",
    set_default: Optional[bool] = None,
    data_dir: Optional[str] = None,
    commit_message: Optional[str] = None,
    commit_description: Optional[str] = None,
    private: Optional[bool] = None,
    token: Optional[str] = None,
    revision: Optional[str] = None,
    create_pr: Optional[bool] = False,
    max_shard_size: Optional[Union[int, str]] = None,
    num_shards: Optional[dict[str, int]] = None,
    embed_external_files: bool = True,
    num_proc: Optional[int] = None,
) -> CommitInfo:

Import

from datasets import DatasetDict

I/O Contract

Inputs

Name	Type	Required	Description
repo_id	`str`	Yes	Repository ID in format `<user>/<dataset_name>` or `<org>/<dataset_name>`.
config_name	`str`	No	Configuration (subset) name. Defaults to "default".
set_default	`bool`	No	Whether to set this config as the default.
data_dir	`str`	No	Directory name for uploaded data files.
commit_message	`str`	No	Commit message. Defaults to "Upload dataset".
commit_description	`str`	No	Description for the commit or PR.
private	`bool`	No	Whether the repo is private.
token	`str`	No	Authentication token for the Hub.
revision	`str`	No	Branch to push to. Defaults to "main".
create_pr	`bool`	No	Whether to create a pull request. Defaults to False.
max_shard_size	`int or str`	No	Maximum shard size (e.g., "500MB").
num_shards	`dict[str, int]`	No	Per-split number of shards, e.g., `{"train": 128, "test": 4}`.
embed_external_files	`bool`	No	Whether to embed Image/Audio/Video file bytes. Defaults to True.
num_proc	`int`	No	Number of processes for preparation and upload.
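To make the relationship between `max_shard_size`, `num_shards`, and `num_proc` more concrete, here is a rough sketch of how a shard count could be derived when `num_shards` is not given. The functions `size_to_bytes` and `estimate_num_shards` are hypothetical, and both the unit table and the rule of using at least one shard per process are assumptions of this sketch rather than confirmed library behavior.

```python
import re
from typing import Optional, Union

# Decimal and binary units; keys are upper-cased for case-insensitive lookup.
_UNITS = {
    "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30,
}


def size_to_bytes(size: Union[int, str]) -> int:
    """Convert a size such as "500MB" or 1024 into a number of bytes."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", size.strip())
    if match is None:
        raise ValueError(f"Cannot parse size: {size!r}")
    number, unit = match.groups()
    if unit.upper() not in _UNITS:
        raise ValueError(f"Unknown unit in size: {size!r}")
    return int(number) * _UNITS[unit.upper()]


def estimate_num_shards(
    dataset_nbytes: int,
    max_shard_size: Union[int, str] = "500MB",
    num_proc: Optional[int] = None,
) -> int:
    """Estimate a shard count so that each shard stays under max_shard_size."""
    limit = size_to_bytes(max_shard_size)
    shards = dataset_nbytes // limit + 1
    # Assumption: every worker process should receive at least one shard.
    return max(shards, num_proc or 1)


# A 1.2 GB split with a 500 MB limit needs three shards under this rule.
shards_for_train = estimate_num_shards(1_200_000_000, "500MB")
```

Passing an explicit `num_shards` dictionary bypasses this kind of estimate for the splits it names.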

Outputs

Name	Type	Description
return	`huggingface_hub.CommitInfo`	Information about the commit that was created.

Usage Examples

Basic Usage

from datasets import DatasetDict, Dataset

dataset_dict = DatasetDict({
    "train": Dataset.from_dict({"text": ["Hello", "World"], "label": [1, 0]}),
    "test": Dataset.from_dict({"text": ["Test"], "label": [1]}),
})

# Push all splits to Hub
dataset_dict.push_to_hub("my-username/my-dataset")

# Push with per-split shard configuration
# (shard counts are kept at or below each split's row count)
dataset_dict.push_to_hub(
    "my-username/my-dataset",
    num_shards={"train": 2, "test": 1},
)

# Push as a named configuration
dataset_dict.push_to_hub("my-username/my-dataset", config_name="en")
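Because `num_shards` is keyed by split name, a typo in a key is easy to miss before a long upload. The sketch below shows one way to check the dictionary up front. `normalize_num_shards` is a hypothetical helper, not part of the Datasets library, and treating unlisted splits as automatically sharded (`None`) is an assumption of this sketch.

```python
from typing import Dict, Iterable, Optional


def normalize_num_shards(
    split_names: Iterable[str],
    num_shards: Optional[Dict[str, int]] = None,
) -> Dict[str, Optional[int]]:
    """Return a shard count (or None for automatic) for every split."""
    splits = list(split_names)
    if num_shards is None:
        return {split: None for split in splits}
    if not isinstance(num_shards, dict):
        raise ValueError("num_shards must be a dict such as {'train': 128, 'test': 4}")
    # Reject keys that do not correspond to an existing split.
    unknown = sorted(set(num_shards) - set(splits))
    if unknown:
        raise ValueError(f"num_shards names unknown splits: {unknown}")
    for split, count in num_shards.items():
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(count, bool) or not isinstance(count, int) or count < 1:
            raise ValueError(f"Shard count for {split!r} must be a positive integer")
    return {split: num_shards.get(split) for split in splits}


# With a DatasetDict, the split names would come from its keys.
checked = normalize_num_shards(["train", "test"], {"train": 2})
```

The checked dictionary can then be filtered to its non-`None` entries and passed as the `num_shards` argument.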
