Implementation:Huggingface Datasets DatasetDict Push To Hub

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the Hugging Face Datasets library, for publishing multi-split datasets to the Hugging Face Hub.

Description

DatasetDict.push_to_hub pushes all splits of a DatasetDict to the Hub as Parquet files in a single atomic commit. Each split is serialized into Parquet shards independently, but all additions, metadata updates, and old shard deletions are combined into one commit. The method validates feature consistency across splits, computes aggregate size statistics, generates or updates the dataset card YAML with configuration and split metadata, and supports per-split shard count configuration via a dictionary. The original split names from the DatasetDict keys are preserved on the Hub.
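
The way new shard additions and old shard deletions end up in the same commit can be sketched without touching the network. The snippet below is a minimal illustration, not the library's implementation: `shard_paths` and `plan_commit` are hypothetical helpers, and the `<data_dir>/<split>-<index>-of-<total>.parquet` naming pattern and the `data` directory are assumptions about how pushed repositories are typically laid out.

```python
def shard_paths(split, num_shards, data_dir="data"):
    """Hypothetical helper: file names for every shard of one split.

    The naming pattern is an assumption, not taken from the library source.
    """
    return [
        f"{data_dir}/{split}-{index:05d}-of-{num_shards:05d}.parquet"
        for index in range(num_shards)
    ]


def plan_commit(existing_files, num_shards, data_dir="data"):
    """Hypothetical helper: collect additions and deletions for one commit.

    existing_files: paths already present in the repository.
    num_shards: mapping of split name to the number of shards to write.
    """
    additions = []
    for split, count in num_shards.items():
        additions.extend(shard_paths(split, count, data_dir))

    new_paths = set(additions)
    # Only files that belong to the splits being pushed are candidates for deletion.
    prefixes = tuple(f"{data_dir}/{split}-" for split in num_shards)
    deletions = [
        path
        for path in existing_files
        if path.startswith(prefixes) and path not in new_paths
    ]
    return additions, deletions


existing = [
    "README.md",
    "data/train-00000-of-00003.parquet",
    "data/train-00001-of-00003.parquet",
    "data/train-00002-of-00003.parquet",
]
additions, deletions = plan_commit(existing, {"train": 2, "test": 1})
```

In this sketch the three old `train` shards are scheduled for deletion because re-sharding into two files changes every file name, while `README.md` is left alone. Applying both lists together is what keeps the repository from ever showing a mix of old and new shards.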

Usage

Use DatasetDict.push_to_hub to publish a complete multi-split dataset to the Hub in one operation. This is the standard method for sharing datasets that have train/test/validation splits.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/dataset_dict.py
  • Lines: 1616-1983

Signature

def push_to_hub(
    self,
    repo_id,
    config_name: str = "default",
    set_default: Optional[bool] = None,
    data_dir: Optional[str] = None,
    commit_message: Optional[str] = None,
    commit_description: Optional[str] = None,
    private: Optional[bool] = None,
    token: Optional[str] = None,
    revision: Optional[str] = None,
    create_pr: Optional[bool] = False,
    max_shard_size: Optional[Union[int, str]] = None,
    num_shards: Optional[dict[str, int]] = None,
    embed_external_files: bool = True,
    num_proc: Optional[int] = None,
) -> CommitInfo:

Import

from datasets import DatasetDict

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| repo_id | str | Yes | Repository ID in format <user>/<dataset_name> or <org>/<dataset_name>. |
| config_name | str | No | Configuration (subset) name. Defaults to "default". |
| set_default | bool | No | Whether to set this config as the default. |
| data_dir | str | No | Directory name for uploaded data files. |
| commit_message | str | No | Commit message. Defaults to "Upload dataset". |
| commit_description | str | No | Description for the commit or PR. |
| private | bool | No | Whether the repo is private. |
| token | str | No | Authentication token for the Hub. |
| revision | str | No | Branch to push to. Defaults to "main". |
| create_pr | bool | No | Whether to create a pull request. Defaults to False. |
| max_shard_size | int or str | No | Maximum shard size (e.g., "500MB"). |
| num_shards | dict[str, int] | No | Per-split number of shards, e.g., {"train": 128, "test": 4}. |
| embed_external_files | bool | No | Whether to embed Image/Audio/Video file bytes. Defaults to True. |
| num_proc | int | No | Number of processes for preparation and upload. |
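
To make the relationship between `max_shard_size` and the resulting shard count more concrete, here is a rough sketch of how a size limit could translate into a number of shards for one split. Everything in it is an assumption for illustration: `parse_size` and `estimate_num_shards` are hypothetical helpers, the "500MB" default and the decimal interpretation of "MB" are assumed, and the library's own calculation may differ in detail.

```python
import re

# Assumed unit table: decimal units (MB) and binary units (MiB).
_UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40,
}


def parse_size(size):
    """Hypothetical helper: convert 500_000_000 or "500MB" to a byte count."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", size)
    if match is None:
        raise ValueError(f"Unrecognized size: {size!r}")
    number, unit = match.groups()
    try:
        return int(number) * _UNITS[unit.upper()]
    except KeyError:
        raise ValueError(f"Unknown unit in size: {size!r}") from None


def estimate_num_shards(dataset_nbytes, max_shard_size="500MB", num_proc=None):
    """Hypothetical helper: enough shards that each stays under the limit.

    Assumes at least one shard per worker process when num_proc is set.
    """
    by_size = dataset_nbytes // parse_size(max_shard_size) + 1
    return max(by_size, num_proc or 1)


train_shards = estimate_num_shards(1_200_000_000)      # 1.2 GB split
small_shards = estimate_num_shards(10, num_proc=4)     # tiny split, 4 workers
```

Passing `num_shards` explicitly sidesteps any such estimate, which is useful when downstream consumers expect a fixed file count per split.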

Outputs

| Name | Type | Description |
|------|------|-------------|
| return | huggingface_hub.CommitInfo | Information about the commit that was created. |
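
The returned object can be inspected after the push, for example to log where the commit landed. The sketch below assumes the `oid`, `commit_url`, and `pr_url` attributes exposed by `huggingface_hub.CommitInfo`; `describe_commit` is a hypothetical helper, and a `SimpleNamespace` stands in for a real `CommitInfo` so the snippet runs without network access.

```python
from types import SimpleNamespace


def describe_commit(info):
    """Hypothetical helper: summarize a CommitInfo-like object as text."""
    lines = [f"commit {info.oid}", f"url: {info.commit_url}"]
    # pr_url is only set when the push opened a pull request.
    pr_url = getattr(info, "pr_url", None)
    if pr_url:
        lines.append(f"pull request: {pr_url}")
    return "\n".join(lines)


# Stand-in for the value returned by push_to_hub (placeholder data, not real output).
fake_info = SimpleNamespace(
    oid="abc123",
    commit_url="https://example.invalid/commit/abc123",
    pr_url=None,
)
summary = describe_commit(fake_info)
```

In real use, `describe_commit(dataset_dict.push_to_hub(...))` would replace the stand-in.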

Usage Examples

Basic Usage

from datasets import DatasetDict, Dataset

dataset_dict = DatasetDict({
    "train": Dataset.from_dict({"text": ["Hello", "World"], "label": [1, 0]}),
    "test": Dataset.from_dict({"text": ["Test"], "label": [1]}),
})

# Push all splits to the Hub
dataset_dict.push_to_hub("my-username/my-dataset")

# Push with per-split shard counts, keyed by split name.
# Counts are kept small here because the toy splits hold only one or two rows.
dataset_dict.push_to_hub(
    "my-username/my-dataset",
    num_shards={"train": 2, "test": 1},
)

# Push as a named configuration
dataset_dict.push_to_hub("my-username/my-dataset", config_name="en")
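
The `create_pr`, `commit_message`, and `commit_description` parameters from the signature can be combined to propose an update for review instead of committing directly. The following is a minimal sketch under that assumption: `push_for_review` is a hypothetical wrapper, and it relies on the returned `CommitInfo` exposing a `pr_url` attribute.

```python
def push_for_review(dataset_dict, repo_id, message="Upload dataset"):
    """Hypothetical wrapper: push all splits as a pull request.

    Returns the URL of the pull request opened on the Hub.
    """
    info = dataset_dict.push_to_hub(
        repo_id,
        commit_message=message,
        commit_description="Proposed update; please review before merging.",
        create_pr=True,
    )
    return info.pr_url


# Usage (requires network access and Hub authentication):
# pr_url = push_for_review(dataset_dict, "my-username/my-dataset")
```

Opening a pull request leaves the target branch unchanged until the proposal is merged, which suits shared datasets where maintainers want to inspect changes first.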

