Implementation:Huggingface Datasets DatasetDict Push To Hub
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the Hugging Face Datasets library, for publishing multi-split datasets to the Hugging Face Hub.
Description
DatasetDict.push_to_hub pushes all splits of a DatasetDict to the Hub as Parquet files in a single atomic commit. Each split is serialized into Parquet shards independently, but all additions, metadata updates, and old shard deletions are combined into one commit. The method validates feature consistency across splits, computes aggregate size statistics, generates or updates the dataset card YAML with configuration and split metadata, and supports per-split shard count configuration via a dictionary. The original split names from the DatasetDict keys are preserved on the Hub.
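The feature-consistency check described above can be approximated locally before pushing. The following is a minimal sketch, not the library's implementation: `find_feature_mismatches` is a hypothetical helper, and it assumes only that each split exposes a `features` attribute supporting equality comparison. The stand-in `FakeSplit` class and its string-typed feature values exist only so the example runs without the library installed.

```python
from typing import Any, Mapping


def find_feature_mismatches(splits: Mapping[str, Any]) -> list[str]:
    """Return the names of splits whose features differ from the first split's.

    Hypothetical helper for illustration only; it accepts any mapping whose
    values expose a `features` attribute.
    """
    names = list(splits)
    if not names:
        return []
    reference = splits[names[0]].features
    return [name for name in names[1:] if splits[name].features != reference]


class FakeSplit:
    """Stand-in for a dataset split (assumption: real splits expose `.features`)."""

    def __init__(self, features: dict[str, str]):
        self.features = features


splits = {
    "train": FakeSplit({"text": "string", "label": "int64"}),
    "test": FakeSplit({"text": "string", "label": "int64"}),
    "validation": FakeSplit({"text": "string"}),  # lacks the "label" column
}

# Collect every split that does not match the first split's features.
mismatched = find_feature_mismatches(splits)
```

Running a check like this before a push surfaces the mismatched split names up front, rather than after serialization has started.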
Usage
Use DatasetDict.push_to_hub to publish a complete multi-split dataset to the Hub in one operation. This is the standard method for sharing datasets that have train/test/validation splits.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/dataset_dict.py
- Lines: 1616-1983
Signature
def push_to_hub(
    self,
    repo_id,
    config_name: str = "default",
    set_default: Optional[bool] = None,
    data_dir: Optional[str] = None,
    commit_message: Optional[str] = None,
    commit_description: Optional[str] = None,
    private: Optional[bool] = None,
    token: Optional[str] = None,
    revision: Optional[str] = None,
    create_pr: Optional[bool] = False,
    max_shard_size: Optional[Union[int, str]] = None,
    num_shards: Optional[dict[str, int]] = None,
    embed_external_files: bool = True,
    num_proc: Optional[int] = None,
) -> CommitInfo:
Import
from datasets import DatasetDict
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo_id | str | Yes | Repository ID in format <user>/<dataset_name> or <org>/<dataset_name>. |
| config_name | str | No | Configuration (subset) name. Defaults to "default". |
| set_default | bool | No | Whether to set this config as the default. |
| data_dir | str | No | Directory name for uploaded data files. |
| commit_message | str | No | Commit message. Defaults to "Upload dataset". |
| commit_description | str | No | Description for the commit or PR. |
| private | bool | No | Whether the repo is private. |
| token | str | No | Authentication token for the Hub. |
| revision | str | No | Branch to push to. Defaults to "main". |
| create_pr | bool | No | Whether to create a pull request. Defaults to False. |
| max_shard_size | int or str | No | Maximum shard size (e.g., "500MB"). |
| num_shards | dict[str, int] | No | Per-split number of shards, e.g., {"train": 128, "test": 4}. |
| embed_external_files | bool | No | Whether to embed Image/Audio/Video file bytes. Defaults to True. |
| num_proc | int | No | Number of processes for preparation and upload. |
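When choosing values for num_shards, one option is to derive a count per split from an estimated split size and a target shard size. The sketch below is an illustration under stated assumptions: `plan_num_shards` is a hypothetical helper, the byte sizes are made-up estimates, and the library's own sharding logic may differ. It only shows how to build a dictionary of the shape that num_shards expects.

```python
def plan_num_shards(split_sizes: dict[str, int], target_shard_bytes: int) -> dict[str, int]:
    """Map each split name to a shard count based on an estimated byte size.

    Hypothetical helper: uses integer ceiling division and always returns
    at least one shard per split.
    """
    if target_shard_bytes <= 0:
        raise ValueError("target_shard_bytes must be positive")
    return {
        name: max(1, (size + target_shard_bytes - 1) // target_shard_bytes)
        for name, size in split_sizes.items()
    }


# Assumed, illustrative size estimates in bytes (not measured values).
estimated_sizes = {"train": 12_000_000_000, "test": 450_000_000}

# Aim for shards of roughly 500 MB each.
num_shards = plan_num_shards(estimated_sizes, target_shard_bytes=500_000_000)
```

The resulting dictionary could then be passed as `num_shards=num_shards` in a `push_to_hub` call.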
Outputs
| Name | Type | Description |
|---|---|---|
| return | huggingface_hub.CommitInfo | Information about the commit that was created. |
Usage Examples
Basic Usage
from datasets import DatasetDict, Dataset

dataset_dict = DatasetDict({
    "train": Dataset.from_dict({"text": ["Hello", "World"], "label": [1, 0]}),
    "test": Dataset.from_dict({"text": ["Test"], "label": [1]}),
})

# Push all splits to Hub
dataset_dict.push_to_hub("my-username/my-dataset")

# Push with per-split shard configuration
# (shard counts are illustrative; choose values suited to each split's size)
dataset_dict.push_to_hub(
    "my-username/my-dataset",
    num_shards={"train": 1024, "test": 8},
)

# Push as a named configuration
dataset_dict.push_to_hub("my-username/my-dataset", config_name="en")