Implementation:Huggingface Datasets Dataset Push To Hub
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the Hugging Face Datasets library, for publishing a single-split dataset to the Hugging Face Hub.
Description
Dataset.push_to_hub pushes a dataset to the Hub as Parquet files using HTTP requests. It creates the repository if needed, serializes the data into Parquet shards (with configurable shard size), embeds external file bytes (images, audio, video) by default, generates or updates the dataset card with YAML metadata, cleans up old shards, and performs an atomic commit. The method supports multiple configurations (subsets), branching, pull request creation, and authentication via token. The split name defaults to the dataset's own split or "train" if unset.
Usage
Use Dataset.push_to_hub to publish a single-split dataset to the Hub. Call it with different split names to incrementally build a multi-split dataset.
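The incremental pattern described above can be sketched as a small loop. This is a minimal illustration, not part of the library: `push_splits` is a hypothetical helper, and `_RecordingDataset` is a stand-in used only so the example runs without network access or a token. With real `datasets.Dataset` objects the same loop would issue one `push_to_hub` call per split.

```python
from typing import Any, Dict, List


def push_splits(splits: Dict[str, Any], repo_id: str, **kwargs: Any) -> List[Any]:
    """Push each dataset under its own split name (hypothetical helper).

    `splits` maps a split name to any object exposing `push_to_hub`,
    such as a `datasets.Dataset`. Extra keyword arguments are forwarded as-is.
    """
    results = []
    for split_name, ds in splits.items():
        # One call per split; the split name comes from the mapping key.
        results.append(ds.push_to_hub(repo_id, split=split_name, **kwargs))
    return results


class _RecordingDataset:
    """Offline stand-in that records calls instead of contacting the Hub."""

    def __init__(self) -> None:
        self.calls = []

    def push_to_hub(self, repo_id: str, **kwargs: Any) -> str:
        self.calls.append((repo_id, kwargs))
        return f"{repo_id}:{kwargs['split']}"


train, test = _RecordingDataset(), _RecordingDataset()
pushed = push_splits(
    {"train": train, "test": test},
    "my-username/my-dataset",
    config_name="en",
)
```

Passing `config_name` through the keyword arguments keeps every split in the same configuration, which is what lets the separate pushes accumulate into one multi-split dataset.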
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: 5662-6071
Signature
    def push_to_hub(
        self,
        repo_id: str,
        config_name: str = "default",
        set_default: Optional[bool] = None,
        split: Optional[str] = None,
        data_dir: Optional[str] = None,
        commit_message: Optional[str] = None,
        commit_description: Optional[str] = None,
        private: Optional[bool] = None,
        token: Optional[str] = None,
        revision: Optional[str] = None,
        create_pr: Optional[bool] = False,
        max_shard_size: Optional[Union[int, str]] = None,
        num_shards: Optional[int] = None,
        embed_external_files: bool = True,
        num_proc: Optional[int] = None,
    ) -> CommitInfo:
Import
from datasets import Dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo_id | str | Yes | Repository ID in format <user>/<dataset_name> or <org>/<dataset_name>. |
| config_name | str | No | Configuration (subset) name. Defaults to "default". |
| set_default | bool | No | Whether to set this config as the default. |
| split | str | No | Split name for the data. Defaults to dataset's split or "train". |
| data_dir | str | No | Directory name for uploaded data files. Defaults based on config_name. |
| commit_message | str | No | Commit message. Defaults to "Upload dataset". |
| commit_description | str | No | Description for the commit or PR. |
| private | bool | No | Whether the repo is private. |
| token | str | No | Authentication token for the Hub. |
| revision | str | No | Branch to push to. Defaults to "main". |
| create_pr | bool | No | Whether to create a pull request. Defaults to False. |
| max_shard_size | int or str | No | Maximum shard size (e.g., "500MB"). Mutually exclusive with num_shards. |
| num_shards | int | No | Fixed number of shards. Mutually exclusive with max_shard_size. |
| embed_external_files | bool | No | Whether to embed Image/Audio/Video file bytes. Defaults to True. |
| num_proc | int | No | Number of processes for preparation and upload. |
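The table states that max_shard_size and num_shards are mutually exclusive. The sketch below shows one way to check that constraint before calling push_to_hub and to get a rough idea of how many shards a size limit implies. It is an illustration under stated assumptions, not the library's implementation: `size_to_bytes` and `plan_shards` are hypothetical helpers, only decimal units are handled, and the "500MB" fallback is an assumption made for this example.

```python
import math
from typing import Optional, Union

# Decimal units only; an assumption made for this sketch.
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def size_to_bytes(size: Union[int, str]) -> int:
    """Convert an int or a string such as "500MB" to bytes (hypothetical helper)."""
    if isinstance(size, int):
        return size
    text = size.strip().upper()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    raise ValueError(f"Unsupported size string: {size!r}")


def plan_shards(
    dataset_nbytes: int,
    max_shard_size: Optional[Union[int, str]] = None,
    num_shards: Optional[int] = None,
) -> int:
    """Return a shard count, rejecting the mutually exclusive combination."""
    if max_shard_size is not None and num_shards is not None:
        raise ValueError("Specify either max_shard_size or num_shards, not both.")
    if num_shards is not None:
        return num_shards
    # Assumed fallback for this sketch when no limit is given.
    limit = size_to_bytes(max_shard_size if max_shard_size is not None else "500MB")
    # Rough estimate: at least one shard, rounded up to fit the size limit.
    return max(1, math.ceil(dataset_nbytes / limit))
```

For instance, `plan_shards(1_200_000_000, max_shard_size="500MB")` rounds 2.4 up to 3 shards, while passing both arguments raises a `ValueError` before any upload is attempted.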
Outputs
| Name | Type | Description |
|---|---|---|
| return | huggingface_hub.CommitInfo | Information about the commit that was created. |
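The returned object can be inspected after the push, for example to log where the commit landed. The following is a minimal sketch: `summarize_commit` is a hypothetical helper, and the `SimpleNamespace` stand-in carries made-up placeholder values so the example runs offline. The attribute names `commit_url`, `oid`, and `pr_url` are those of `huggingface_hub.CommitInfo`.

```python
from types import SimpleNamespace
from typing import Any, Dict


def summarize_commit(info: Any) -> Dict[str, Any]:
    """Collect a few fields from a CommitInfo-like object (hypothetical helper)."""
    return {
        "commit_url": info.commit_url,
        "oid": info.oid,
        # pr_url is only populated when the commit was made as a pull request.
        "pr_url": getattr(info, "pr_url", None),
    }


# Stand-in with placeholder values; a real call would look like
#   info = ds.push_to_hub("my-username/my-dataset")
fake_info = SimpleNamespace(
    commit_url="https://huggingface.co/datasets/my-username/my-dataset/commit/abc123",
    oid="abc123",
    pr_url=None,
)
summary = summarize_commit(fake_info)
```

When `create_pr=True` is passed to push_to_hub, `pr_url` is the field to check for the pull request location.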
Usage Examples
Basic Usage
    from datasets import Dataset

    ds = Dataset.from_dict({"text": ["Hello", "World"], "label": [1, 0]})

    # Push to Hub
    ds.push_to_hub("my-username/my-dataset")

    # Push with specific split and configuration
    ds.push_to_hub("my-username/my-dataset", config_name="en", split="train")

    # Push as private with shard size limit
    ds.push_to_hub("my-username/my-dataset", private=True, max_shard_size="500MB")