Implementation:Huggingface Datasets Dataset Push To Hub

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Dataset.push_to_hub is the method in the Hugging Face Datasets library for publishing a single-split dataset to the Hugging Face Hub.

Description

Dataset.push_to_hub uploads a dataset to the Hub as Parquet files over HTTP. It creates the repository if needed, serializes the data into Parquet shards (with a configurable shard size), embeds the bytes of external files (images, audio, video) by default, generates or updates the dataset card's YAML metadata, removes shards left over from earlier pushes of the same split, and performs an atomic commit. The method supports multiple configurations (subsets), pushing to a branch, pull request creation, and token authentication. If split is not given, the dataset's own split name is used, falling back to "train" when the dataset has none.
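The sharding step described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper name plan_shard_paths is hypothetical, and the shard-count formula, the "data" directory fallback, and the file-name pattern are assumptions about how recent versions of the library lay out files. Check them against the version you have installed.

```python
from typing import List


def plan_shard_paths(
    dataset_nbytes: int,
    max_shard_bytes: int = 500_000_000,  # assumed default, i.e. "500MB"
    split: str = "train",
    config_name: str = "default",
) -> List[str]:
    """Hypothetical helper: predict the Parquet paths a push might create."""
    if dataset_nbytes < 0 or max_shard_bytes <= 0:
        raise ValueError("sizes must be non-negative, and the shard size positive")
    # Assumption: the default config lives under "data", other configs
    # under a directory named after the config.
    data_dir = "data" if config_name == "default" else config_name
    # Assumption: one shard per full max_shard_bytes, plus one for the remainder.
    num_shards = dataset_nbytes // max_shard_bytes + 1
    return [
        f"{data_dir}/{split}-{index:05d}-of-{num_shards:05d}.parquet"
        for index in range(num_shards)
    ]


if __name__ == "__main__":
    for path in plan_shard_paths(1_200_000_000):
        print(path)
```

Under these assumptions, a 1.2 GB dataset pushed with the default settings would be split into three shards under data/.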

Usage

Use Dataset.push_to_hub to publish a single-split dataset to the Hub. Call it with different split names to incrementally build a multi-split dataset.
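The incremental pattern can be wrapped in a small loop. The sketch below assumes only that each value in the mapping exposes push_to_hub(repo_id, split=..., **kwargs), as Dataset does; the helper name push_splits is hypothetical.

```python
from typing import Any, Dict, Mapping


def push_splits(splits: Mapping[str, Any], repo_id: str, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical helper: push each dataset under its own split name.

    Every call adds or replaces one split in the same repository, so the
    repository ends up holding all of the splits in the mapping.
    """
    results: Dict[str, Any] = {}
    for split_name, dataset in splits.items():
        # One push_to_hub call (and one commit) per split.
        results[split_name] = dataset.push_to_hub(repo_id, split=split_name, **kwargs)
    return results


# Example call (needs network access and a valid token, so it is not run here):
# push_splits({"train": train_ds, "test": test_ds}, "my-username/my-dataset")
```

When all splits are already in memory, wrapping them in a DatasetDict and calling its push_to_hub is usually the simpler route.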

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: 5662-6071

Signature

def push_to_hub(
    self,
    repo_id: str,
    config_name: str = "default",
    set_default: Optional[bool] = None,
    split: Optional[str] = None,
    data_dir: Optional[str] = None,
    commit_message: Optional[str] = None,
    commit_description: Optional[str] = None,
    private: Optional[bool] = None,
    token: Optional[str] = None,
    revision: Optional[str] = None,
    create_pr: Optional[bool] = False,
    max_shard_size: Optional[Union[int, str]] = None,
    num_shards: Optional[int] = None,
    embed_external_files: bool = True,
    num_proc: Optional[int] = None,
) -> CommitInfo:

Import

from datasets import Dataset

I/O Contract

Inputs

Name	Type	Required	Description
repo_id	`str`	Yes	Repository ID in format `<user>/<dataset_name>` or `<org>/<dataset_name>`.
config_name	`str`	No	Configuration (subset) name. Defaults to "default".
set_default	`bool`	No	Whether to set this config as the default.
split	`str`	No	Split name for the data. Defaults to dataset's split or "train".
data_dir	`str`	No	Directory name for uploaded data files. Defaults to config_name, or "data" when config_name is "default".
commit_message	`str`	No	Commit message. Defaults to "Upload dataset".
commit_description	`str`	No	Description for the commit or PR.
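private	`bool`	No	Whether to make the repo private. If None, the repo is public unless the organization's default is private. Ignored if the repo already exists.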
token	`str`	No	Authentication token for the Hub.
revision	`str`	No	Branch to push to. Defaults to "main".
create_pr	`bool`	No	Whether to create a pull request. Defaults to False.
max_shard_size	`int or str`	No	Maximum shard size, as a number of bytes or a string such as "500MB" (the default). Mutually exclusive with num_shards.
num_shards	`int`	No	Fixed number of shards. Mutually exclusive with max_shard_size.
embed_external_files	`bool`	No	Whether to embed Image/Audio/Video file bytes. Defaults to True.
num_proc	`int`	No	Number of processes for preparation and upload.
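Because max_shard_size and num_shards are mutually exclusive, it can help to validate them before starting an upload. The helper below is a hypothetical pre-flight check, not part of the library; it builds only the sharding keyword arguments and rejects the invalid combination early.

```python
from typing import Any, Dict, Optional, Union


def shard_kwargs(
    max_shard_size: Optional[Union[int, str]] = None,
    num_shards: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: return sharding kwargs for push_to_hub."""
    if max_shard_size is not None and num_shards is not None:
        raise ValueError("Pass either max_shard_size or num_shards, not both.")
    if num_shards is not None and num_shards < 1:
        raise ValueError("num_shards must be at least 1.")
    kwargs: Dict[str, Any] = {}
    if max_shard_size is not None:
        kwargs["max_shard_size"] = max_shard_size
    if num_shards is not None:
        kwargs["num_shards"] = num_shards
    # An empty dict leaves the library's default shard size in effect.
    return kwargs


# Example call (needs network access, so it is not run here):
# ds.push_to_hub("my-username/my-dataset", **shard_kwargs(num_shards=4))
```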

Outputs

Name	Type	Description
return	`huggingface_hub.CommitInfo`	Information about the commit that was created.
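The returned CommitInfo can be inspected to log what was pushed. The sketch below reads the oid, commit_url, and pr_url attributes of huggingface_hub.CommitInfo; the helper name describe_commit is hypothetical, and it accepts any object with those attributes so it can be tried without network access.

```python
from typing import Any


def describe_commit(info: Any) -> str:
    """Hypothetical helper: one-line summary of a push_to_hub result."""
    # pr_url is expected to be set only when the push opened a pull request.
    pr_url = getattr(info, "pr_url", None)
    target = pr_url if pr_url else info.commit_url
    kind = "pull request" if pr_url else "commit"
    return f"{kind} {info.oid[:8]}: {target}"


# Example call (needs network access, so it is not run here):
# info = ds.push_to_hub("my-username/my-dataset", create_pr=True)
# print(describe_commit(info))
```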

Usage Examples

Basic Usage

from datasets import Dataset

ds = Dataset.from_dict({"text": ["Hello", "World"], "label": [1, 0]})

# Push to Hub
ds.push_to_hub("my-username/my-dataset")

# Push with specific split and configuration
ds.push_to_hub("my-username/my-dataset", config_name="en", split="train")

# Push as private with shard size limit
ds.push_to_hub("my-username/my-dataset", private=True, max_shard_size="500MB")

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Dataset_Hub_Upload

Requires Environment

Environment:Huggingface_Datasets_Python_PyArrow_Core
