Implementation:Huggingface Datasets Dataset Push To Hub

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

Dataset.push_to_hub is the concrete tool, provided by the Hugging Face Datasets library, for publishing a single-split dataset to the Hugging Face Hub.

Description

Dataset.push_to_hub uploads a dataset to the Hub as Parquet files over HTTP. It creates the repository if it does not exist, serializes the data into Parquet shards (the shard size is configurable), embeds the bytes of external files (images, audio, video) by default, generates or updates the dataset card with its YAML metadata, cleans up old shards left by earlier pushes, and performs an atomic commit. The method also supports multiple configurations (subsets), pushing to a branch, opening a pull request, and token-based authentication. If split is not given, the dataset's own split name is used, falling back to "train".
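
The effect of a shard size limit can be reasoned about with simple arithmetic. The sketch below is not the library's code: parse_size and estimate_num_shards are hypothetical helpers, the unit table is a simplification (decimal MB, binary MiB, case-insensitive), and ceiling division is an assumed rule for turning a size limit into a shard count.

```python
import re

# Hypothetical unit table (bytes per unit); a simplification for illustration.
_UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30,
}


def parse_size(size):
    """Convert an int (bytes) or a string such as "500MB" into a byte count."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", size)
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"Unrecognized size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def estimate_num_shards(dataset_nbytes, max_shard_size):
    """Estimate how many shards keep each shard at or under max_shard_size."""
    limit = parse_size(max_shard_size)
    if limit <= 0:
        raise ValueError("max_shard_size must be positive")
    # Integer ceiling division; always produce at least one shard.
    return max(1, (dataset_nbytes + limit - 1) // limit)
```

Under these assumptions, a dataset of about 1.2 GB with a "500MB" limit would be split into three shards. The actual shard count chosen by the library may differ.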

Usage

Use Dataset.push_to_hub to publish a single-split dataset to the Hub. To build a multi-split dataset incrementally, call it once per split on the same repo_id, passing a different split name each time.
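
A minimal sketch of the incremental multi-split pattern follows. push_splits is a hypothetical helper, not part of the Datasets library; it only assumes that each value in the mapping exposes a push_to_hub method accepting repo_id and split, as in the signature documented on this page.

```python
def push_splits(splits, repo_id, **push_kwargs):
    """Push each split in `splits` to the same repo, one call per split.

    `splits` maps a split name to an object exposing `push_to_hub`
    (for example a `datasets.Dataset`). Returns the per-split results.
    """
    results = {}
    for split_name, dataset in splits.items():
        # Each call uploads only this split; the other splits are separate calls.
        results[split_name] = dataset.push_to_hub(
            repo_id, split=split_name, **push_kwargs
        )
    return results


# Example call (needs the `datasets` package, network access, and Hub credentials):
# from datasets import Dataset
# train = Dataset.from_dict({"text": ["Hello"], "label": [1]})
# test = Dataset.from_dict({"text": ["World"], "label": [0]})
# push_splits({"train": train, "test": test}, "my-username/my-dataset")
```

Extra keyword arguments such as private or config_name are forwarded unchanged to every call, so all splits land in the same configuration.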

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: 5662-6071

Signature

def push_to_hub(
    self,
    repo_id: str,
    config_name: str = "default",
    set_default: Optional[bool] = None,
    split: Optional[str] = None,
    data_dir: Optional[str] = None,
    commit_message: Optional[str] = None,
    commit_description: Optional[str] = None,
    private: Optional[bool] = None,
    token: Optional[str] = None,
    revision: Optional[str] = None,
    create_pr: Optional[bool] = False,
    max_shard_size: Optional[Union[int, str]] = None,
    num_shards: Optional[int] = None,
    embed_external_files: bool = True,
    num_proc: Optional[int] = None,
) -> CommitInfo:

Import

from datasets import Dataset

I/O Contract

Inputs

Name                  Type        Required  Description
repo_id               str         Yes       Repository ID in format <user>/<dataset_name> or <org>/<dataset_name>.
config_name           str         No        Configuration (subset) name. Defaults to "default".
set_default           bool        No        Whether to set this config as the default.
split                 str         No        Split name for the data. Defaults to dataset's split or "train".
data_dir              str         No        Directory name for uploaded data files. Defaults based on config_name.
commit_message        str         No        Commit message. Defaults to "Upload dataset".
commit_description    str         No        Description for the commit or PR.
private               bool        No        Whether the repo is private.
token                 str         No        Authentication token for the Hub.
revision              str         No        Branch to push to. Defaults to "main".
create_pr             bool        No        Whether to create a pull request. Defaults to False.
max_shard_size        int or str  No        Maximum shard size (e.g., "500MB"). Mutually exclusive with num_shards.
num_shards            int         No        Fixed number of shards. Mutually exclusive with max_shard_size.
embed_external_files  bool        No        Whether to embed Image/Audio/Video file bytes. Defaults to True.
num_proc              int         No        Number of processes for preparation and upload.
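
The constraints listed in the table can be checked before any upload is attempted. The following is a sketch only: check_push_args is a hypothetical helper, its error messages are invented, and the repo_id check simply mirrors the <user>/<dataset_name> format stated above rather than the Hub's full validation rules.

```python
def check_push_args(repo_id, max_shard_size=None, num_shards=None):
    """Raise ValueError if the arguments contradict the documented constraints."""
    # repo_id is expected to look like "<user-or-org>/<dataset_name>".
    namespace, sep, name = repo_id.partition("/")
    if not (sep and namespace and name) or "/" in name:
        raise ValueError(f"repo_id should look like 'user/dataset', got {repo_id!r}")
    # The two sharding options are documented as mutually exclusive.
    if max_shard_size is not None and num_shards is not None:
        raise ValueError("Pass either max_shard_size or num_shards, not both.")
    if num_shards is not None and num_shards < 1:
        raise ValueError("num_shards must be a positive integer.")
```

Running a check like this first avoids starting a long serialization step with arguments that cannot be combined.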

Outputs

Name    Type                        Description
return  huggingface_hub.CommitInfo  Information about the commit that was created.
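
The returned object can be inspected after a push, for example to log where the data went. In this sketch summarize_commit is a hypothetical helper; it assumes the returned object exposes oid, commit_url, and pr_url attributes, and it falls back gracefully if any of them is missing.

```python
def summarize_commit(info):
    """Return a short one-line summary of a commit-info-like object."""
    oid = getattr(info, "oid", None)
    # Prefer the pull request URL when the push opened a PR.
    url = getattr(info, "pr_url", None) or getattr(info, "commit_url", None)
    short_oid = oid[:7] if oid else "unknown"
    return f"{short_oid} {url or 'no-url'}"


# Example call (needs network access and Hub credentials):
# info = ds.push_to_hub("my-username/my-dataset")
# print(summarize_commit(info))
```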

Usage Examples

Basic Usage

from datasets import Dataset

ds = Dataset.from_dict({"text": ["Hello", "World"], "label": [1, 0]})

# Push to Hub
ds.push_to_hub("my-username/my-dataset")

# Push with specific split and configuration
ds.push_to_hub("my-username/my-dataset", config_name="en", split="train")

# Push as private with shard size limit
ds.push_to_hub("my-username/my-dataset", private=True, max_shard_size="500MB")
