Principle:Huggingface Datasets Dataset Hub Upload

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Publishing a single-split dataset to the Hugging Face Hub makes it accessible, version-controlled, and shareable through the Hub infrastructure.

Description

Dataset Hub upload is the process of converting a local Dataset into Parquet shards and uploading them to a Hugging Face Hub repository. The process handles repository creation (if needed), data serialization into self-contained Parquet files (with embedded images/audio/video bytes by default), shard size management, dataset card generation and updating (with YAML metadata for configurations and splits), and cleanup of old shards. The upload uses HTTP requests and does not require git or git-lfs to be installed. A single Dataset object represents one split; when pushing, the user specifies which split name the data corresponds to.
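The shard size management mentioned above can be pictured with a small planning sketch. It is not the library's code: `parse_size` and `plan_shards` are hypothetical helpers, and the 500 MB limit and the `data/<split>-XXXXX-of-XXXXX.parquet` naming pattern are assumptions made for illustration.

```python
import math


def parse_size(size: str) -> int:
    """Convert a size string such as '500MB' into bytes (decimal units)."""
    units = {"KB": 10**3, "MB": 10**6, "GB": 10**9}
    for suffix, factor in units.items():
        if size.upper().endswith(suffix):
            return int(float(size[: -len(suffix)]) * factor)
    return int(size)  # no suffix: treat the string as a plain byte count


def plan_shards(dataset_nbytes: int, split: str, max_shard_size: str = "500MB") -> list[str]:
    """Return hypothetical shard file names for one split.

    Assumption: shards are sized evenly, so the count is the dataset size
    divided by the limit, rounded up, with at least one shard.
    """
    limit = parse_size(max_shard_size)
    num_shards = max(1, math.ceil(dataset_nbytes / limit))
    return [
        f"data/{split}-{index:05d}-of-{num_shards:05d}.parquet"
        for index in range(num_shards)
    ]
```

Under these assumptions, a 1.2 GB train split with a 500 MB limit is planned as three shards, and lowering the limit yields more, smaller files.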

Usage

Use single-split Hub upload when you have a Dataset object (one split) that you want to publish to the Hub. This is the entry point for sharing individual splits, and can be called multiple times with different split names to build up a multi-split dataset on the Hub.
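A minimal sketch of building a multi-split dataset from repeated single-split pushes is given here. `push_splits` is a hypothetical helper, and the repository id in the commented usage is a placeholder. The helper only assumes that each object exposes `push_to_hub(repo_id, split=...)`, as a `Dataset` does.

```python
def push_splits(splits, repo_id, **push_kwargs):
    """Push each split with its own push_to_hub call.

    `splits` maps a split name to a Dataset-like object exposing
    push_to_hub(repo_id, split=...). Extra keyword arguments are
    forwarded unchanged to every call.
    """
    pushed = []
    for split_name, dataset in splits.items():
        dataset.push_to_hub(repo_id, split=split_name, **push_kwargs)
        pushed.append(split_name)
    return pushed


# Example usage (needs the `datasets` package, network access and a Hub token):
#   from datasets import Dataset
#   train = Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})
#   test = Dataset.from_dict({"text": ["c"], "label": [1]})
#   push_splits({"train": train, "test": test}, "your-username/demo-dataset")
```

Each call publishes one split under the given name, so the repository accumulates splits across calls.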

Theoretical Basis

The upload follows a staged commit pattern: (1) serialize the data to Parquet shards using PyArrow, (2) upload the shard files and register them as commit additions, (3) compute and merge metadata (DatasetInfo, split sizes, dataset card YAML), (4) mark obsolete shards for deletion, and (5) commit the additions, deletions, and metadata updates together. Grouping these changes into one commit means readers of the repository see either the old state or the new one, not a mix of stale and fresh shards; uploads with very many files may be split across several commits, in which case this guarantee holds per commit rather than for the whole push. Parquet is used because it is columnar, compressed, and carries schema metadata, which suits efficient storage and partial loading. Embedding external files (for Image/Audio/Video features) makes the Parquet files self-contained, so they do not depend on external file locations.
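The pairing of additions with deletions of obsolete shards can be illustrated with a toy model. It does not reproduce the library's logic: `plan_commit` is a hypothetical helper, and it assumes that a split's shards live under `data/` with names starting with the split name.

```python
def plan_commit(existing_files, new_shards, split):
    """Return (additions, deletions) for replacing one split's shards.

    Assumption: a file belongs to `split` if its path starts with
    'data/<split>-' and ends with '.parquet'. Files of other splits and
    non-shard files such as README.md are left untouched.
    """
    prefix = f"data/{split}-"
    new = set(new_shards)
    deletions = sorted(
        path
        for path in existing_files
        if path.startswith(prefix) and path.endswith(".parquet") and path not in new
    )
    additions = sorted(new)
    return additions, deletions
```

Both lists would go into the same commit, so a re-push that produces fewer shards never leaves stale shards of the same split next to the new ones.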

Related Pages

Implemented By

Implementation:Huggingface_Datasets_Dataset_Push_To_Hub

Uses Heuristic

Heuristic:Huggingface_Datasets_Parquet_Shard_Sizing
