Principle:Huggingface Datasets DatasetDict Hub Upload

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Publishing a multi-split dataset to the Hugging Face Hub uploads all of its splits (for example train, validation, and test) in a single atomic commit with consistent metadata.

Description

Multi-split Hub upload extends single-split publishing to a DatasetDict that holds several splits. Each split is serialized independently into Parquet shards, but all shard uploads, metadata updates, and deletions of outdated shards are combined into a single atomic commit. The Hub repository therefore moves directly from one consistent state to the next, with no intermediate state in which some splits are updated and others are not. Before uploading, the method validates that all splits share the same features; it then computes aggregate size statistics and generates one dataset card that records split information for every split.

Usage

Use multi-split Hub upload when you have a DatasetDict with multiple splits ready to publish. This is the preferred method for publishing complete datasets, as it ensures all splits are uploaded together with consistent metadata.
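As a minimal sketch of this usage, the example below builds a small three-split DatasetDict and publishes it with `DatasetDict.push_to_hub`. The helper names `build_example_splits` and `publish_splits`, the toy columns, the repository id, and the `"500MB"` shard size are illustrative assumptions, not values prescribed by the library.

```python
def build_example_splits():
    """Build a toy DatasetDict whose three splits share the same features."""
    # Imported here so the publishing helper below does not require the
    # library until a real DatasetDict is actually built.
    from datasets import Dataset, DatasetDict

    return DatasetDict(
        {
            "train": Dataset.from_dict({"text": ["a", "b", "c"], "label": [0, 1, 0]}),
            "validation": Dataset.from_dict({"text": ["d"], "label": [1]}),
            "test": Dataset.from_dict({"text": ["e"], "label": [0]}),
        }
    )


def publish_splits(dataset_dict, repo_id, max_shard_size="500MB"):
    """Publish every split of `dataset_dict` to `repo_id` in one call.

    `repo_id` is a placeholder such as "your-username/your-dataset";
    the shard size is an assumed example value.
    """
    return dataset_dict.push_to_hub(repo_id, max_shard_size=max_shard_size)
```

Calling `publish_splits(build_example_splits(), "your-username/your-dataset")` needs network access and a Hub token with write permission (for instance after `huggingface-cli login`). Pushing the DatasetDict as a whole, rather than looping over splits and pushing each one, is what keeps the splits and their metadata in step.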

Theoretical Basis

The multi-split upload process extends the single-split pattern with two key additions: (1) feature consistency validation across splits before upload begins, and (2) aggregation of split-level metadata (sizes, row counts) into a single DatasetInfo. Each split is processed independently for Parquet serialization (allowing per-split shard counts), but the results are accumulated and committed atomically. The atomic commit is critical for data integrity: if any split fails to serialize, no partial upload pollutes the repository. The unified dataset card provides a single source of truth for the complete dataset structure.
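The flow described above can be sketched in plain Python. This is a simplified model under stated assumptions, not the library's implementation: `SplitSummary`, `plan_multi_split_commit`, the one-shard-per-split layout, and the file names are all hypothetical, chosen only to show validation first, staging second, and a single commit payload at the end.

```python
from dataclasses import dataclass


@dataclass
class SplitSummary:
    """Hypothetical stand-in for one split: its schema plus size statistics."""

    features: dict  # column name -> type name, e.g. {"text": "string"}
    num_examples: int
    num_bytes: int


def plan_multi_split_commit(splits):
    """Validate, stage, and aggregate; return (operations, info) for one commit."""
    if not splits:
        raise ValueError("at least one split is required")

    # (1) Feature consistency: every split must match the first split's schema.
    first_name = next(iter(splits))
    reference = splits[first_name].features
    for name, split in splits.items():
        if split.features != reference:
            raise ValueError(
                f"features of split '{name}' differ from split '{first_name}'"
            )

    # (2) Stage per-split work. Nothing is sent while this loop runs, so a
    # failure here leaves the remote repository untouched.
    operations = []
    split_infos = {}
    for name, split in splits.items():
        # Illustrative file name; one shard per split is assumed for brevity.
        operations.append(("add", f"data/{name}-00000-of-00001.parquet"))
        split_infos[name] = {
            "num_examples": split.num_examples,
            "num_bytes": split.num_bytes,
        }

    # Aggregate split-level metadata into one info record for the dataset card.
    info = {
        "features": reference,
        "splits": split_infos,
        "dataset_size": sum(s.num_bytes for s in splits.values()),
    }
    return operations, info
```

Because the function returns the complete operation list together with the aggregated info, a caller can submit both in one commit or submit nothing at all, which mirrors the all-or-nothing behavior described above.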

Related Pages

Implemented By

Implementation:Huggingface_Datasets_DatasetDict_Push_To_Hub

Uses Heuristic

Heuristic:Huggingface_Datasets_Parquet_Shard_Sizing
