Principle:Huggingface Datasets DatasetDict Hub Upload
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Publishing a multi-split dataset to the Hugging Face Hub uploads every split (for example train, validation, and test) in a single atomic commit with consistent metadata.
Description
Multi-split Hub upload, exposed as `DatasetDict.push_to_hub`, extends single-split publishing to a DatasetDict containing multiple splits. Each split is serialized independently into Parquet shards, but all new shards, metadata updates, and deletions of stale shards are combined into a single atomic commit. The Hub repository therefore moves directly from one consistent state to the next, with no intermediate state in which some splits are updated and others are not. Before uploading, the method validates that all splits share the same features. It then computes aggregate size statistics and generates a unified dataset card that lists every split.
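The staging-then-commit behaviour described above can be sketched with a toy model. This is a minimal illustration under stated assumptions, not the library's implementation: `FakeRepo`, `serialize_split`, and `upload_all_splits` are hypothetical names, JSON bytes stand in for Parquet, and the shard file names only imitate the `split-index-of-total` pattern.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FakeRepo:
    """Hypothetical in-memory stand-in for a Hub repository (not a real API)."""

    files: dict = field(default_factory=dict)
    commits: int = 0

    def commit(self, additions: dict, deletions: list) -> None:
        # Build the new state on a copy, then swap it in, so a reader never
        # observes a half-applied change.
        new_files = dict(self.files)
        for path in deletions:
            new_files.pop(path, None)
        new_files.update(additions)
        self.files = new_files
        self.commits += 1


def serialize_split(split: str, rows: list, rows_per_shard: int) -> dict:
    """Return {path: bytes} for one split. JSON stands in for Parquet here."""
    chunks = [
        rows[i:i + rows_per_shard] for i in range(0, len(rows), rows_per_shard)
    ] or [[]]
    total = len(chunks)
    return {
        f"data/{split}-{index:05d}-of-{total:05d}.parquet": json.dumps(chunk).encode()
        for index, chunk in enumerate(chunks)
    }


def upload_all_splits(repo: FakeRepo, splits: dict, rows_per_shard: int = 2) -> None:
    additions = {}
    for split, rows in splits.items():
        # Any failure here raises before the repository is touched.
        additions.update(serialize_split(split, rows, rows_per_shard))

    # Shards left over from an earlier upload of the same splits are stale.
    deletions = [
        path
        for path in repo.files
        if path.startswith("data/")
        and path.split("/")[1].split("-")[0] in splits
        and path not in additions
    ]

    # One metadata file describing every split, staged together with the shards.
    card = {split: len(rows) for split, rows in splits.items()}
    additions["README.md"] = json.dumps(card).encode()

    # Everything lands in a single commit, or nothing does.
    repo.commit(additions, deletions)
```

In this sketch a split that cannot be serialized raises inside the loop, so `repo.commit` is never reached and the previous repository state stays intact.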
Usage
Use multi-split Hub upload when you have a DatasetDict with multiple splits ready to publish. This is the preferred method for publishing complete datasets, as it ensures all splits are uploaded together with consistent metadata.
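A typical call might look like the sketch below. It assumes the `datasets` library is installed and that you are already authenticated with the Hub (for example through a stored token). The wrapper `publish_splits`, the repository id, and the commit message are hypothetical placeholders. The import is done lazily inside the function so the snippet can be loaded without the library present.

```python
def publish_splits(columns_by_split, repo_id, private=False, num_shards=None):
    """Build a DatasetDict from plain column dicts and push every split at once.

    columns_by_split: e.g. {"train": {"text": [...], "label": [...]}, "test": {...}}
    num_shards: optional per-split mapping such as {"train": 4, "test": 1}
    """
    # Lazy import: an illustrative choice, not something the library requires.
    from datasets import Dataset, DatasetDict

    dataset_dict = DatasetDict(
        {
            split: Dataset.from_dict(columns)
            for split, columns in columns_by_split.items()
        }
    )
    # One call publishes all splits; this performs network requests.
    return dataset_dict.push_to_hub(
        repo_id,
        private=private,
        num_shards=num_shards,
        commit_message="Upload all splits",  # placeholder message
    )


# Example call (commented out because it needs network access and credentials):
# publish_splits(
#     {
#         "train": {"text": ["a", "b"], "label": [0, 1]},
#         "test": {"text": ["c"], "label": [1]},
#     },
#     "your-username/your-dataset",  # hypothetical repository id
# )
```

Calling `push_to_hub` on each split's `Dataset` separately would also work, but it produces one commit per split, so the repository can briefly hold a mix of old and new splits.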
Theoretical Basis
The multi-split upload process extends the single-split pattern with two key additions: (1) feature consistency validation across splits before the upload begins, and (2) aggregation of split-level metadata (sizes, row counts) into a single DatasetInfo. Each split is serialized to Parquet independently, so each split can have its own shard count, but the results are accumulated and committed atomically. The atomic commit is critical for data integrity: if any split fails to serialize, no commit is created and the repository never shows a partial upload. The unified dataset card provides a single source of truth for the complete dataset structure.
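The two additions can be illustrated in isolation. The sketch below is a simplified model, not the library's code: `SplitStats`, `check_features`, and `aggregate_info` are hypothetical names, features are represented as plain `{column: dtype}` dicts, and the output keys only loosely mirror fields such as `splits` and `dataset_size` on DatasetInfo.

```python
from dataclasses import dataclass


@dataclass
class SplitStats:
    """Per-split numbers collected while a split is serialized (hypothetical)."""

    name: str
    num_examples: int
    num_bytes: int


def check_features(features_by_split: dict) -> None:
    """Raise ValueError unless every split declares the same features."""
    items = list(features_by_split.items())
    if not items:
        return
    ref_name, ref_features = items[0]
    for name, features in items[1:]:
        if features != ref_features:
            raise ValueError(
                f"Split {name!r} has features {features}, "
                f"but split {ref_name!r} has {ref_features}"
            )


def aggregate_info(stats: list) -> dict:
    """Fold split-level statistics into one dataset-level summary."""
    return {
        "splits": {
            s.name: {"num_examples": s.num_examples, "num_bytes": s.num_bytes}
            for s in stats
        },
        "dataset_size": sum(s.num_bytes for s in stats),
        "num_examples": sum(s.num_examples for s in stats),
    }


# Validation runs first, so a mismatch stops the upload before any work is done.
check_features(
    {
        "train": {"text": "string", "label": "int64"},
        "test": {"text": "string", "label": "int64"},
    }
)
summary = aggregate_info(
    [SplitStats("train", 100, 4000), SplitStats("test", 20, 800)]
)
```

Running the check before serialization keeps the failure cheap: a schema mismatch is reported before any shard is written or uploaded.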