Principle:Huggingface Datasets DatasetDict Hub Upload

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

Publishing a multi-split dataset to the Hugging Face Hub uploads all of its splits (for example train, validation, and test) in a single atomic commit with consistent metadata.

Description

Multi-split Hub upload extends single-split publishing to a DatasetDict, a mapping from split names to datasets. Each split is serialized independently into Parquet shards, but all shard additions, deletions of stale shards, and metadata updates are combined into a single atomic commit. The Hub repository therefore moves directly from one consistent state to the next, with no intermediate state in which some splits are updated and others are not. Before uploading, the method checks that all splits share the same features. It then computes aggregate size statistics and generates one dataset card that records split information for every split.
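The accumulate-then-commit pattern described above can be sketched in plain Python. This is an illustration only, not the library's implementation: `FakeRepo`, `serialize_split`, and `push_all_splits` are hypothetical names, the repository is an in-memory dictionary, shards hold Python lists rather than Parquet bytes, and the shard file names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRepo:
    """In-memory stand-in for a Hub repository (hypothetical, for illustration)."""
    files: dict = field(default_factory=dict)
    commits: int = 0

    def commit(self, additions, deletions):
        # Build the new state on a copy, then swap it in with one assignment,
        # so a reader of `files` never observes a half-applied change.
        new_files = dict(self.files)
        for path in deletions:
            new_files.pop(path, None)
        new_files.update(additions)
        self.files = new_files
        self.commits += 1


def serialize_split(name, rows, rows_per_shard=2):
    """Cut one split into named shards; real code would write Parquet bytes."""
    if not rows:
        raise ValueError(f"split {name!r} is empty")
    chunks = [rows[i:i + rows_per_shard] for i in range(0, len(rows), rows_per_shard)]
    total = len(chunks)
    return {
        f"data/{name}-{i:05d}-of-{total:05d}.parquet": chunk
        for i, chunk in enumerate(chunks)
    }


def push_all_splits(repo, splits):
    additions = {}
    # Phase 1: serialize every split. Any failure here raises before the
    # repository has been touched.
    for name, rows in splits.items():
        additions.update(serialize_split(name, rows))
    # Phase 2: collect stale shards of the splits being replaced, then apply
    # additions and deletions together in one commit.
    prefixes = tuple(f"data/{name}-" for name in splits)
    deletions = [p for p in repo.files if p.startswith(prefixes) and p not in additions]
    repo.commit(additions, deletions)
    return sorted(additions)
```

Because serialization finishes for every split before `commit` is called, a failure in any one split leaves `repo.files` exactly as it was.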

Usage

Use multi-split Hub upload when you have a DatasetDict with multiple splits ready to publish. This is the preferred method for publishing complete datasets, as it ensures all splits are uploaded together with consistent metadata.
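A minimal usage sketch follows, assuming the `datasets` library is installed and a Hub token is configured. `DatasetDict.push_to_hub` is the library method; the wrapper `publish_all_splits`, the helper `build_example`, the toy rows, and the repository id are hypothetical and exist only for illustration.

```python
def publish_all_splits(dataset_dict, repo_id, private=True, commit_message="Upload dataset"):
    """Push every split of a DatasetDict-like object with one call."""
    if len(dataset_dict) == 0:
        raise ValueError("nothing to publish: the DatasetDict has no splits")
    # One call on the DatasetDict covers all splits; there is no per-split loop.
    return dataset_dict.push_to_hub(
        repo_id,
        private=private,
        commit_message=commit_message,
    )


def build_example():
    """Build a tiny two-split DatasetDict (toy data, for illustration)."""
    # Imported lazily so the wrapper above can be read and tested
    # without the library installed.
    from datasets import Dataset, DatasetDict

    return DatasetDict({
        "train": Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]}),
        "test": Dataset.from_dict({"text": ["c"], "label": [1]}),
    })


# Usage (needs network access and write permission on the target repository):
# publish_all_splits(build_example(), "your-username/your-dataset")
```

Calling `push_to_hub` on each split's Dataset separately would produce one commit per split, which is the intermediate-state situation the multi-split method avoids.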

Theoretical Basis

The multi-split upload process extends the single-split pattern with two key additions: (1) feature consistency validation across splits before upload begins, and (2) aggregation of split-level metadata (sizes, row counts) into a single DatasetInfo. Each split is processed independently for Parquet serialization (allowing per-split shard counts), but the results are accumulated and committed atomically. The atomic commit is critical for data integrity: if any split fails to serialize, no partial upload pollutes the repository. The unified dataset card provides a single source of truth for the complete dataset structure.
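The two additions named above, feature validation and metadata aggregation, can be sketched as follows. This is a simplified model under stated assumptions, not the library's code: schemas are plain dictionaries, the result is a dictionary loosely modeled on DatasetInfo, and `check_features_match`, `aggregate_info`, and the `total_examples` key are hypothetical.

```python
def check_features_match(split_features):
    """Raise if any split's schema differs from the first split's schema."""
    if not split_features:
        raise ValueError("no splits to validate")
    items = iter(split_features.items())
    ref_name, ref = next(items)
    for name, features in items:
        if features != ref:
            raise ValueError(
                f"features of split {name!r} differ from split {ref_name!r}: "
                f"{features} != {ref}"
            )


def aggregate_info(split_features, split_stats):
    """Combine per-split statistics into one dataset-level summary."""
    # Validation runs first, so mismatched splits never reach the upload step.
    check_features_match(split_features)
    splits = {name: dict(stats) for name, stats in split_stats.items()}
    return {
        "features": next(iter(split_features.values())),
        "splits": splits,
        "dataset_size": sum(s["num_bytes"] for s in splits.values()),
        "total_examples": sum(s["num_examples"] for s in splits.values()),
    }


features = {
    "train": {"text": "string", "label": "int64"},
    "test": {"text": "string", "label": "int64"},
}
stats = {
    "train": {"num_bytes": 1000, "num_examples": 80},
    "test": {"num_bytes": 250, "num_examples": 20},
}
info = aggregate_info(features, stats)
print(info["dataset_size"], info["total_examples"])  # prints: 1250 100
```

Keeping per-split entries alongside the totals lets a single summary serve both as the dataset-level record and as the source for the split table on the dataset card.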

Related Pages

Implemented By

Uses Heuristic
