Workflow:Huggingface Datasets Hub Publishing

From Leeroopedia
Revision as of 11:04, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Huggingface_Datasets_Hub_Publishing.md)
Knowledge Sources
Domains: Data_Engineering, MLOps, Data_Sharing
Last Updated: 2026-02-14 18:00 GMT

Overview

End-to-end process for creating datasets from local data sources and publishing them to the Hugging Face Hub for sharing, versioning, and community access.

Description

This workflow covers the complete journey from raw local data to a published dataset on the Hugging Face Hub. It includes constructing Dataset objects from Python dictionaries, pandas DataFrames, or generator functions, configuring dataset metadata (features, descriptions, citations), and pushing the dataset to the Hub as Parquet files. The push operation handles repository creation, authentication, data sharding for large datasets, embedding of external media files (images, audio, video), and dataset card generation. No git or git-lfs installation is required, because uploads go through the Hub's HTTP API.

Usage

Use this workflow when you have created a new dataset (from experiments, annotations, web scraping, data augmentation, etc.) and want to share it publicly or privately on the Hugging Face Hub. It also applies when updating an existing Hub dataset with new data, adding new configurations or splits, or migrating datasets from other platforms to the Hub.

Execution Steps

Step 1: Construct the Dataset Object

Create a Dataset from your raw data source. The library supports construction from Python dictionaries (from_dict), pandas DataFrames (from_pandas), Python generators (from_generator), and lists of record dictionaries (from_list). Choose the method that best matches your data format.

Key considerations:

  • from_dict accepts a mapping of column names to lists of values
  • from_pandas converts a DataFrame directly, preserving column types where possible
  • from_generator consumes a Python generator row by row and writes the rows to an on-disk Arrow cache, so very large sources never have to fit in memory at once
  • from_list creates a dataset from a list of dictionaries (one per row)
  • Specify a Features schema to enforce exact column types, especially for ClassLabel, Image, Audio, and other special types

Step 2: Define the Feature Schema

Specify the exact feature type for each column to ensure proper encoding, storage, and Hub compatibility. The feature system supports primitive types declared through Value (integer, float, string, and bool dtypes such as int32 or float32), structured types (ClassLabel, Sequence, list), and media types (Image, Audio, Video, Pdf, Nifti) with custom encoding/decoding logic.

Key considerations:

  • ClassLabel enables automatic label-to-integer encoding with a defined label set
  • Image, Audio, and Video features handle file paths, bytes, and lazy decoding
  • Sequence and List types support variable-length and fixed-length collections
  • Features can be nested (e.g., Sequence of dicts, lists of Images)
  • The schema is serialized as part of the dataset metadata on the Hub

Step 3: Organize Splits and Configurations

Structure the data into appropriate splits (train, validation, test) and optionally organize related variants as named configurations. Multiple splits can be managed together using DatasetDict, and multiple configurations allow different subsets or versions of the same dataset to coexist in one repository.

What happens:

  • Create a DatasetDict mapping split names to Dataset objects
  • Each configuration gets its own data directory in the repository
  • One configuration can be designated as the default for load_dataset calls
  • Split metadata is stored alongside the data for automatic discovery

Step 4: Configure Hub Metadata

Prepare dataset metadata including the dataset description, citation information, license, and any tags that help with discoverability. This metadata is stored in a dataset card (README.md) in the repository and displayed on the Hub page.

Key considerations:

  • The DatasetInfo object holds description, citation, homepage, license, and features
  • Tags and language information improve Hub searchability
  • A well-structured dataset card follows the Hugging Face dataset card template
  • Metadata can be attached by passing a DatasetInfo object through the info argument of the construction methods, or by editing the dataset card afterwards

Step 5: Push to the Hugging Face Hub

Upload the dataset to the Hub using the push_to_hub method. This creates the repository if it does not exist, converts the data to Parquet format, shards large datasets into manageable files, embeds external media content, and commits the files through the Hub HTTP API (an upload with many shards may be split across several commits).

Key considerations:

  • Authentication uses a Hub token, passed explicitly through the token parameter, read from the HF_TOKEN environment variable, or taken from a cached login
  • Data is uploaded as Parquet files (no git-lfs required)
  • max_shard_size controls file splitting for large datasets (default 500MB per shard)
  • embed_external_files converts file path references to inline bytes for portability
  • create_pr=True creates a pull request instead of committing directly
  • private=True makes the repository private when it is created; the setting is ignored if the repository already exists
  • num_proc enables parallel Parquet writing for faster uploads
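
The options above can be combined in a small wrapper such as the one below. This is a sketch rather than a prescribed interface: the publish helper and its defaults are hypothetical, and calling it on a real Dataset or DatasetDict performs network requests against the Hub.

```python
import os


def publish(dataset, repo_id, private=True, token=None):
    """Push a Dataset or DatasetDict to the Hub (hypothetical helper).

    Falls back to the HF_TOKEN environment variable when no token is given;
    if that is also unset, the library uses a cached login if one exists.
    """
    token = token or os.environ.get("HF_TOKEN")
    return dataset.push_to_hub(
        repo_id,
        config_name="default",      # name of the configuration to write
        private=private,            # only applied when the repo is created
        token=token,
        max_shard_size="500MB",     # upper bound per Parquet shard
        embed_external_files=True,  # inline media bytes instead of paths
        create_pr=False,            # True opens a pull request instead
    )
```

A call would look like publish(dataset_dict, "your-username/your-dataset"), where the repository id is a placeholder for a namespace you can write to.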

Step 6: Verify the Published Dataset

After publishing, verify the dataset is correctly accessible by loading it back from the Hub and inspecting the result. Check that all splits, configurations, features, and example data are intact and match the original.

Key considerations:

  • Load the published dataset to verify round-trip integrity
  • Check the Hub page for correct dataset card rendering
  • Verify the Dataset Viewer shows expected preview data
  • Test that configurations and splits are properly discoverable
