Workflow:Huggingface Datasets Hub Publishing
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, MLOps, Data_Sharing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
End-to-end process for creating datasets from local data sources and publishing them to the Hugging Face Hub for sharing, versioning, and community access.
Description
This workflow covers the complete journey from raw local data to a published dataset on the Hugging Face Hub. It includes constructing Dataset objects from Python dictionaries, pandas DataFrames, or generator functions, configuring dataset metadata (features, descriptions, citations), and pushing the dataset to the Hub as Parquet files. The push operation handles repository creation, authentication, data sharding for large datasets, embedding of external media files (images, audio, video), and dataset card generation. No git or git-lfs installation is required as uploads use direct HTTP API calls.
Usage
Execute this workflow when you have created a new dataset (from experiments, annotations, web scraping, data augmentation, etc.) and want to share it publicly or privately on the Hugging Face Hub. Also applicable when updating an existing Hub dataset with new data, adding new configurations or splits, or migrating datasets from other platforms to the Hub ecosystem.
Execution Steps
Step 1: Construct the Dataset Object
Create a Dataset from your raw data source. The library supports construction from Python dictionaries (from_dict), pandas DataFrames (from_pandas), Python generators (from_generator), and lists of record dictionaries (from_list). Choose the method that best matches your data format.
Key considerations:
- from_dict accepts a mapping of column names to lists of values
- from_pandas converts a DataFrame directly, preserving column types where possible
- from_generator consumes a Python generator that yields one example at a time, so a very large dataset does not have to be held in memory as a single object
- from_list creates a dataset from a list of dictionaries (one per row)
- Specify a Features schema to enforce exact column types, especially for ClassLabel, Image, Audio, and other special types
Step 2: Define the Feature Schema
Specify the exact feature types for each column to ensure proper encoding, storage, and Hub compatibility. The feature system supports primitive types declared through Value (int, float, string, bool), structured types (ClassLabel, Sequence, list), and media types (Image, Audio, Video, Pdf, Nifti) with custom encoding/decoding logic.
Key considerations:
- ClassLabel enables automatic label-to-integer encoding with a defined label set
- Image, Audio, and Video features handle file paths, bytes, and lazy decoding
- Sequence and List types support variable-length and fixed-length collections
- Features can be nested (e.g., Sequence of dicts, lists of Images)
- The schema is serialized as part of the dataset metadata on the Hub
Step 3: Organize Splits and Configurations
Structure the data into appropriate splits (train, validation, test) and optionally organize related variants as named configurations. Multiple splits can be managed together using DatasetDict, and multiple configurations allow different subsets or versions of the same dataset to coexist in one repository.
What happens:
- Create a DatasetDict mapping split names to Dataset objects
- Each configuration gets its own data directory in the repository
- One configuration can be designated as the default for load_dataset calls
- Split metadata is stored alongside the data for automatic discovery
Step 4: Configure Hub Metadata
Prepare dataset metadata including the dataset description, citation information, license, and any tags that help with discoverability. This metadata is stored in a dataset card (README.md) in the repository and displayed on the Hub page.
Key considerations:
- The DatasetInfo object holds description, citation, homepage, license, and features
- Tags and language information improve Hub searchability
- A well-structured dataset card follows the Hugging Face dataset card template
- Metadata can be set programmatically, for example by passing a DatasetInfo object through the info argument when constructing the dataset
Step 5: Push to the Hugging Face Hub
Upload the dataset to the Hub using the push_to_hub method. This creates the repository if it does not exist, converts the data to Parquet format, shards large datasets into manageable files, embeds external media content, and commits the resulting files to the repository over the Hub's HTTP API.
Key considerations:
- Authentication is handled via a Hub token (environment variable or explicit parameter)
- Data is uploaded as Parquet files (no git-lfs required)
- max_shard_size controls file splitting for large datasets (default 500MB per shard)
- embed_external_files converts file path references to inline bytes for portability
- create_pr=True creates a pull request instead of committing directly
- Private repositories require setting private=True during creation
- num_proc enables parallel Parquet writing for faster uploads
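The parameters above can be gathered in a thin wrapper. This is a sketch rather than a prescribed pattern: `publish` is a hypothetical helper name, reading the token from the `HF_TOKEN` environment variable is an assumption about the local setup, and nothing is uploaded until the function is called.

```python
import os


def publish(dataset, repo_id, private=True, open_pr=False):
    """Push a Dataset or DatasetDict to the Hub via its push_to_hub method."""
    return dataset.push_to_hub(
        repo_id,
        private=private,  # applied when the repository is created
        token=os.environ.get("HF_TOKEN"),  # None falls back to a saved login
        max_shard_size="500MB",  # same value as the documented default
        embed_external_files=True,  # store media bytes inside the Parquet files
        create_pr=open_pr,  # True opens a pull request instead of committing
    )
```

A call such as `publish(my_dataset, "your-username/my-dataset")` would then perform the upload, where `my_dataset` and the repository id are placeholders for your own objects.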
Step 6: Verify the Published Dataset
After publishing, verify the dataset is correctly accessible by loading it back from the Hub and inspecting the result. Check that all splits, configurations, features, and example data are intact and match the original.
Key considerations:
- Load the published dataset to verify round-trip integrity
- Check the Hub page for correct dataset card rendering
- Verify the Dataset Viewer shows expected preview data
- Test that configurations and splits are properly discoverable
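A round-trip check might be sketched as below. Both function names are hypothetical helpers, and the row comparison uses plain equality, which suits text and numeric columns but may not suit decoded media such as images.

```python
def find_mismatches(original, reloaded, sample_size=5):
    """Compare two split-name -> dataset mappings and describe any differences."""
    problems = []
    missing = set(original) - set(reloaded)
    if missing:
        problems.append(f"missing splits: {sorted(missing)}")
    for split in sorted(set(original) & set(reloaded)):
        before, after = original[split], reloaded[split]
        if len(before) != len(after):
            problems.append(f"{split}: {len(before)} rows before, {len(after)} after")
            continue
        if before.column_names != after.column_names:
            problems.append(f"{split}: column names differ")
            continue
        # Spot-check the leading rows instead of scanning the whole split.
        for i in range(min(sample_size, len(before))):
            if before[i] != after[i]:
                problems.append(f"{split}: row {i} differs")
    return problems


def verify_published(repo_id, original, config_name=None):
    """Reload a published dataset from the Hub and compare it with the original."""
    # Imported here so find_mismatches stays usable without the library.
    from datasets import load_dataset

    reloaded = load_dataset(repo_id, name=config_name)
    return find_mismatches(original, reloaded)
```

An empty list from `verify_published` suggests that the splits, row counts, column names, and sampled rows survived the upload; the Hub page and Dataset Viewer still need a manual look.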