Principle:Huggingface Datasets Extension Module Mapping
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Mapping file extensions to dataset builder modules enables automatic format detection so users can load data files without explicitly specifying the format.
Description
The HuggingFace Datasets library supports a wide variety of data formats: CSV, JSON, Parquet, Arrow, text, XML, HDF5, WebDataset (tar), and folder-based formats for images, audio, video, PDFs, and NIfTI medical images. When a user points load_dataset at a directory or a set of data files, the library needs to determine which builder module to use for parsing.
Extension Module Mapping provides this automatic format detection by maintaining a registry that maps file extensions (e.g. ".csv", ".parquet", ".jsonl") to their corresponding builder module names and default configuration overrides. For example, ".tsv" maps to the "csv" module with {"sep": "\t"} as a default kwarg.
This mapping is also used in reverse (via _MODULE_TO_EXTENSIONS) to filter candidate data files when a specific module is already known, and to determine which files should be included during dataset discovery.
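As a rough illustration of the reverse lookup, the sketch below narrows a list of file names to those whose extension belongs to a module that is already known. The dictionary contents and the helper name filter_files_for_module are illustrative assumptions, not the library's actual registry or API.

```python
import os

# Illustrative subset of a module-to-extensions mapping (assumed, not the library's full table).
MODULE_TO_EXTENSIONS = {
    "csv": [".csv", ".tsv"],
    "json": [".json", ".jsonl"],
    "parquet": [".parquet"],
}


def filter_files_for_module(file_names, module_name):
    """Keep only files whose extension is registered for module_name (hypothetical helper)."""
    allowed = set(MODULE_TO_EXTENSIONS.get(module_name, []))
    return [name for name in file_names if os.path.splitext(name)[1] in allowed]


files = ["train.csv", "test.tsv", "README.md", "extra.parquet"]
csv_files = filter_files_for_module(files, "csv")
```

Files with unrelated extensions, such as README.md here, are dropped from the candidate set.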
Usage
Extension Module Mapping is used implicitly whenever:
- You call load_dataset with a local directory path and no explicit builder name: the library scans the directory's file extensions to select the appropriate builder.
- You call load_dataset with data_files: the extensions of the specified files determine the builder.
- dataset_module_factory needs to infer the correct packaged module for a Hub repository that contains data files without a loading script.
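One plausible way to turn a set of files into a single builder choice is to count the recognised extensions and pick the most frequent one. The sketch below assumes this majority rule and uses a hypothetical helper, infer_module; the library's own inference logic may differ in detail.

```python
import os
from collections import Counter

# Illustrative subset of an extension-to-module mapping (assumed for this sketch).
EXTENSION_TO_MODULE = {
    ".csv": ("csv", {}),
    ".tsv": ("csv", {"sep": "\t"}),
    ".jsonl": ("json", {}),
    ".parquet": ("parquet", {}),
}


def infer_module(file_names):
    """Return (module_name, default_kwargs) for the most common recognised extension."""
    extensions = (os.path.splitext(name)[1] for name in file_names)
    counts = Counter(ext for ext in extensions if ext in EXTENSION_TO_MODULE)
    if not counts:
        raise ValueError("no file with a recognised extension")
    extension, _ = counts.most_common(1)[0]
    return EXTENSION_TO_MODULE[extension]


module_name, default_kwargs = infer_module(["a.tsv", "b.tsv", "notes.md", "c.jsonl"])
```

Unrecognised files such as notes.md are ignored rather than treated as an error, so stray documentation files do not block detection in this sketch.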
Theoretical Basis
The mapping follows a registry pattern with two complementary structures:
- Extension-to-Module (_EXTENSION_TO_MODULE): A dictionary mapping file extension strings to tuples of (module_name, default_kwargs). This is the primary lookup used during format detection.
- Module-to-Extensions (_MODULE_TO_EXTENSIONS): The inverse mapping, constructed by iterating over the extension-to-module dictionary. Used for filtering data files when the module is already known.
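The relationship between the two structures can be sketched as follows. The entries are a small assumed sample, and the names are written without the leading underscore to make clear this is not the library's own code.

```python
# Illustrative subset of the extension-to-module dictionary (assumed entries).
EXTENSION_TO_MODULE = {
    ".csv": ("csv", {}),
    ".tsv": ("csv", {"sep": "\t"}),
    ".json": ("json", {}),
    ".jsonl": ("json", {}),
}

# Build the inverse by iterating over the forward mapping;
# each module collects every extension that points to it.
MODULE_TO_EXTENSIONS = {}
for extension, (module_name, _default_kwargs) in EXTENSION_TO_MODULE.items():
    MODULE_TO_EXTENSIONS.setdefault(module_name, []).append(extension)
```

Deriving the inverse from the forward dictionary keeps a single source of truth: adding one extension entry updates both lookups.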
The extension registry is built in layers:
- Core formats are registered directly (CSV, JSON, Parquet, Arrow, text, XML, HDF5, tar, eval, lance).
- Folder-based formats dynamically register their extensions by reading class-level EXTENSIONS attributes from ImageFolder, AudioFolder, VideoFolder, PdfFolder, and NiftiFolder. Both lowercase and uppercase variants are registered.
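The layered construction might look roughly like the sketch below. The two stand-in classes, their extension lists, and the module names used as dictionary values are assumptions made for illustration; the real builder classes define their own EXTENSIONS.

```python
class ImageFolderSketch:
    """Stand-in for a folder-based builder; the extension list is an assumed sample."""
    EXTENSIONS = [".png", ".jpg"]


class AudioFolderSketch:
    """Second stand-in builder with an assumed extension list."""
    EXTENSIONS = [".wav", ".mp3"]


# Layer 1: core formats registered directly (illustrative subset).
EXTENSION_TO_MODULE = {".csv": ("csv", {}), ".parquet": ("parquet", {})}

# Layer 2: folder-based formats contribute their own extensions,
# registered in both lowercase and uppercase form.
folder_builders = [("imagefolder", ImageFolderSketch), ("audiofolder", AudioFolderSketch)]
for module_name, builder_cls in folder_builders:
    for extension in builder_cls.EXTENSIONS:
        EXTENSION_TO_MODULE[extension] = (module_name, {})
        EXTENSION_TO_MODULE[extension.upper()] = (module_name, {})
```

Registering the uppercase variant means a file named PHOTO.PNG resolves to the same module as photo.png without case-normalising every path at lookup time.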
Lookup pattern:
extension = get_file_extension(file_path)
if extension in _EXTENSION_TO_MODULE:
    module_name, default_kwargs = _EXTENSION_TO_MODULE[extension]
    builder = load_builder(module_name, **default_kwargs)
else:
    raise UnsupportedFormatError(extension)
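A runnable approximation of this lookup is sketched below. It stops at resolving the module name and default kwargs rather than instantiating a builder. The resolve_builder helper, the UnsupportedFormatError class, and the two registry entries are hypothetical and exist only for this example.

```python
import os


class UnsupportedFormatError(ValueError):
    """Raised when no module is registered for an extension (hypothetical exception)."""


# Illustrative registry entries (assumed subset).
EXTENSION_TO_MODULE = {
    ".csv": ("csv", {}),
    ".tsv": ("csv", {"sep": "\t"}),
}


def resolve_builder(file_path):
    """Return (module_name, default_kwargs) for file_path, or raise UnsupportedFormatError."""
    extension = os.path.splitext(file_path)[1]
    if extension in EXTENSION_TO_MODULE:
        module_name, default_kwargs = EXTENSION_TO_MODULE[extension]
        # Copy the kwargs so callers cannot mutate the shared registry entry.
        return module_name, dict(default_kwargs)
    raise UnsupportedFormatError(extension)


resolved = resolve_builder("data/train.tsv")
```

Returning a copy of the default kwargs is a defensive choice in this sketch: user-supplied overrides can then be merged in without altering the registry.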