Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Huggingface Datasets Extension Module Mapping

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

Mapping file extensions to dataset builder modules enables automatic format detection so users can load data files without explicitly specifying the format.

Description

The HuggingFace Datasets library supports a wide variety of data formats: CSV, JSON, Parquet, Arrow, text, XML, HDF5, WebDataset (tar), and folder-based formats for images, audio, video, PDFs, and NIfTI medical images. When a user points load_dataset at a directory or a set of data files, the library needs to determine which builder module to use for parsing.

Extension Module Mapping provides this automatic format detection by maintaining a registry that maps file extensions (e.g. ".csv", ".parquet", ".jsonl") to their corresponding builder module names and default configuration overrides. For example, ".tsv" maps to the "csv" module with {"sep": "\t"} as a default kwarg.

This mapping is also used in reverse (via _MODULE_TO_EXTENSIONS) to filter candidate data files when a specific module is already known, and to determine which files should be included during dataset discovery.

Usage

Extension Module Mapping is used implicitly whenever:

  • You call load_dataset with a local directory path and no explicit builder name -- the library scans the directory's file extensions to select the appropriate builder.
  • You call load_dataset with data_files -- the extensions of the specified files determine the builder.
  • The dataset_module_factory needs to infer the correct packaged module for a Hub repository that contains data files without a loading script.

Theoretical Basis

The mapping follows a registry pattern with two complementary structures:

  1. Extension-to-Module (_EXTENSION_TO_MODULE): A dictionary mapping file extension strings to tuples of (module_name, default_kwargs). This is the primary lookup used during format detection.
  2. Module-to-Extensions (_MODULE_TO_EXTENSIONS): The inverse mapping, constructed by iterating over the extension-to-module dictionary. Used for filtering data files when the module is already known.

The extension registry is built in layers:

  • Core formats are registered directly (CSV, JSON, Parquet, Arrow, text, XML, HDF5, tar, eval, lance).
  • Folder-based formats dynamically register their extensions by reading class-level EXTENSIONS attributes from ImageFolder, AudioFolder, VideoFolder, PdfFolder, and NiftiFolder. Both lowercase and uppercase variants are registered.
Lookup pattern:
  extension = get_file_extension(file_path)
  if extension in _EXTENSION_TO_MODULE:
      module_name, default_kwargs = _EXTENSION_TO_MODULE[extension]
      builder = load_builder(module_name, **default_kwargs)
  else:
      raise UnsupportedFormatError(extension)

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment