Principle:Datajuicer Data juicer Data Format Loading

From Leeroopedia
Domains Data_Processing, Data_Loading
Last Updated 2026-02-14 17:00 GMT

Overview

A registry-based format detection and loading pattern that provides format-specific dataset loaders with automatic format selection, unified output representation, and support for both local and remote data sources.

Pattern

The format loading system uses a FORMATTERS registry (separate from the OPERATORS registry) with a base class hierarchy and automatic format detection:

1. Base Class Hierarchy -- BaseFormatter defines the abstract load_dataset interface. LocalFormatter extends it for file/directory loading via HuggingFace's load_dataset, while RemoteFormatter handles HuggingFace Hub datasets.

2. Format-Specific Subclasses -- Each format (CSV, JSON, Parquet, TSV, Text) extends LocalFormatter with a SUFFIXES class attribute listing supported file extensions and a type attribute for the HuggingFace reader. Subclasses register themselves via @FORMATTERS.register_module().

3. Automatic Format Detection -- The load_formatter factory function scans the dataset path for files, scores each registered formatter by the number of files whose extension appears in its SUFFIXES, and instantiates the highest-scoring formatter.

4. Format Unification -- The unify_format function standardizes all loaded datasets: validates text key presence, filters empty samples, wraps as NestedDataset, and converts relative media paths to absolute paths.
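The unification step can be illustrated with a minimal sketch that works on a plain list of dicts instead of a real dataset object. The function name unify_format_sketch and the parameters text_keys, media_keys, and dataset_dir are assumptions made for illustration, not Data-Juicer's actual signature, and the NestedDataset wrapping is omitted.

```python
import os


def unify_format_sketch(samples, text_keys=("text",), media_keys=("images",),
                        dataset_dir="."):
    """Toy stand-in for the unification step (hypothetical helper)."""
    if not samples:
        return []

    # 1. Validate text key presence. The first sample stands in for the
    #    dataset's column list, which a real dataset object would expose.
    for key in text_keys:
        if key not in samples[0]:
            raise ValueError(f"text key {key!r} not found in dataset")

    # 2. Filter out samples whose text fields are missing or empty.
    kept = [s for s in samples if all(s.get(k) for k in text_keys)]

    # 3. Convert relative media paths to absolute paths, resolved against
    #    the directory the dataset was loaded from.
    base = os.path.abspath(dataset_dir)
    unified = []
    for sample in kept:
        sample = dict(sample)  # copy so the caller's data is not mutated
        for key in media_keys:
            if key in sample:
                sample[key] = [
                    p if os.path.isabs(p) else os.path.join(base, p)
                    for p in sample[key]
                ]
        unified.append(sample)
    return unified


samples = [
    {"text": "a caption", "images": ["img/a.png"]},
    {"text": "", "images": []},  # empty text, dropped in step 2
]
unified = unify_format_sketch(samples, dataset_dir="/data/set")
print(len(unified))  # 1
```

Running the same checks on every loaded dataset, whatever its source format, lets downstream code assume that the text fields exist and that media paths are resolvable.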

This pattern enables adding new format support by simply creating a new subclass with the appropriate SUFFIXES and registering it.
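A toy version of the registry and the detection step shows why a new format needs only a subclass. Everything here is a simplified stand-in: the Registry class, find_files, and load_formatter_sketch are assumptions written for this example, LogFormatter is a hypothetical format, and load_dataset only lists matching files where the real loader would delegate to HuggingFace.

```python
import os
import tempfile


class Registry:
    """Minimal name-to-class registry (stand-in, not Data-Juicer's class)."""

    def __init__(self):
        self.modules = {}

    def register_module(self):
        def decorator(cls):
            self.modules[cls.__name__] = cls
            return cls
        return decorator


FORMATTERS = Registry()


def find_files(path, suffixes):
    """Return files under path (or path itself) with a matching extension."""
    if os.path.isfile(path):
        candidates = [path]
    else:
        candidates = [os.path.join(root, name)
                      for root, _dirs, names in os.walk(path)
                      for name in names]
    return [f for f in candidates
            if os.path.splitext(f)[1].lower() in suffixes]


class BaseFormatter:
    def load_dataset(self):
        raise NotImplementedError


class LocalFormatter(BaseFormatter):
    SUFFIXES = []
    type = None  # would name the HuggingFace reader in a real loader

    def __init__(self, dataset_path):
        self.dataset_path = dataset_path

    def load_dataset(self):
        # Sketch only: list the files a real loader would hand to the reader.
        return sorted(find_files(self.dataset_path, self.SUFFIXES))


@FORMATTERS.register_module()
class JsonFormatter(LocalFormatter):
    SUFFIXES = [".json", ".jsonl"]
    type = "json"


@FORMATTERS.register_module()
class CsvFormatter(LocalFormatter):
    SUFFIXES = [".csv"]
    type = "csv"


def load_formatter_sketch(dataset_path):
    """Score each registered formatter by matching file count, pick the best."""
    scores = {name: len(find_files(dataset_path, cls.SUFFIXES))
              for name, cls in FORMATTERS.modules.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        raise ValueError(f"no registered formatter matches {dataset_path!r}")
    return FORMATTERS.modules[best](dataset_path)


# Adding a format: a hypothetical subclass with its own SUFFIXES and type.
@FORMATTERS.register_module()
class LogFormatter(LocalFormatter):
    SUFFIXES = [".log"]
    type = "text"


with tempfile.TemporaryDirectory() as tmp:
    for fname in ("a.jsonl", "b.jsonl", "c.csv"):
        open(os.path.join(tmp, fname), "w").close()
    formatter = load_formatter_sketch(tmp)
    print(type(formatter).__name__)  # JsonFormatter
```

Because detection iterates over whatever the registry contains, LogFormatter takes part in scoring as soon as its class definition runs, and load_formatter_sketch itself needs no change.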

Key Characteristics

  • FORMATTERS registry kept separate from the OPERATORS registry used for operators
  • Convention over configuration: format auto-detected from file extensions
  • BaseFormatter -> LocalFormatter/RemoteFormatter class hierarchy
  • Most format-specific subclasses define only SUFFIXES and type and inherit the loading logic from LocalFormatter
  • Unified output: all formats produce NestedDataset with validated text fields
  • Supports local files, directories, and HuggingFace Hub datasets
  • Parallel loading via configurable num_proc parameter
  • Special handling for PDF and DOCX text extraction in TextFormatter
