Principle:Datajuicer Data juicer Data Format Loading

Domains	Data_Processing, Data_Loading
Last Updated	2026-02-14 17:00 GMT

Overview

A registry-based format detection and loading pattern that provides format-specific dataset loaders with automatic format selection, unified output representation, and support for both local and remote data sources.

Pattern

The format loading system uses a FORMATTERS registry (separate from the OPERATORS registry) with a base class hierarchy and automatic format detection:

1. Base Class Hierarchy -- BaseFormatter defines the abstract load_dataset interface. LocalFormatter extends it for file/directory loading via HuggingFace's load_dataset, while RemoteFormatter handles HuggingFace Hub datasets.

2. Format-Specific Subclasses -- Each format (CSV, JSON, Parquet, TSV, Text) extends LocalFormatter with a SUFFIXES class attribute listing supported file extensions and a type attribute for the HuggingFace reader. Subclasses register themselves via @FORMATTERS.register_module().

3. Automatic Format Detection -- The load_formatter factory function scans the dataset path for files, scores each registered formatter by the number of files whose extension appears in its SUFFIXES, and instantiates the highest-scoring formatter.

4. Format Unification -- The unify_format function standardizes all loaded datasets: validates text key presence, filters empty samples, wraps as NestedDataset, and converts relative media paths to absolute paths.

With this pattern, supporting a new format only requires creating a LocalFormatter subclass with the appropriate SUFFIXES and type, and registering it with @FORMATTERS.register_module().
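
The registration and detection steps above can be sketched in plain Python. This is a minimal illustration, not the Data-Juicer implementation: the Registry class, the find_files helper, the constructor signature, and the tie-breaking and error behavior are all assumptions made for the example. Only the names FORMATTERS, register_module, LocalFormatter, SUFFIXES, and load_formatter come from the description above, and actual data loading through HuggingFace is left out.

```python
import os
import tempfile


class Registry:
    """Toy name -> class registry (hypothetical stand-in for the real one)."""

    def __init__(self, name):
        self.name = name
        self.modules = {}

    def register_module(self, module_name=None):
        # Used as @REGISTRY.register_module(); stores the class under its name.
        def decorator(cls):
            self.modules[module_name or cls.__name__] = cls
            return cls
        return decorator


FORMATTERS = Registry("Formatters")


class LocalFormatter:
    """Simplified base class: subclasses only override SUFFIXES."""
    SUFFIXES = []

    def __init__(self, dataset_path):  # assumed signature
        self.dataset_path = dataset_path


@FORMATTERS.register_module()
class CsvFormatter(LocalFormatter):
    SUFFIXES = [".csv"]


@FORMATTERS.register_module()
class JsonFormatter(LocalFormatter):
    SUFFIXES = [".json", ".jsonl"]


def find_files(dataset_path):
    """Hypothetical helper: list a single file or every file under a directory."""
    if os.path.isfile(dataset_path):
        return [dataset_path]
    found = []
    for root, _dirs, files in os.walk(dataset_path):
        found.extend(os.path.join(root, f) for f in files)
    return found


def load_formatter(dataset_path):
    """Score each registered formatter by matching-file count; build the best."""
    files = find_files(dataset_path)
    scores = {}
    for name, cls in FORMATTERS.modules.items():
        suffixes = tuple(cls.SUFFIXES)
        scores[name] = sum(1 for f in files if f.lower().endswith(suffixes))
    # max() keeps the first maximum, so ties go to the earliest registration.
    best = max(scores, key=scores.get)
    if scores[best] == 0:  # assumed behavior when nothing matches
        raise ValueError(f"no registered formatter matches {dataset_path!r}")
    return FORMATTERS.modules[best](dataset_path)


# Usage: two .jsonl files outweigh one .csv file.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("a.jsonl", "b.jsonl", "c.csv"):
        open(os.path.join(tmp, fname), "w").close()
    formatter = load_formatter(tmp)
    print(type(formatter).__name__)  # JsonFormatter
```

In this sketch, adding a format means declaring one more decorated subclass; load_formatter itself does not change.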

Key Characteristics

FORMATTERS registry kept separate from the OPERATORS registry used for operators
Convention over configuration: format auto-detected from file extensions
BaseFormatter -> LocalFormatter/RemoteFormatter class hierarchy
Format-specific subclasses define only SUFFIXES and type, inheriting all loading logic
Unified output: all formats produce NestedDataset with validated text fields
Supports local files, directories, and HuggingFace Hub datasets
Parallel loading via configurable num_proc parameter
Special handling for PDF and DOCX text extraction in TextFormatter
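
The unified-output characteristic can be illustrated with a small sketch over a list of dictionaries. It is not the actual unify_format function: the name unify_samples, its parameters (text_key, media_key, dataset_dir), and the rule that None or whitespace-only text counts as "empty" are assumptions for illustration. Wrapping the result as a NestedDataset is omitted.

```python
import os


def unify_samples(samples, text_key="text", media_key="images", dataset_dir="."):
    """Validate the text key, drop empty samples, and absolutize media paths."""
    # 1. Validate that every sample carries the text key.
    missing = [i for i, s in enumerate(samples) if text_key not in s]
    if missing:
        raise KeyError(f"text key {text_key!r} missing in samples {missing}")

    unified = []
    for sample in samples:
        text = sample[text_key]
        # 2. Filter empty samples (assumed: None or whitespace-only text).
        if text is None or not str(text).strip():
            continue
        out = dict(sample)  # copy so the input is left untouched
        # 3. Resolve relative media paths against the dataset directory.
        if media_key in out:
            out[media_key] = [
                p if os.path.isabs(p)
                else os.path.abspath(os.path.join(dataset_dir, p))
                for p in out[media_key]
            ]
        unified.append(out)
    return unified


# Usage: the blank sample is dropped and the relative image path is resolved.
raw = [
    {"text": "a photo of a cat", "images": ["img/cat.png"]},
    {"text": "   "},
    {"text": "plain text sample"},
]
clean = unify_samples(raw, dataset_dir="data")
print(len(clean))
print(clean[0]["images"][0])
```

Copying each sample before modifying it keeps the step free of side effects on the caller's data, which makes it easier to test in isolation.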

