Principle:Datajuicer Data juicer Data Format Loading
| Domains | Data_Processing, Data_Loading |
|---|---|
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A registry-based format detection and loading pattern that provides format-specific dataset loaders with automatic format selection, unified output representation, and support for both local and remote data sources.
Pattern
The format loading system uses a FORMATTERS registry (separate from the OPERATORS registry) with a base class hierarchy and automatic format detection:
1. Base Class Hierarchy -- BaseFormatter defines the abstract load_dataset interface. LocalFormatter extends it to load files and directories through the load_dataset function of the HuggingFace datasets library, while RemoteFormatter handles datasets hosted on the HuggingFace Hub.
2. Format-Specific Subclasses -- Each format (CSV, JSON, Parquet, TSV, Text) extends LocalFormatter with a SUFFIXES class attribute listing supported file extensions and a type attribute for the HuggingFace reader. Subclasses register themselves via @FORMATTERS.register_module().
3. Automatic Format Detection -- The load_formatter factory function scans the dataset path for files, scores each registered formatter by the number of files whose extension appears in its SUFFIXES, and instantiates the formatter with the highest score.
4. Format Unification -- The unify_format function standardizes all loaded datasets: validates text key presence, filters empty samples, wraps as NestedDataset, and converts relative media paths to absolute paths.
With this pattern, adding support for a new format requires only a new LocalFormatter subclass that declares the appropriate SUFFIXES and type and registers itself with FORMATTERS.
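The registration and detection steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Data-Juicer implementation: the Registry class, find_files, and pick_formatter are hypothetical stand-ins (pick_formatter plays the role of load_formatter), the suffix lists are examples, and no data is actually read.

```python
import os
from collections import Counter


class Registry:
    """Tiny name -> class registry (hypothetical stand-in for FORMATTERS)."""

    def __init__(self, name):
        self.name = name
        self.modules = {}

    def register_module(self):
        def decorator(cls):
            self.modules[cls.__name__] = cls
            return cls
        return decorator


FORMATTERS = Registry("formatters")


class LocalFormatter:
    """Simplified base class: it only records what would be loaded."""

    SUFFIXES = []  # file extensions this formatter accepts
    type = None    # reader name a real loader would hand to its backend

    def __init__(self, dataset_path, files):
        self.dataset_path = dataset_path
        self.files = files


@FORMATTERS.register_module()
class CsvFormatter(LocalFormatter):
    SUFFIXES = [".csv"]
    type = "csv"


@FORMATTERS.register_module()
class JsonFormatter(LocalFormatter):
    SUFFIXES = [".json", ".jsonl"]  # example suffixes, assumed for this sketch
    type = "json"


def find_files(dataset_path):
    """Return the file itself, or every file under a directory."""
    if os.path.isfile(dataset_path):
        return [dataset_path]
    found = []
    for root, _dirs, names in os.walk(dataset_path):
        found.extend(os.path.join(root, name) for name in names)
    return sorted(found)


def pick_formatter(dataset_path):
    """Score each registered formatter by matching-file count; build the best."""
    files = find_files(dataset_path)
    ext_counts = Counter(os.path.splitext(f)[1].lower() for f in files)
    scores = {
        name: sum(ext_counts[suffix] for suffix in cls.SUFFIXES)
        for name, cls in FORMATTERS.modules.items()
    }
    # Ties go to the formatter that was registered first.
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        raise ValueError(f"no registered formatter matches files in {dataset_path!r}")
    cls = FORMATTERS.modules[best]
    matched = [f for f in files if os.path.splitext(f)[1].lower() in cls.SUFFIXES]
    return cls(dataset_path, matched)
```

In this sketch, a directory holding two .jsonl files and one .csv file resolves to JsonFormatter, because it matches two files against one. Registering another subclass with its own SUFFIXES makes it a detection candidate without any change to pick_formatter.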
Key Characteristics
- A dedicated FORMATTERS registry, kept separate from the OPERATORS registry
- Convention over configuration: format auto-detected from file extensions
- BaseFormatter -> LocalFormatter/RemoteFormatter class hierarchy
- Format-specific subclasses typically define only SUFFIXES and type, inheriting the loading logic from LocalFormatter
- Unified output: all formats produce NestedDataset with validated text fields
- Supports local files, directories, and HuggingFace Hub datasets
- Parallel loading via configurable num_proc parameter
- Special handling for PDF and DOCX text extraction in TextFormatter
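The unified-output step can be illustrated with a short sketch over plain Python dictionaries. It is an assumption-laden approximation of what unify_format is described as doing, not its actual code: unify_records, the "text" and "images" keys, and the rule that a sample is empty when every text field is missing or blank are all hypothetical, and the NestedDataset wrapping is left out.

```python
import os


def unify_records(records, text_keys=("text",), media_keys=("images",), dataset_dir="."):
    """Validate text keys, drop empty samples, and absolutize media paths."""
    if not records:
        return []

    # 1. Validate that every required text key exists in the data.
    for key in text_keys:
        if key not in records[0]:
            raise ValueError(f"text key {key!r} not found in dataset fields")

    # 2. Filter samples whose text fields are all missing or blank
    #    (this definition of "empty" is an assumption of the sketch).
    def has_text(record):
        return any(
            isinstance(record.get(key), str) and record[key].strip()
            for key in text_keys
        )

    kept = [dict(record) for record in records if has_text(record)]

    # 3. Resolve relative media paths against the dataset directory.
    base = os.path.abspath(dataset_dir)
    for record in kept:
        for key in media_keys:
            if key in record:
                record[key] = [
                    path if os.path.isabs(path) else os.path.join(base, path)
                    for path in record[key]
                ]
    return kept
```

Copying each record before rewriting its paths keeps the input list unchanged, which makes the step safe to rerun on the same data.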
Implementations
- Implementation:Datajuicer_Data_juicer_BaseFormatter
- Implementation:Datajuicer_Data_juicer_CsvFormatter
- Implementation:Datajuicer_Data_juicer_JsonFormatter
- Implementation:Datajuicer_Data_juicer_ParquetFormatter
- Implementation:Datajuicer_Data_juicer_TsvFormatter
- Implementation:Datajuicer_Data_juicer_TextFormatter
- Implementation:Datajuicer_Data_juicer_EmptyFormatter
- Implementation:Datajuicer_Data_juicer_Load_Formatter