Implementation:Datajuicer Data juicer Package Init Exports

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Software_Architecture
Last Updated 2026-02-14 17:00 GMT

Overview

This page documents the pattern for registering a new operator in the Data-Juicer framework: importing its class in the __init__.py of the operator package it belongs to.

Description

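Each operator type subdirectory (data_juicer/ops/filter/, data_juicer/ops/mapper/, etc.) has an __init__.py file that imports all operator classes in that subdirectory. Adding a new operator requires adding an import line to the corresponding __init__.py. Importing the operator's module executes its class definition, which runs the @OPERATORS.register_module() decorator and registers the class in the global operator registry. Without that import line, the module is never loaded by the package and the operator stays unregistered.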
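The registration mechanism described above can be illustrated with a minimal, self-contained sketch. The Registry class here is a hypothetical stand-in written for this page, not Data-Juicer's actual implementation; only the register_module decorator name and the modules mapping mirror names used on this page.

```python
class Registry:
    """Hypothetical stand-in for a name-to-class operator registry."""

    def __init__(self, name):
        self.name = name
        self.modules = {}  # maps registered name -> class

    def register_module(self, module_name=None):
        """Return a decorator that records the class under module_name."""
        def decorator(cls):
            key = module_name or cls.__name__
            if key in self.modules:
                raise KeyError(f"{key!r} is already registered in {self.name}")
            self.modules[key] = cls
            return cls  # the class itself is returned unchanged
        return decorator


OPERATORS = Registry("Operators")


# The decorator runs when this class statement executes. For a real operator
# file, that happens the first time the module is imported.
@OPERATORS.register_module("my_new_filter")
class MyNewFilter:
    pass


print(sorted(OPERATORS.modules))  # ['my_new_filter']
```

Because registration is a side effect of executing the class statement, nothing is registered until some import causes the module to load.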

Usage

After creating a new operator file in the appropriate subdirectory, add an import for the new class in the package's __init__.py.

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/ops/filter/__init__.py, data_juicer/ops/mapper/__init__.py, etc.
  • Lines: L1-122 (filter), L1-220 (mapper)

Interface Specification

# data_juicer/ops/filter/__init__.py (pattern)
from .alphanumeric_filter import AlphanumericFilter
from .average_line_length_filter import AverageLineLengthFilter
from .character_repetition_filter import CharacterRepetitionFilter
# ... all existing filters
from .my_new_filter import MyNewFilter  # Add for new operator

I/O Contract

Inputs

Name | Type | Required | Description
import statement | Python code | Yes | Import of the operator class from its module file

Outputs

Name | Type | Description
registration | Side effect | Operator class registered in OPERATORS.modules at import time
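The import-time side effect can be observed in isolation. The sketch below builds a throwaway package in a temporary directory (the package name demo_ops, its registry module, and both filter classes are hypothetical and unrelated to Data-Juicer's code) and shows that only the operator listed in __init__.py ends up registered.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path


def write(path, source):
    """Write dedented source text to path."""
    path.write_text(textwrap.dedent(source))


with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_ops"  # hypothetical package name
    pkg.mkdir()
    # Registry shared by every operator module in the package.
    write(pkg / "registry.py", """
        modules = {}

        def register_module(name):
            def decorator(cls):
                modules[name] = cls
                return cls
            return decorator
    """)
    # Two operator files; only one of them is imported by __init__.py.
    write(pkg / "listed_filter.py", """
        from .registry import register_module

        @register_module('listed_filter')
        class ListedFilter:
            pass
    """)
    write(pkg / "forgotten_filter.py", """
        from .registry import register_module

        @register_module('forgotten_filter')
        class ForgottenFilter:
            pass
    """)
    write(pkg / "__init__.py", """
        from .listed_filter import ListedFilter
    """)

    sys.path.insert(0, tmp)
    try:
        importlib.import_module("demo_ops")  # runs __init__.py
        registry = importlib.import_module("demo_ops.registry")
        registered = sorted(registry.modules)
    finally:
        sys.path.remove(tmp)

print(registered)  # ['listed_filter']
```

The forgotten_filter.py file exists on disk and carries the decorator, yet it never appears in the registry because no import ever executes it.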

Usage Examples

Adding a New Filter

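# 1. Create data_juicer/ops/filter/my_quality_filter.py
from data_juicer.ops.base_op import OPERATORS, Filter

@OPERATORS.register_module('my_quality_filter')
class MyQualityFilter(Filter):
    def __init__(self, min_score=0.5, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Threshold supplied from the YAML config (see step 3)
        self.min_score = min_score

    def compute_stats_single(self, sample, context=False):
        # ... compute 'my_metric' and store it under sample['__dj__stats__']
        return sample

    def process_single(self, sample):
        # Keep the sample only when its metric exceeds the threshold
        return sample['__dj__stats__']['my_metric'] > self.min_score

# 2. Add to data_juicer/ops/filter/__init__.py:
# from .my_quality_filter import MyQualityFilter

# 3. Now usable in YAML:
# process:
#   - my_quality_filter:
#       min_score: 0.5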
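To make the link between the YAML entry and the registry concrete, the following sketch resolves a parsed process list against a name-to-class mapping. It is an illustration under stated assumptions, not Data-Juicer's actual loader: MODULES, build_ops, and the simplified MyQualityFilter are hypothetical names, and the process list is written directly as the Python objects a YAML parser would produce.

```python
# Hypothetical registry contents. In Data-Juicer this role is played by
# OPERATORS.modules, populated as a side effect of the package imports.
class MyQualityFilter:
    def __init__(self, min_score=0.5):
        self.min_score = min_score


MODULES = {"my_quality_filter": MyQualityFilter}

# The YAML fragment above, as it would look after parsing.
process = [{"my_quality_filter": {"min_score": 0.5}}]


def build_ops(process_list, modules):
    """Instantiate one operator per entry, failing clearly on unknown names."""
    ops = []
    for entry in process_list:
        # Each entry is a single-key mapping: {operator_name: parameters}.
        (op_name, params), = entry.items()
        if op_name not in modules:
            raise KeyError(
                f"operator {op_name!r} is not registered; "
                "is it imported in the package __init__.py?"
            )
        ops.append(modules[op_name](**(params or {})))
    return ops


ops = build_ops(process, MODULES)
print(type(ops[0]).__name__, ops[0].min_score)  # MyQualityFilter 0.5
```

A lookup by name like this is why a missing import line typically surfaces as an "unknown operator" style error at configuration time rather than as an ImportError.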

Related Pages

Implements Principle