Implementation:Datajuicer Data juicer Package Init Exports
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Software_Architecture |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Documents the pattern for registering a new operator in the Data-Juicer framework by importing it in the __init__.py file of its operator package.
Description
Each operator type subdirectory (data_juicer/ops/filter/, data_juicer/ops/mapper/, etc.) has an __init__.py file that imports all operator classes in that package. Adding a new operator requires adding an import line to the corresponding __init__.py. Importing the operator's module executes its class definition, which runs the @OPERATORS.register_module() decorator and registers the class in the global operator registry. Without the import, the module is never loaded and the operator is not registered.
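The import-time registration described above can be illustrated with a small stand-in registry. This is a minimal sketch, not Data-Juicer's actual implementation: the Registry class here is hypothetical and only mirrors the register_module() and modules names mentioned on this page.

```python
# Minimal stand-in for a registry; NOT Data-Juicer's actual implementation.
class Registry:
    def __init__(self, name):
        self.name = name
        self.modules = {}  # maps registered name -> class

    def register_module(self, module_name=None):
        """Return a class decorator that stores the class under module_name."""
        def decorator(cls):
            key = module_name or cls.__name__
            if key in self.modules:
                raise KeyError(f"{key!r} is already registered in {self.name}")
            self.modules[key] = cls
            return cls  # the class itself is returned unchanged
        return decorator


OPERATORS = Registry("operators")

# Nothing is registered until a decorated class definition is executed.
registered_before = list(OPERATORS.modules)


@OPERATORS.register_module("my_new_filter")
class MyNewFilter:
    """Toy operator; in a real package this would live in its own module."""


# Executing the class statement above ran the decorator as a side effect.
registered_after = list(OPERATORS.modules)
```

In a real package the decorated class sits in its own file, so the class statement only runs when something imports that file. That is the role of the import line in __init__.py.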
Usage
After creating a new operator file in the appropriate subdirectory, add an import for the new class in the package's __init__.py.
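Because a forgotten import leaves an operator silently unregistered, a small check can help catch the omission. The helper below is a hypothetical sketch (find_unimported_modules is not part of Data-Juicer) that uses only the standard library. It assumes the package follows the `from .module import Class` pattern shown on this page and does not detect other import styles.

```python
import ast
from pathlib import Path


def find_unimported_modules(package_dir):
    """Return names of sibling .py modules that __init__.py never imports from."""
    package_dir = Path(package_dir)
    init_file = package_dir / "__init__.py"
    tree = ast.parse(init_file.read_text(encoding="utf-8"))

    imported = set()
    for node in ast.walk(tree):
        # `from .foo import Bar` parses as ImportFrom with level == 1, module == "foo".
        if isinstance(node, ast.ImportFrom) and node.level == 1 and node.module:
            imported.add(node.module.split(".")[0])

    # Every module file in the package directory except __init__.py itself.
    present = {
        path.stem
        for path in package_dir.glob("*.py")
        if path.name != "__init__.py"
    }
    return sorted(present - imported)
```

Calling find_unimported_modules("data_juicer/ops/filter") from a repository checkout would list filter modules that have no matching relative import. Helper modules that intentionally define no operator would also appear in the result and need to be ignored by hand.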
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/ops/filter/__init__.py, data_juicer/ops/mapper/__init__.py, etc.
- Lines: L1-122 (filter), L1-220 (mapper)
Interface Specification
# data_juicer/ops/filter/__init__.py (pattern)
from .alphanumeric_filter import AlphanumericFilter
from .average_line_length_filter import AverageLineLengthFilter
from .character_repetition_filter import CharacterRepetitionFilter
# ... all existing filters
from .my_new_filter import MyNewFilter # Add for new operator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| import statement | Python code | Yes | Import of operator class from its module file |
Outputs
| Name | Type | Description |
|---|---|---|
| registration | Side effect | Operator class registered in OPERATORS.modules at import time |
Usage Examples
Adding a New Filter
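# 1. Create data_juicer/ops/filter/my_quality_filter.py
from data_juicer.ops.base_op import OPERATORS, Filter


@OPERATORS.register_module('my_quality_filter')
class MyQualityFilter(Filter):
    def __init__(self, min_score=0.5, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Threshold supplied from the YAML config (see step 3)
        self.min_score = min_score

    def compute_stats_single(self, sample, context=False):
        # ... compute 'my_metric' and store it in sample['__dj__stats__']
        return sample

    def process_single(self, sample):
        # Keep the sample only if its metric exceeds the threshold
        return sample['__dj__stats__']['my_metric'] > self.min_score

# 2. Add to data_juicer/ops/filter/__init__.py:
#    from .my_quality_filter import MyQualityFilter

# 3. Now usable in YAML:
#    process:
#      - my_quality_filter:
#          min_score: 0.5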
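To show why the registered name matters for the YAML step, the sketch below resolves a parsed process list into operator instances through a name-to-class mapping. It is a simplified illustration under stated assumptions: OPERATOR_REGISTRY, register, and build_ops are hypothetical names, and Data-Juicer's own config loading may work differently.

```python
# Hypothetical registry mapping operator names to classes.
OPERATOR_REGISTRY = {}


def register(name):
    """Class decorator that records the class under the given operator name."""
    def decorator(cls):
        OPERATOR_REGISTRY[name] = cls
        return cls
    return decorator


@register("my_quality_filter")
class MyQualityFilter:
    """Toy stand-in for the filter defined in the example above."""

    def __init__(self, min_score=0.5):
        self.min_score = min_score


def build_ops(process_list):
    """Turn [{'op_name': {kwargs}}, ...] into operator instances."""
    ops = []
    for entry in process_list:
        (name, params), = entry.items()  # each entry holds exactly one operator
        if name not in OPERATOR_REGISTRY:
            raise KeyError(
                f"operator {name!r} is not registered; was its module imported?"
            )
        # Keys under the operator name become constructor keyword arguments.
        ops.append(OPERATOR_REGISTRY[name](**(params or {})))
    return ops


# Python equivalent of a parsed YAML `process` list.
process = [{"my_quality_filter": {"min_score": 0.7}}]
ops = build_ops(process)
```

The lookup fails for any name that was never registered, which is the symptom one would expect to see if the import in __init__.py is missing.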