Workflow:Datajuicer Data juicer Custom Operator Development
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Software_Development, LLM_Ops |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
End-to-end process for building, registering, and deploying a custom data processing operator within the Data-Juicer framework.
Description
This workflow guides developers through creating a new operator (Filter, Mapper, Deduplicator, or Selector) for Data-Juicer. Operators are the fundamental processing units in Data-Juicer, each responsible for a single data transformation or quality check. The framework provides base classes with standardized interfaces, a registry system for automatic discovery, and support for both single-sample and batched processing modes. Custom operators can be added directly to the source tree or loaded externally via the custom_operator_paths configuration. The workflow covers defining statistics keys, implementing the operator class, registering it, adding it to the package exports, and using it in a pipeline configuration.
Usage
Execute this workflow when the 200+ built-in operators do not cover your specific data processing need. Common scenarios include domain-specific text filters (e.g., medical terminology validation), custom quality metrics, specialized data transformations, and integration with external APIs or models for data annotation.
Execution Steps
Step 1: Identify Operator Type
Determine which base class to inherit from based on the semantics of the operation. Filter operators compute statistics and make keep/reject decisions per sample. Mapper operators transform sample content (text, images, metadata). Deduplicator operators identify and remove duplicate samples. Selector operators choose a subset based on ranking criteria. Beyond these four types, the framework also has Grouper operators, which partition samples into batches, and Aggregator operators, which combine information across samples.
Key considerations:
- Filter: Two-phase (compute_stats + process) for statistical quality gating
- Mapper: Single-phase transformation of sample fields
- Deduplicator: Requires hash computation and duplicate resolution
- Selector: Operates on the entire dataset to choose subsets
- Check the existing operator zoo first to avoid duplicating functionality
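The difference between the two-phase Filter contract and the single-phase Mapper contract can be sketched without the framework. The `Filter` and `Mapper` classes below are simplified stand-ins, not Data-Juicer's real base classes, and `MinWordsFilter`, `LowercaseMapper`, the `word_count` key, and the `run` helper are hypothetical names chosen for illustration.

```python
class Filter:
    """Stand-in base class: phase 1 computes a statistic, phase 2 decides."""

    def compute_stats(self, sample: dict) -> dict:
        raise NotImplementedError

    def process(self, sample: dict) -> bool:
        raise NotImplementedError


class Mapper:
    """Stand-in base class: a single phase that returns the edited sample."""

    def process(self, sample: dict) -> dict:
        raise NotImplementedError


class MinWordsFilter(Filter):
    def __init__(self, min_words: int = 3):
        self.min_words = min_words

    def compute_stats(self, sample):
        # Phase 1: record the statistic on the sample itself.
        stats = sample.setdefault("__dj__stats__", {})
        stats["word_count"] = len(sample["text"].split())
        return sample

    def process(self, sample):
        # Phase 2: keep (True) or reject (False) based on the stored statistic.
        return sample["__dj__stats__"]["word_count"] >= self.min_words


class LowercaseMapper(Mapper):
    def process(self, sample):
        sample["text"] = sample["text"].lower()
        return sample


def run(samples, ops):
    """Apply each operator in order, dispatching on its type."""
    for op in ops:
        if isinstance(op, Filter):
            samples = [op.compute_stats(s) for s in samples]
            samples = [s for s in samples if op.process(s)]
        else:
            samples = [op.process(s) for s in samples]
    return samples


kept = run(
    [{"text": "Hello Big World"}, {"text": "Hi"}],
    [LowercaseMapper(), MinWordsFilter(min_words=3)],
)
```

Keeping the statistic on the sample is what lets the decision phase, and any later analysis, reuse the value without recomputing it.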
Step 2: Define Statistics Keys
If the new operator (especially a Filter) introduces new statistical metrics, register them as constants alongside the existing StatsKeys definitions. This keeps every metric name used by operators and analysis tools in one place.
Key considerations:
- Add new keys to data_juicer/utils/constant.py in the StatsKeysConstant class
- Use descriptive, lowercase, underscore-separated names
- Statistics are stored in the __dj__stats__ field of each sample
- This step is optional for Mappers that do not compute statistics
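As a rough illustration of the naming convention, the snippet below mimics a constants class and the way a statistic ends up under the `__dj__stats__` field. The class here is a local stand-in, not an import from `data_juicer/utils/constant.py`, and `medical_term_ratio` and `record_stat` are hypothetical names.

```python
class StatsKeysConstant:
    """Local stand-in for the constants class; one attribute per metric name."""

    # Hypothetical new key: lowercase, underscore-separated, descriptive.
    medical_term_ratio = "medical_term_ratio"


STATS_FIELD = "__dj__stats__"  # field name taken from the list above


def record_stat(sample: dict, key: str, value) -> dict:
    """Store one statistic under the sample's stats field."""
    sample.setdefault(STATS_FIELD, {})[key] = value
    return sample


sample = record_stat(
    {"text": "Take 5 mg daily"},
    StatsKeysConstant.medical_term_ratio,
    0.25,
)
```

Referring to the key through the constants class rather than a string literal means a typo fails at attribute lookup instead of silently creating a second metric.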
Step 3: Implement Operator Class
Create a new Python file in the appropriate operator subdirectory (for example, data_juicer/ops/filter/ for a Filter). The class must inherit from the correct base class, be decorated with @OPERATORS.register_module(name), and implement the required methods. For Filters, implement compute_stats_single and process_single. For Mappers, implement process_single or process_batched. Declare configurable parameters in the constructor with type hints so that they are integrated into the CLI automatically.
Key considerations:
- Use @OPERATORS.register_module('operator_name') for registry registration
- Constructor parameters become YAML configuration options automatically
- The text_key attribute provides access to the primary text field
- For batched processing, implement compute_stats_batched and process_batched
- Sample data is a dictionary with field names as keys
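The registration and constructor conventions above can be sketched as follows. To stay runnable on its own, the example defines a tiny `Registry` and `Filter` stand-in instead of importing Data-Juicer's `OPERATORS` and base class, so it shows the shape of the pattern rather than the framework's actual implementation. The operator name `medical_term_ratio_filter` and its parameters are hypothetical.

```python
class Registry:
    """Minimal stand-in for a name -> class registry."""

    def __init__(self, name: str):
        self.name = name
        self.modules = {}

    def register_module(self, module_name: str):
        def decorator(cls):
            self.modules[module_name] = cls
            return cls

        return decorator


OPERATORS = Registry("Operators")  # stand-in, not the real Data-Juicer object


class Filter:
    """Stand-in base class that only provides text_key."""

    def __init__(self, text_key: str = "text"):
        self.text_key = text_key


@OPERATORS.register_module("medical_term_ratio_filter")
class MedicalTermRatioFilter(Filter):
    def __init__(self, terms=("dose", "mg"), min_ratio: float = 0.1, **kwargs):
        super().__init__(**kwargs)
        self.terms = {t.lower() for t in terms}
        self.min_ratio = min_ratio

    def compute_stats_single(self, sample: dict) -> dict:
        stats = sample.setdefault("__dj__stats__", {})
        if "medical_term_ratio" in stats:
            return sample  # already computed, skip the work
        words = sample[self.text_key].lower().split()
        hits = sum(w in self.terms for w in words)
        stats["medical_term_ratio"] = hits / len(words) if words else 0.0
        return sample

    def process_single(self, sample: dict) -> bool:
        return sample["__dj__stats__"]["medical_term_ratio"] >= self.min_ratio


# Look the class up by its registered name, as a config loader would.
op = OPERATORS.modules["medical_term_ratio_filter"](min_ratio=0.5)
checked = op.compute_stats_single({"text": "Dose 5 mg daily"})
```

The lookup at the end mirrors why the registered name, rather than the class name, is what a pipeline configuration refers to.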
Step 4: Register in Package
Add the new operator class to the __init__.py file in its operator subdirectory. Import the class and add it to the __all__ list. This ensures the operator is discoverable when the package is loaded.
Key considerations:
- Import path follows the pattern from .filename import ClassName
- The __all__ list controls what is publicly exported
- For external operators, skip this step and use custom_operator_paths instead
Step 5: Add Dependency Mappings
If the operator requires additional Python packages beyond the core dependencies, add them to the operator dependency mapping. Data-Juicer supports per-operator isolated environments via uv for conflicting dependencies, and lazy loading to defer heavy imports.
Key considerations:
- Declare dependencies in the operator class or environment specification
- Use lazy loading (LazyLoader) for heavy optional dependencies
- Per-operator uv environments handle conflicting dependency versions
- The dj-install tool can install operator-specific dependencies
Step 6: Use in Pipeline Configuration
Add the operator to a YAML configuration file's process list using its registered name. Specify any custom parameters. Run the pipeline with dj-process --config config.yaml or programmatically via Python.
Key considerations:
- The registered name (not class name) is used in YAML configuration
- Parameters use the same names as constructor arguments
- External operators require custom_operator_paths in the config
- Test with a small dataset first before running on full data
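The layout of such a configuration can be sketched by generating the YAML text from Python. The `process` list and `custom_operator_paths` come from the text above. `dataset_path` and `export_path` are assumed to be the top-level keys for input and output, so check them against your Data-Juicer version. The operator names, parameters, and file paths are hypothetical, and `render_config` handles only simple scalar values.

```python
def render_config(dataset_path, export_path, ops, custom_operator_paths=()):
    """Render a minimal pipeline config as YAML text.

    `ops` is a list of (registered_name, params_dict) pairs.
    """
    lines = [
        f"dataset_path: {dataset_path}",
        f"export_path: {export_path}",
    ]
    if custom_operator_paths:
        lines.append("custom_operator_paths:")
        lines.extend(f"  - {path}" for path in custom_operator_paths)
    lines.append("process:")
    for name, params in ops:
        # Each entry is keyed by the registered operator name.
        lines.append(f"  - {name}:")
        for key, value in params.items():
            # Parameter names match the constructor arguments.
            lines.append(f"      {key}: {value}")
    return "\n".join(lines) + "\n"


config_text = render_config(
    dataset_path="data/sample.jsonl",
    export_path="out/result.jsonl",
    ops=[
        ("medical_term_ratio_filter", {"min_ratio": 0.1}),
        ("lowercase_mapper", {}),
    ],
    custom_operator_paths=["./my_ops/medical_term_ratio_filter.py"],
)
print(config_text)
```

Saving the printed text as config.yaml and running dj-process --config config.yaml on a small dataset is a quick way to confirm that the operator name and parameters are picked up.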
Step 7: Test and Validate
Write unit tests following Data-Juicer's testing conventions. Create test cases in the tests/ops/ directory that verify the operator processes samples correctly, handles edge cases, and integrates with the pipeline execution framework.
Key considerations:
- Follow the existing test patterns in tests/ops/mapper/ or tests/ops/filter/
- Test both single-sample and batched processing modes
- Verify statistics computation for Filter operators
- Test edge cases: empty text, missing fields, unicode content
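The checks listed above might look like the following when written with the standard `unittest` module. The operator under test is a self-contained stand-in rather than a real Data-Juicer class, the dict-of-lists batch layout is an assumption, and a real test would normally build on the project's own test base class and helpers in tests/ops/.

```python
import unittest

STATS_FIELD = "__dj__stats__"


class WordCountFilter:
    """Stand-in operator used only to demonstrate the test structure."""

    def __init__(self, min_words: int = 2, text_key: str = "text"):
        self.min_words = min_words
        self.text_key = text_key

    def compute_stats_single(self, sample: dict) -> dict:
        text = sample.get(self.text_key) or ""  # tolerate missing or None
        sample.setdefault(STATS_FIELD, {})["word_count"] = len(text.split())
        return sample

    def process_single(self, sample: dict) -> bool:
        return sample[STATS_FIELD]["word_count"] >= self.min_words

    def compute_stats_batched(self, batch: dict) -> dict:
        # Assumed batch layout: one list per column.
        n = len(batch[self.text_key])
        rows = [{k: v[i] for k, v in batch.items()} for i in range(n)]
        rows = [self.compute_stats_single(r) for r in rows]
        return {k: [r[k] for r in rows] for k in rows[0]} if rows else batch


class WordCountFilterTest(unittest.TestCase):
    def setUp(self):
        self.op = WordCountFilter(min_words=2)

    def _keep(self, sample):
        return self.op.process_single(self.op.compute_stats_single(sample))

    def test_statistics_are_recorded(self):
        sample = self.op.compute_stats_single({"text": "one two three"})
        self.assertEqual(sample[STATS_FIELD]["word_count"], 3)

    def test_empty_text_is_rejected(self):
        self.assertFalse(self._keep({"text": ""}))

    def test_missing_field_is_rejected(self):
        self.assertFalse(self._keep({}))

    def test_unicode_text_is_kept(self):
        self.assertTrue(self._keep({"text": "naïve café"}))

    def test_batched_matches_single(self):
        texts = ["one two", "one"]
        single = [
            self.op.compute_stats_single({"text": t})[STATS_FIELD] for t in texts
        ]
        batched = self.op.compute_stats_batched({"text": texts})[STATS_FIELD]
        self.assertEqual(single, batched)
```

Saved as a test module, this can be run with python -m unittest. Comparing the batched output against the single-sample output is a cheap guard against the two code paths drifting apart.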