Workflow:Datajuicer Data juicer Custom Operator Development

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, Software_Development, LLM_Ops
Last Updated: 2026-02-14 16:00 GMT

Overview

End-to-end process for building, registering, and deploying a custom data processing operator within the Data-Juicer framework.

Description

This workflow guides developers through creating a new operator, such as a Filter, Mapper, Deduplicator, or Selector, for Data-Juicer. Operators are the fundamental processing units in Data-Juicer; each is responsible for a single data transformation or quality check. The framework provides base classes with standardized interfaces, a registry system for automatic discovery, and support for both single-sample and batched processing modes. Custom operators can be added directly to the source tree or loaded externally via the custom_operator_paths configuration option. The workflow covers defining statistics keys, implementing the operator class, registering it, adding it to the package exports, declaring its dependencies, using it in a pipeline configuration, and testing it.

Usage

Execute this workflow when the 200+ built-in operators do not cover your specific data processing need. Common scenarios include domain-specific text filters (e.g., medical terminology validation), custom quality metrics, specialized data transformations, and integration with external APIs or models for data annotation.

Execution Steps

Step 1: Identify Operator Type

Determine which base class to inherit from based on the operation semantics. Filter operators compute statistics and make keep/reject decisions per sample. Mapper operators transform sample content (text, images, metadata). Deduplicator operators identify and remove duplicate samples. Selector operators choose a subset based on ranking criteria. Grouper operators partition samples into batches. Aggregator operators combine information across samples.

Key considerations:

  • Filter: Two-phase (compute_stats + process) for statistical quality gating
  • Mapper: Single-phase transformation of sample fields
  • Deduplicator: Requires hash computation and duplicate resolution
  • Selector: Operates on the entire dataset to choose subsets
  • Check the existing operator zoo first to avoid duplicating functionality
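
The difference between the operator types can be sketched with plain functions. This is a minimal illustration, not the Data-Juicer base classes: every function name, the "stats" key, and the sample data are hypothetical stand-ins that only mirror the semantics described above.

```python
import hashlib

# Hypothetical stand-ins for the operator semantics; none of these
# functions come from Data-Juicer itself.

def map_sample(sample):
    """Mapper: single-phase transformation of a sample field."""
    return {**sample, "text": sample["text"].strip()}

def compute_stats(sample):
    """Filter, phase 1: attach a statistic to the sample."""
    return {**sample, "stats": {"text_len": len(sample["text"])}}

def keep_sample(sample, min_len=3):
    """Filter, phase 2: keep/reject decision based on the statistic."""
    return sample["stats"]["text_len"] >= min_len

def deduplicate(samples):
    """Deduplicator: hash each sample and keep the first occurrence."""
    seen, unique = set(), []
    for s in samples:
        digest = hashlib.sha256(s["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique

def select_top_k(samples, k):
    """Selector: rank the whole dataset and choose a subset."""
    ranked = sorted(samples, key=lambda s: s["stats"]["text_len"], reverse=True)
    return ranked[:k]

raw = [
    {"text": " hello world "},
    {"text": "hi"},
    {"text": "hello world"},
    {"text": "data juicer ops"},
]
mapped = [map_sample(s) for s in raw]
scored = [compute_stats(s) for s in mapped]
kept = [s for s in scored if keep_sample(s)]
unique = deduplicate(kept)
top = select_top_k(unique, k=1)
```

Note that the Mapper and Filter functions look at one sample at a time, while the Deduplicator and Selector need to see the whole collection.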

Step 2: Define Statistics Keys

If the new operator (especially a Filter) introduces new statistical metrics, register their names as constants so that they are accessible through the StatsKeys class. This keeps all metric names used across operators and analysis tools in one place.

Key considerations:

  • Add new keys to data_juicer/utils/constant.py in the StatsKeysConstant class
  • Use descriptive, lowercase, underscore-separated names
  • Statistics are stored in the __dj__stats__ field of each sample
  • This step is optional for Mappers that do not compute statistics
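
As a rough sketch of what this step amounts to, the snippet below defines a local stand-in for the constants class and writes values under the statistics field of a sample. The key medical_term_ratio and the helper record_stat are hypothetical; the real class in data_juicer/utils/constant.py contains many more keys and additional machinery that is not reproduced here.

```python
# Local stand-in, NOT an import from data_juicer: it only mirrors the idea
# of keeping metric names as class-level string constants.
class StatsKeysConstant:
    text_len = "text_len"                      # illustrative existing-style key
    medical_term_ratio = "medical_term_ratio"  # hypothetical new key

# Field name taken from the text above.
STATS_FIELD = "__dj__stats__"

def record_stat(sample, key, value):
    """Hypothetical helper: store one statistic under the stats field."""
    sample.setdefault(STATS_FIELD, {})[key] = value
    return sample

sample = {"text": "aspirin reduces fever"}
sample = record_stat(sample, StatsKeysConstant.medical_term_ratio, 2 / 3)
sample = record_stat(sample, StatsKeysConstant.text_len, len(sample["text"]))
```

Referring to the constant rather than a string literal means a misspelled metric name fails with an AttributeError instead of silently creating a second key.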

Step 3: Implement Operator Class

Create a new Python file in the appropriate operator subdirectory. The class must inherit from the correct base class, be decorated with @OPERATORS.register_module(name), and implement the required methods. For Filters, implement compute_stats_single and process_single. For Mappers, implement process_single or process_batched. The constructor defines configurable parameters with type hints for automatic CLI integration.

Key considerations:

  • Use @OPERATORS.register_module('operator_name') for registry registration
  • Constructor parameters become YAML configuration options automatically
  • The text_key attribute provides access to the primary text field
  • For batched processing, implement compute_stats_batched and process_batched
  • Sample data is a dictionary with field names as keys
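
The shape of a Filter can be sketched as follows. To stay runnable without Data-Juicer installed, the snippet defines minimal stand-ins for the registry and the Filter base class; in a real operator these would be imported from the framework, and the real base class does considerably more. The operator name, its parameters, and the term list are hypothetical.

```python
STATS_FIELD = "__dj__stats__"  # stats field name from the text above

class Registry:
    """Stand-in registry: maps registered names to operator classes."""
    def __init__(self):
        self.modules = {}

    def register_module(self, name):
        def decorator(cls):
            self.modules[name] = cls
            return cls
        return decorator

OPERATORS = Registry()  # stand-in for the framework's OPERATORS object

class Filter:
    """Stand-in base class exposing only the text_key attribute."""
    def __init__(self, text_key: str = "text"):
        self.text_key = text_key

@OPERATORS.register_module("medical_term_ratio_filter")  # hypothetical name
class MedicalTermRatioFilter(Filter):
    def __init__(self, min_ratio: float = 0.1,
                 terms: tuple = ("aspirin", "fever"),
                 text_key: str = "text"):
        super().__init__(text_key=text_key)
        self.min_ratio = min_ratio
        self.terms = set(terms)

    def compute_stats_single(self, sample: dict) -> dict:
        stats = sample.setdefault(STATS_FIELD, {})
        if "medical_term_ratio" in stats:  # already computed, skip
            return sample
        words = sample[self.text_key].lower().split()
        hits = sum(w in self.terms for w in words)
        stats["medical_term_ratio"] = hits / len(words) if words else 0.0
        return sample

    def process_single(self, sample: dict) -> bool:
        # True keeps the sample, False rejects it.
        return sample[STATS_FIELD]["medical_term_ratio"] >= self.min_ratio

op = OPERATORS.modules["medical_term_ratio_filter"](min_ratio=0.5)
scored = op.compute_stats_single({"text": "Aspirin reduces fever"})
decision = op.process_single(scored)
empty_decision = op.process_single(op.compute_stats_single({"text": ""}))
```

Separating the statistic from the decision means the same computed value can be inspected or reused with a different threshold without recomputing it.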

Step 4: Register in Package

Add the new operator class to the __init__.py file in its operator subdirectory. Import the class and add it to the __all__ list. This ensures the operator is discoverable when the package is loaded.

Key considerations:

  • Import path follows the pattern from .filename import ClassName
  • The __all__ list controls what is publicly exported
  • For external operators, skip this step and use custom_operator_paths instead
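
The import-and-export pattern can be tried in isolation. The sketch below writes a throwaway package to a temporary directory and imports it; the package name demo_filter_pkg, the module name, and the class name are all hypothetical. Only the two lines written to __init__.py correspond to what this step adds in a real operator subdirectory.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_filter_pkg"  # hypothetical package name
    pkg.mkdir()

    # The operator module: one file per operator.
    (pkg / "medical_term_filter.py").write_text(
        "class MedicalTermFilter:\n"
        "    pass\n"
    )

    # The package __init__.py: import the class and list it in __all__.
    (pkg / "__init__.py").write_text(
        "from .medical_term_filter import MedicalTermFilter\n"
        "\n"
        "__all__ = ['MedicalTermFilter']\n"
    )

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        demo = importlib.import_module("demo_filter_pkg")
    finally:
        sys.path.remove(tmp)

exported = demo.__all__
operator_class = demo.MedicalTermFilter
```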

Step 5: Add Dependency Mappings

If the operator requires additional Python packages beyond the core dependencies, add them to the operator dependency mapping. Data-Juicer supports per-operator isolated environments via uv for conflicting dependencies, and lazy loading to defer heavy imports.

Key considerations:

  • Declare dependencies in the operator class or environment specification
  • Use lazy loading (LazyLoader) for heavy optional dependencies
  • Per-operator uv environments handle conflicting dependency versions
  • The dj-install tool can install operator-specific dependencies
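
The idea behind lazy loading is to postpone an import until the first attribute access, so that loading the operator module stays cheap. The class below is a generic standard-library sketch of that idea; it is not Data-Juicer's LazyLoader, whose interface is not shown in this document, and the name LazyModule is hypothetical.

```python
import importlib

class LazyModule:
    """Hypothetical proxy that imports a module on first attribute access."""

    def __init__(self, module_name: str):
        self._module_name = module_name
        self._module = None  # nothing imported yet

    def _load(self):
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return self._module

    def __getattr__(self, attr):
        # Called only when normal lookup fails, i.e. for module attributes.
        return getattr(self._load(), attr)

# "json" stands in for a heavy optional dependency.
lazy_json = LazyModule("json")
loaded_before_use = lazy_json._module is not None
encoded = lazy_json.dumps({"a": 1})
loaded_after_use = lazy_json._module is not None
```

With this pattern a missing optional package only raises an error when the operator that needs it is actually used.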

Step 6: Use in Pipeline Configuration

Add the operator to a YAML configuration file's process list using its registered name. Specify any custom parameters. Run the pipeline with dj-process --config config.yaml or programmatically via Python.

Key considerations:

  • The registered name (not class name) is used in YAML configuration
  • Parameters use the same names as constructor arguments
  • External operators require custom_operator_paths in the config
  • Test with a small dataset first before running on full data
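
The relationship between a process list and operator constructors can be sketched in plain Python. The dictionary below mirrors what a parsed YAML file might look like; the operator names, classes, paths, and the build_ops helper are hypothetical, and the real framework performs this lookup through its own registry and config parser rather than a plain dict.

```python
class WhitespaceStripMapper:
    """Hypothetical operator with no required parameters."""
    def __init__(self, text_key: str = "text"):
        self.text_key = text_key

class MinWordsFilter:
    """Hypothetical operator with one configurable parameter."""
    def __init__(self, min_words: int = 1, text_key: str = "text"):
        self.min_words = min_words
        self.text_key = text_key

# Stand-in for the operator registry: registered name -> class.
REGISTRY = {
    "whitespace_strip_mapper": WhitespaceStripMapper,
    "min_words_filter": MinWordsFilter,
}

# Parsed-config stand-in; paths and values are placeholders.
config = {
    "dataset_path": "data/sample.jsonl",
    "export_path": "out/processed.jsonl",
    "process": [
        {"whitespace_strip_mapper": None},      # no parameters given
        {"min_words_filter": {"min_words": 3}},  # same name as the __init__ argument
    ],
}

def build_ops(process_list):
    """Instantiate each operator by registered name with its parameters."""
    ops = []
    for entry in process_list:
        name, params = next(iter(entry.items()))
        ops.append(REGISTRY[name](**(params or {})))
    return ops

ops = build_ops(config["process"])
```

Because the keys are passed straight to the constructor, a misspelled parameter name surfaces as a TypeError at build time rather than being silently ignored in this sketch.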

Step 7: Test and Validate

Write unit tests following Data-Juicer's testing conventions. Create test cases in the tests/ops/ directory that verify the operator processes samples correctly, handles edge cases, and integrates with the pipeline execution framework.

Key considerations:

  • Follow the existing test patterns in tests/ops/mapper/ or tests/ops/filter/
  • Test both single-sample and batched processing modes
  • Verify statistics computation for Filter operators
  • Test edge cases: empty text, missing fields, unicode content
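
A test module along these lines might look like the sketch below. It uses plain unittest and a hypothetical stand-in mapper so that it runs without Data-Juicer; a real test would import the operator under test and follow the base classes and helpers used in the existing tests/ops/ files. The batched layout (one dictionary whose values are lists) is an assumption made for this sketch.

```python
import unittest

class WhitespaceNormalizeMapper:
    """Hypothetical mapper: collapses runs of whitespace in the text field."""

    def __init__(self, text_key: str = "text"):
        self.text_key = text_key

    def process_single(self, sample: dict) -> dict:
        sample[self.text_key] = " ".join(sample[self.text_key].split())
        return sample

    def process_batched(self, samples: dict) -> dict:
        # Assumed batch layout: field name -> list of values.
        samples[self.text_key] = [
            " ".join(text.split()) for text in samples[self.text_key]
        ]
        return samples

class WhitespaceNormalizeMapperTest(unittest.TestCase):
    def setUp(self):
        self.op = WhitespaceNormalizeMapper()

    def test_single(self):
        out = self.op.process_single({"text": "  a   b "})
        self.assertEqual(out["text"], "a b")

    def test_batched(self):
        out = self.op.process_batched({"text": ["a  b", "c   d"]})
        self.assertEqual(out["text"], ["a b", "c d"])

    def test_empty_text(self):
        self.assertEqual(self.op.process_single({"text": ""})["text"], "")

    def test_unicode(self):
        out = self.op.process_single({"text": "héllo   wörld"})
        self.assertEqual(out["text"], "héllo wörld")

    def test_missing_field(self):
        # This stand-in does not guard against a missing text field.
        with self.assertRaises(KeyError):
            self.op.process_single({})

suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    WhitespaceNormalizeMapperTest
)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The missing-field test documents the current behavior of the stand-in; whether a real operator should raise or skip such samples is a design decision for the operator author.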
