Principle:Datajuicer Data juicer Operator Type Selection
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Software_Architecture |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A class-hierarchy pattern that defines a distinct operator type for each data processing goal, each with its own prescribed abstract methods.
Description
Operator Type Selection is the design decision of choosing which base class to inherit from when building a custom data processing operator. Data-Juicer defines four primary operator types, each with a specific contract:
- Mapper: transforms samples 1:1; implement process_single or process_batched.
- Filter: keeps or removes samples based on computed statistics; implement compute_stats and process_single.
- Deduplicator: removes duplicate samples; implement compute_hash and process.
- Selector: selects a subset of samples; implement process at the dataset level.
Each type enforces a different processing paradigm through its abstract methods.
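To make the two sample-level contracts concrete, here is a minimal, self-contained sketch. The base classes are stand-ins written for this example and do not import Data-Juicer. `LowercaseMapper`, `MinLengthFilter`, and the `text_len` stats field are hypothetical names, and the real library's signatures may differ.

```python
from abc import ABC, abstractmethod


class OP(ABC):
    """Stand-in base class for this sketch, not Data-Juicer's actual OP."""

    def __init__(self, text_key: str = "text"):
        self.text_key = text_key


class Mapper(OP):
    @abstractmethod
    def process_single(self, sample: dict) -> dict:
        """Return the transformed sample (one in, one out)."""


class Filter(OP):
    @abstractmethod
    def compute_stats(self, sample: dict) -> dict:
        """Return the sample with the statistics the decision needs."""

    @abstractmethod
    def process_single(self, sample: dict) -> bool:
        """Return True to keep the sample, False to drop it."""


class LowercaseMapper(Mapper):  # hypothetical example operator
    def process_single(self, sample):
        return {**sample, self.text_key: sample[self.text_key].lower()}


class MinLengthFilter(Filter):  # hypothetical example operator
    def __init__(self, min_len: int, text_key: str = "text"):
        super().__init__(text_key)
        self.min_len = min_len

    def compute_stats(self, sample):
        # "text_len" is an assumed field name for this sketch.
        return {**sample, "text_len": len(sample[self.text_key])}

    def process_single(self, sample):
        return sample["text_len"] >= self.min_len


samples = [{"text": "Hello World"}, {"text": "Hi"}]
mapper = LowercaseMapper()
flt = MinLengthFilter(min_len=5)

mapped = [mapper.process_single(s) for s in samples]  # same count as input
with_stats = [flt.compute_stats(s) for s in mapped]
kept = [s for s in with_stats if flt.process_single(s)]
```

The Mapper returns exactly one sample per input, while the Filter splits its work into a statistics step and a boolean decision, which is the distinction the contracts above describe.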
Usage
Use this principle as the first step when developing a custom operator. The choice of base class determines which abstract methods must be implemented and how the operator interacts with the pipeline execution engine.
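The following sketch illustrates how a base class can dictate what a subclass must implement. It assumes the contract is enforced with Python's `abc` module, which this page does not confirm for Data-Juicer itself. `Filter` here is a stand-in and `IncompleteFilter` is a hypothetical operator.

```python
from abc import ABC, abstractmethod


class Filter(ABC):  # stand-in for this sketch, not Data-Juicer's class
    @abstractmethod
    def compute_stats(self, sample: dict) -> dict: ...

    @abstractmethod
    def process_single(self, sample: dict) -> bool: ...


class IncompleteFilter(Filter):  # hypothetical operator
    # Only one of the two required methods is implemented.
    def compute_stats(self, sample):
        return sample


# Instantiating a subclass with unimplemented abstract methods fails early.
try:
    IncompleteFilter()
    error = None
except TypeError as exc:
    error = str(exc)

# Names of the methods that still need an implementation.
missing = sorted(IncompleteFilter.__abstractmethods__)
```

Under this assumption, picking the wrong base class or forgetting a method surfaces at construction time rather than in the middle of a pipeline run.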
Theoretical Basis
# Abstract type hierarchy (NOT the real implementation)
class OP: ...                # Base: common config (text_key, num_proc, batch_size)
class Mapper(OP): ...        # 1:1 transform: process_single(sample) -> sample
class Filter(OP): ...        # Keep/remove: compute_stats(sample) + process_single(sample) -> bool
class Deduplicator(OP): ...  # Remove dups: compute_hash(sample) + process(dataset) -> dataset
class Selector(OP): ...      # Subset: process(dataset) -> dataset
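The two dataset-level types in the hierarchy above can be sketched in the same spirit. This is an illustration under stated assumptions: the base classes are stand-ins, a plain list of dicts stands in for the dataset, and `ExactTextDeduplicator` and `TopKLongestSelector` are hypothetical operators rather than ones shipped with Data-Juicer.

```python
import hashlib
from abc import ABC, abstractmethod


class OP(ABC):
    """Stand-in base class for this sketch."""

    def __init__(self, text_key: str = "text"):
        self.text_key = text_key


class Deduplicator(OP):
    @abstractmethod
    def compute_hash(self, sample: dict) -> str: ...

    @abstractmethod
    def process(self, dataset: list) -> list: ...


class Selector(OP):
    @abstractmethod
    def process(self, dataset: list) -> list: ...


class ExactTextDeduplicator(Deduplicator):  # hypothetical operator
    def compute_hash(self, sample):
        text = sample[self.text_key]
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def process(self, dataset):
        # Keep the first sample seen for each hash value.
        seen, unique = set(), []
        for sample in dataset:
            digest = self.compute_hash(sample)
            if digest not in seen:
                seen.add(digest)
                unique.append(sample)
        return unique


class TopKLongestSelector(Selector):  # hypothetical operator
    def __init__(self, k: int, text_key: str = "text"):
        super().__init__(text_key)
        self.k = k

    def process(self, dataset):
        # Needs the whole dataset at once to rank samples against each other.
        ranked = sorted(dataset, key=lambda s: len(s[self.text_key]), reverse=True)
        return ranked[: self.k]


dataset = [{"text": "a"}, {"text": "bbb"}, {"text": "a"}, {"text": "cc"}]
deduped = ExactTextDeduplicator().process(dataset)
top2 = TopKLongestSelector(k=2).process(deduped)
```

Both operators need to see more than one sample to decide anything, which is why their contract centers on `process(dataset)` rather than a per-sample method.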
Decision criteria:
- Transform data (add/modify fields) -> Mapper
- Quality gate (keep/remove by metric) -> Filter
- Remove duplicates -> Deduplicator
- Select subset (top-k, random sample) -> Selector
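The decision criteria above amount to a small lookup from goal to base class. The sketch below encodes that lookup directly. The `Goal` enum, the `CONTRACTS` table, and `base_class_for` are hypothetical helpers for illustration and are not part of Data-Juicer.

```python
from enum import Enum


class Goal(Enum):  # hypothetical labels for the four goals in the list above
    TRANSFORM = "transform"
    QUALITY_GATE = "quality_gate"
    REMOVE_DUPLICATES = "remove_duplicates"
    SELECT_SUBSET = "select_subset"


# Base class name and the methods this page says to implement for it.
CONTRACTS = {
    Goal.TRANSFORM: ("Mapper", ("process_single or process_batched",)),
    Goal.QUALITY_GATE: ("Filter", ("compute_stats", "process_single")),
    Goal.REMOVE_DUPLICATES: ("Deduplicator", ("compute_hash", "process")),
    Goal.SELECT_SUBSET: ("Selector", ("process",)),
}


def base_class_for(goal: Goal) -> str:
    """Return the name of the base class to inherit from for a given goal."""
    return CONTRACTS[goal][0]


def required_methods(goal: Goal) -> tuple:
    """Return the methods the chosen base class expects to be implemented."""
    return CONTRACTS[goal][1]
```

For example, `base_class_for(Goal.QUALITY_GATE)` returns `"Filter"`, and `required_methods(Goal.QUALITY_GATE)` lists the two methods that choice commits the operator to.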