Principle:Datajuicer Data juicer Operator Type Selection

Knowledge Sources	Data-Juicer
Domains	Data_Engineering, Software_Architecture
Last Updated	2026-02-14 17:00 GMT

Overview

A class hierarchy pattern that defines a distinct operator type for each data processing goal, where each type prescribes its own abstract methods.

Description

Operator Type Selection is the design decision of which base class to inherit from when building a custom data processing operator. Data-Juicer defines four primary operator types, each with a specific contract:

Mapper: transforms samples 1:1; implement process_single or process_batched.
Filter: keeps or removes samples based on computed statistics; implement compute_stats and process_single.
Deduplicator: removes duplicate samples; implement compute_hash and process.
Selector: selects a subset of samples; implement process at the dataset level.

Each type enforces a different processing paradigm through the abstract methods it requires.

Usage

Use this principle as the first step when developing a custom operator. The choice of base class determines which abstract methods must be implemented and how the operator interacts with the pipeline execution engine.
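
To make the link between base class and engine behaviour concrete, here is a minimal sketch of a toy engine that dispatches on operator type. Everything in it is an assumption made for illustration: the stand-in OP, Mapper, Filter and Selector classes only mimic the contracts described on this page, and run_op, LowercaseMapper, MinLengthFilter and LongestKSelector are hypothetical names, not part of Data-Juicer.

```python
# Stand-in base classes for illustration only. They mimic the contracts
# described on this page and are NOT Data-Juicer's real classes.
class OP:
    def __init__(self, text_key="text"):
        self.text_key = text_key


class Mapper(OP):
    def process_single(self, sample):
        raise NotImplementedError


class Filter(OP):
    def compute_stats(self, sample):
        raise NotImplementedError

    def process_single(self, sample):
        raise NotImplementedError


class Selector(OP):
    def process(self, dataset):
        raise NotImplementedError


class LowercaseMapper(Mapper):  # hypothetical operator
    def process_single(self, sample):
        return {**sample, self.text_key: sample[self.text_key].lower()}


class MinLengthFilter(Filter):  # hypothetical operator
    def __init__(self, min_len=5, **kwargs):
        super().__init__(**kwargs)
        self.min_len = min_len

    def compute_stats(self, sample):
        # Record the metric on the sample so the keep/remove step can read it.
        return {**sample, "stats": {"text_len": len(sample[self.text_key])}}

    def process_single(self, sample):
        return sample["stats"]["text_len"] >= self.min_len


class LongestKSelector(Selector):  # hypothetical operator
    def __init__(self, k=1, **kwargs):
        super().__init__(**kwargs)
        self.k = k

    def process(self, dataset):
        ranked = sorted(dataset, key=lambda s: len(s[self.text_key]), reverse=True)
        return ranked[: self.k]


def run_op(op, dataset):
    """Toy engine: the operator's base class decides how it is invoked."""
    if isinstance(op, Mapper):
        return [op.process_single(s) for s in dataset]
    if isinstance(op, Filter):
        with_stats = [op.compute_stats(s) for s in dataset]
        return [s for s in with_stats if op.process_single(s)]
    if isinstance(op, Selector):
        return op.process(dataset)
    raise TypeError(f"unsupported operator type: {type(op).__name__}")


dataset = [{"text": "Hello World"}, {"text": "Hi"}, {"text": "Data-Juicer operators"}]
for op in (LowercaseMapper(), MinLengthFilter(min_len=5), LongestKSelector(k=1)):
    dataset = run_op(op, dataset)
print(dataset)
```

The Mapper and Filter are called once per sample, while the Selector receives the whole dataset, which is why the base class has to be chosen before any method bodies are written.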

Theoretical Basis

# Abstract type hierarchy (illustrative sketch, NOT the real implementation)
class OP: ...                # Base: common config (text_key, num_proc, batch_size)
class Mapper(OP): ...        # 1:1 transform: process_single(sample) -> sample
class Filter(OP): ...        # Keep/remove: compute_stats(sample) + process_single(sample) -> bool
class Deduplicator(OP): ...  # Remove dups: compute_hash(sample) + process(dataset) -> dataset
class Selector(OP): ...      # Subset: process(dataset) -> dataset
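
One way such a hierarchy can enforce its contract is Python's abc module, sketched below for the Deduplicator type. This is an assumption for illustration only: Data-Juicer's own base classes may enforce the contract differently (for example by raising NotImplementedError), and IncompleteDedup and ExactTextDedup are hypothetical subclasses.

```python
import hashlib
from abc import ABC, abstractmethod


class OP(ABC):
    """Stand-in base class holding the common config; not Data-Juicer's OP."""

    def __init__(self, text_key="text", num_proc=1, batch_size=1000):
        self.text_key = text_key
        self.num_proc = num_proc
        self.batch_size = batch_size


class Deduplicator(OP):
    @abstractmethod
    def compute_hash(self, sample):
        """Return the sample with a hash value attached."""

    @abstractmethod
    def process(self, dataset):
        """Return the dataset with duplicates removed."""


class IncompleteDedup(Deduplicator):  # hypothetical: forgets to implement process
    def compute_hash(self, sample):
        return {**sample, "hash": sample[self.text_key]}


class ExactTextDedup(Deduplicator):  # hypothetical: exact-match deduplication
    def compute_hash(self, sample):
        digest = hashlib.md5(sample[self.text_key].encode("utf-8")).hexdigest()
        return {**sample, "hash": digest}

    def process(self, dataset):
        seen, kept = set(), []
        for sample in dataset:
            if sample["hash"] not in seen:  # keep the first occurrence only
                seen.add(sample["hash"])
                kept.append(sample)
        return kept


# The incomplete subclass cannot even be instantiated.
try:
    IncompleteDedup()
    incomplete_error = None
except TypeError as exc:
    incomplete_error = str(exc)

op = ExactTextDedup()
hashed = [op.compute_hash(s) for s in [{"text": "a"}, {"text": "b"}, {"text": "a"}]]
deduped = op.process(hashed)
print(incomplete_error)
print([s["text"] for s in deduped])
```

With this approach a missing method surfaces when the operator is constructed rather than partway through a pipeline run.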

Decision criteria:

Transform data (add/modify fields) -> Mapper
Quality gate (keep/remove by metric) -> Filter
Remove duplicates -> Deduplicator
Select subset (top-k, random sample) -> Selector
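
The decision criteria above can be written down as a small lookup table, which is sometimes handy in scaffolding scripts or code review checklists. The Goal enum, BASE_CLASS_FOR_GOAL and choose_base_class are hypothetical names introduced here; only the base class and method names come from this page.

```python
from enum import Enum


class Goal(Enum):  # hypothetical labels for the four processing goals
    TRANSFORM = "transform data (add/modify fields)"
    QUALITY_GATE = "quality gate (keep/remove by metric)"
    REMOVE_DUPLICATES = "remove duplicates"
    SELECT_SUBSET = "select subset (top-k, random sample)"


# Goal -> (base class name, methods the subclass is expected to implement).
BASE_CLASS_FOR_GOAL = {
    # A Mapper may implement process_batched instead of process_single.
    Goal.TRANSFORM: ("Mapper", ("process_single",)),
    Goal.QUALITY_GATE: ("Filter", ("compute_stats", "process_single")),
    Goal.REMOVE_DUPLICATES: ("Deduplicator", ("compute_hash", "process")),
    Goal.SELECT_SUBSET: ("Selector", ("process",)),
}


def choose_base_class(goal):
    """Return the base class name and required methods for a processing goal."""
    return BASE_CLASS_FOR_GOAL[goal]


base, methods = choose_base_class(Goal.QUALITY_GATE)
print(f"Inherit from {base}; implement {', '.join(methods)}")
```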

Related Pages

Implemented By

Implementation:Datajuicer_Data_juicer_Operator_Base_Classes
