Principle:Datajuicer Data juicer Operator Type Selection

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Software_Architecture
Last Updated 2026-02-14 17:00 GMT

Overview

A class hierarchy pattern in which each operator type serves a distinct data processing goal and prescribes the abstract methods its subclasses must implement.

Description

Operator Type Selection is the design decision of choosing which base class to inherit from when building a custom data processing operator. Data-Juicer defines four primary operator types, each with a specific contract:

  • Mapper: transforms samples 1:1; implement process_single or process_batched.
  • Filter: keeps or removes samples based on computed statistics; implement compute_stats and process_single.
  • Deduplicator: removes duplicate samples; implement compute_hash and process.
  • Selector: selects a subset of samples; implement process, which operates at the dataset level.

Each type enforces a different processing paradigm through its abstract methods.

Usage

Use this principle as the first step when developing a custom operator. The choice of base class determines which abstract methods must be implemented and how the operator interacts with the pipeline execution engine.
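To illustrate how the choice of base class fixes the methods a subclass must provide, here is a minimal sketch built on Python's standard abc module. The Filter class below is a hypothetical stand-in written for this example, not Data-Juicer's own base class, and the num_words field, IncompleteFilter, WordCountFilter, and try_instantiate names are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class Filter(ABC):
    """Hypothetical stand-in for a Filter base class (not Data-Juicer's own)."""

    @abstractmethod
    def compute_stats(self, sample: dict) -> dict:
        """Attach the statistics needed for the keep/remove decision."""

    @abstractmethod
    def process_single(self, sample: dict) -> bool:
        """Return True to keep the sample, False to drop it."""


class IncompleteFilter(Filter):
    """Implements only half of the contract, so it stays abstract."""

    def compute_stats(self, sample: dict) -> dict:
        return {**sample, "num_words": len(sample["text"].split())}


class WordCountFilter(Filter):
    """Implements both required methods, so it can be instantiated."""

    def __init__(self, min_words: int = 3):
        self.min_words = min_words

    def compute_stats(self, sample: dict) -> dict:
        # "num_words" is an illustrative field name chosen for this sketch.
        return {**sample, "num_words": len(sample["text"].split())}

    def process_single(self, sample: dict) -> bool:
        return sample["num_words"] >= self.min_words


def try_instantiate(cls) -> str:
    """Return 'ok' if cls can be instantiated, else the error type name."""
    try:
        cls()
        return "ok"
    except TypeError as exc:
        return type(exc).__name__
```

In this sketch, try_instantiate(IncompleteFilter) reports a TypeError because process_single is missing, while WordCountFilter satisfies the whole contract and can be used to keep or drop samples.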

Theoretical Basis

# Abstract type hierarchy (illustrative sketch, NOT the real implementation)
class OP: ...                # Base: common config (text_key, num_proc, batch_size)
class Mapper(OP): ...        # 1:1 transform: process_single(sample) -> sample
class Filter(OP): ...        # Keep/remove: compute_stats(sample) + process_single(sample) -> bool
class Deduplicator(OP): ...  # Remove dups: compute_hash(sample) + process(dataset) -> dataset
class Selector(OP): ...      # Subset: process(dataset) -> dataset
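The hierarchy above can be made concrete with a small, self-contained sketch of how an execution engine might drive each operator type differently. Everything here is an assumption for illustration: the base classes are empty stand-ins rather than Data-Juicer's classes, the dataset is a plain list of dicts, and run_op, run_pipeline, and the four concrete operators are hypothetical names.

```python
import hashlib


# Hypothetical stand-ins for the base types; not Data-Juicer's classes.
class OP: ...
class Mapper(OP): ...
class Filter(OP): ...
class Deduplicator(OP): ...
class Selector(OP): ...


class UpperCaseMapper(Mapper):
    def process_single(self, sample):
        # 1:1 transform: returns a modified copy of the sample.
        return {**sample, "text": sample["text"].upper()}


class MinLengthFilter(Filter):
    def __init__(self, min_len):
        self.min_len = min_len

    def compute_stats(self, sample):
        return {**sample, "length": len(sample["text"])}

    def process_single(self, sample):
        return sample["length"] >= self.min_len


class ExactTextDeduplicator(Deduplicator):
    def compute_hash(self, sample):
        digest = hashlib.sha256(sample["text"].encode("utf-8")).hexdigest()
        return {**sample, "hash": digest}

    def process(self, dataset):
        # Keep the first sample seen for each hash value.
        seen, kept = set(), []
        for sample in dataset:
            if sample["hash"] not in seen:
                seen.add(sample["hash"])
                kept.append(sample)
        return kept


class LongestKSelector(Selector):
    def __init__(self, k):
        self.k = k

    def process(self, dataset):
        ranked = sorted(dataset, key=lambda s: len(s["text"]), reverse=True)
        return ranked[: self.k]


def run_op(op, dataset):
    """Apply one operator, using the calling convention of its type."""
    if isinstance(op, Mapper):
        return [op.process_single(s) for s in dataset]
    if isinstance(op, Filter):
        with_stats = [op.compute_stats(s) for s in dataset]
        return [s for s in with_stats if op.process_single(s)]
    if isinstance(op, Deduplicator):
        return op.process([op.compute_hash(s) for s in dataset])
    if isinstance(op, Selector):
        return op.process(dataset)
    raise TypeError(f"unsupported operator type: {type(op).__name__}")


def run_pipeline(ops, dataset):
    """Run the operators in order, feeding each one the previous output."""
    for op in ops:
        dataset = run_op(op, dataset)
    return dataset
```

The point of the sketch is the dispatch in run_op: Mapper and Filter operators are called per sample, while Deduplicator and Selector operators receive the whole dataset, which is why the base class has to be chosen before any operator logic is written.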

Decision criteria:

  • Transform data (add/modify fields) -> Mapper
  • Quality gate (keep/remove by metric) -> Filter
  • Remove duplicates -> Deduplicator
  • Select subset (top-k, random sample) -> Selector
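The decision criteria above amount to a lookup from processing goal to base class. A minimal sketch of that lookup follows; the Goal enum, the two tables, and select_operator_type are hypothetical helpers written for this page, and the method names are simply the ones listed in the Description section.

```python
from enum import Enum


class Goal(Enum):
    TRANSFORM = "transform"                  # add or modify fields
    QUALITY_GATE = "quality_gate"            # keep or remove by metric
    REMOVE_DUPLICATES = "remove_duplicates"
    SELECT_SUBSET = "select_subset"          # top-k, random sample


# Goal -> name of the base class to inherit from.
BASE_CLASS_FOR_GOAL = {
    Goal.TRANSFORM: "Mapper",
    Goal.QUALITY_GATE: "Filter",
    Goal.REMOVE_DUPLICATES: "Deduplicator",
    Goal.SELECT_SUBSET: "Selector",
}

# Base class name -> methods named for it on this page.
# For Mapper, the page lists the two methods as alternatives.
METHODS_FOR_BASE_CLASS = {
    "Mapper": ("process_single", "process_batched"),
    "Filter": ("compute_stats", "process_single"),
    "Deduplicator": ("compute_hash", "process"),
    "Selector": ("process",),
}


def select_operator_type(goal: Goal) -> tuple:
    """Return (base class name, methods to implement) for a goal."""
    base = BASE_CLASS_FOR_GOAL[goal]
    return base, METHODS_FOR_BASE_CLASS[base]
```

For example, select_operator_type(Goal.QUALITY_GATE) returns the Filter base class together with compute_stats and process_single.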

Related Pages

Implemented By
