Principle:Datajuicer Data juicer Data Mapping Transformation

From Leeroopedia
Domains Data_Processing, Data_Cleaning
Last Updated 2026-02-14 17:00 GMT

Overview

Mapper operators implement a per-sample data transformation pattern: each one applies a deterministic, rule-based modification to an individual data sample. The pattern covers text cleaning, normalization, augmentation, and multimodal processing across the text, image, audio, and video modalities.

Pattern

Mapper operators extend the Mapper base class and implement one of two processing interfaces:

1. Single-Sample Processing (process_single) -- Receives one sample dictionary, applies transformations, and returns the modified dictionary. This is the default mode for most mappers.

2. Batched Processing (process_batched) -- Receives a batch of samples (dict of lists), applies transformations in bulk, and returns the modified batch. Use this mode when vectorized or bulk operations are cheaper than repeated per-sample calls.

The Mapper base class handles the dispatch between single and batched modes, dataset iteration, and integration with the operator pipeline. Concrete mappers focus solely on implementing the transformation logic.
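The split between the two interfaces can be sketched as follows. This is a minimal, self-contained illustration and not Data-Juicer's actual base class: SketchMapper, its run method, and the two example mappers are hypothetical names, and the fallback from batched to single-sample processing is an assumption about how such a dispatch could work.

```python
from typing import Any, Dict, List

Sample = Dict[str, Any]        # one sample: field name -> value
Batch = Dict[str, List[Any]]   # a batch: field name -> list of values


class SketchMapper:
    """Hypothetical stand-in for a Mapper base class (illustration only)."""

    def process_single(self, sample: Sample) -> Sample:
        raise NotImplementedError

    def process_batched(self, batch: Batch) -> Batch:
        # Default batched path: unpack the dict of lists into per-sample
        # dicts, call process_single on each, and repack the results.
        keys = list(batch.keys())
        size = len(batch[keys[0]]) if keys else 0
        out: Batch = {}
        for i in range(size):
            result = self.process_single({k: batch[k][i] for k in keys})
            for key, value in result.items():
                out.setdefault(key, []).append(value)
        return out

    def run(self, batch: Batch) -> Batch:
        # Callers always go through the batched entry point; subclasses
        # choose which of the two methods to override.
        return self.process_batched(batch)


class LowercaseMapper(SketchMapper):
    """Overrides only the single-sample interface."""

    def process_single(self, sample: Sample) -> Sample:
        sample["text"] = sample["text"].lower()
        return sample


class StripBatchedMapper(SketchMapper):
    """Overrides the batched interface to handle the whole column at once."""

    def process_batched(self, batch: Batch) -> Batch:
        batch["text"] = [t.strip() for t in batch["text"]]
        return batch
```

In this sketch a concrete mapper contains only transformation logic, and the base class decides how samples reach it.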

Common Transformation Categories

  • Text Cleaning -- Remove unwanted content: HTML tags, bibliographies, comments, headers, copyright notices, emails, IPs, links, specific characters, table text, long words, incorrect substrings
  • Text Normalization -- Standardize text: Unicode fixing, whitespace normalization, punctuation normalization, Chinese character conversion
  • Text Restructuring -- Sentence splitting, text chunking, macro expansion, content replacement
  • Data Augmentation -- NLP augmentation (NlpaugEnMapper, NlpcdaZhMapper), sentence augmentation
  • Custom Processing -- User-defined Python functions via PythonFileMapper/PythonLambdaMapper
  • Multimodal Processing -- Image blur/segmentation/tagging, video frame extraction/splitting/captioning, audio noise addition, pose estimation, depth estimation, object detection, face blurring, watermark removal
  • File Operations -- Download/upload files, S3 integration
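As an illustration of the text cleaning and normalization categories, the sketch below chains a few rule-based steps over one sample using only the standard library. The function name clean_text_sample, the "text" field key, and the regular expressions are assumptions chosen for this example; they are simplified and are not the patterns used by any specific Data-Juicer mapper.

```python
import re
import unicodedata

# Deliberately simple patterns for illustration; real mappers may be stricter.
LINK_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text_sample(sample, text_key="text"):
    """Return a copy of the sample with its text field cleaned and normalized."""
    text = sample[text_key]
    # Normalization: fold compatibility characters (e.g. full-width forms).
    text = unicodedata.normalize("NFKC", text)
    # Cleaning: drop links first, then email addresses.
    text = LINK_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    # Normalization: collapse whitespace runs and trim the ends.
    text = WHITESPACE_RE.sub(" ", text).strip()
    return {**sample, text_key: text}
```

Each step depends only on the sample it receives, so the function behaves like a single-sample mapper and can be applied to samples in any order.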

Key Characteristics

  • Per-sample operation: each sample processed independently (parallelizable)
  • Two processing modes: single-sample and batched for flexibility and performance
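  • Typically deterministic, rule-based transformations rather than statistical or LLM-dependent ones; model-backed mappers such as captioning or depth estimation are the exception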
  • Modality-agnostic base pattern applied across text, image, audio, and video
  • Registered via @OPERATORS.register_module() and enabled, with their parameters, through YAML configuration
  • Many mappers have optional external library dependencies managed via lazy loading
  • Supports operator fusion for consecutive mappers that share loaded resources
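The registration and configuration characteristics above can be sketched with a small local registry. Everything here is a hypothetical stand-in: the Registry class, its build method, the operator names, and the PROCESS list (written as the Python structure a parsed YAML operator list might yield) are assumptions for illustration, not Data-Juicer's actual implementation or config schema.

```python
class Registry:
    """Hypothetical name-to-class registry (illustration only)."""

    def __init__(self):
        self._modules = {}

    def register_module(self, name):
        # Decorator factory: stores the class under the given name.
        def decorator(cls):
            if name in self._modules:
                raise KeyError(f"operator {name!r} is already registered")
            self._modules[name] = cls
            return cls
        return decorator

    def build(self, name, **kwargs):
        # Instantiate a registered operator from its name and parameters.
        return self._modules[name](**kwargs)


OPERATORS = Registry()  # local stand-in, not imported from any library


@OPERATORS.register_module("lowercase_mapper")
class LowercaseMapper:
    def process_single(self, sample):
        return {**sample, "text": sample["text"].lower()}


@OPERATORS.register_module("truncate_mapper")
class TruncateMapper:
    def __init__(self, max_len=10):
        self.max_len = max_len

    def process_single(self, sample):
        return {**sample, "text": sample["text"][: self.max_len]}


# Assumed shape of a parsed YAML operator list: one {name: params} per step.
PROCESS = [{"lowercase_mapper": {}}, {"truncate_mapper": {"max_len": 5}}]


def run_pipeline(samples, process):
    """Build operators from config and apply them to each sample in order."""
    ops = [OPERATORS.build(name, **params)
           for step in process for name, params in step.items()]
    results = []
    for sample in samples:
        for op in ops:
            sample = op.process_single(sample)
        results.append(sample)
    return results
```

Because operators are looked up by name, the configuration alone selects the mappers and their parameters, and the order of the list determines the order of application.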
