Principle:Datajuicer Data juicer Data Mapping Transformation

Domains	Data_Processing, Data_Cleaning
Last Updated	2026-02-14 17:00 GMT

Overview

A per-sample data transformation pattern in which Mapper operators apply predominantly deterministic, rule-based modifications to individual data samples. It covers text cleaning, normalization, augmentation, and multimodal processing across text, image, audio, and video modalities.

Pattern

Mapper operators extend the Mapper base class and implement one of two processing interfaces:

1. Single-Sample Processing (process_single) -- Receives one sample dictionary, applies transformations, and returns the modified dictionary. This is the default mode for most mappers.

2. Batched Processing (process_batched) -- Receives a batch of samples (dict of lists), applies transformations in bulk, and returns the modified batch. Used for efficiency when vectorized operations are beneficial.

The Mapper base class handles the dispatch between single and batched modes, dataset iteration, and integration with the operator pipeline. Concrete mappers focus solely on implementing the transformation logic.
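The split between the two interfaces can be illustrated with a minimal, self-contained sketch. It is not Data-Juicer's actual base class: `ToyMapper`, its `batched` flag, and its `run` method are hypothetical names chosen for illustration, and only `process_single` and `process_batched` are taken from the description above.

```python
from typing import Any, Dict, List

Sample = Dict[str, Any]
Batch = Dict[str, List[Any]]


class ToyMapper:
    """Hypothetical stand-in for a Mapper base class (illustration only)."""

    batched = False  # assumed flag: subclasses set True to use process_batched

    def process_single(self, sample: Sample) -> Sample:
        raise NotImplementedError

    def process_batched(self, batch: Batch) -> Batch:
        # Default fallback: unpack the dict of lists, map each sample, repack.
        keys = list(batch)
        n = len(batch[keys[0]]) if keys else 0
        results = [
            self.process_single({k: batch[k][i] for k in keys}) for i in range(n)
        ]
        out_keys = list(results[0]) if results else keys
        return {k: [r[k] for r in results] for k in out_keys}

    def run(self, samples: List[Sample]) -> List[Sample]:
        # Dispatch: the base class picks the mode, subclasses only transform.
        if not self.batched:
            return [self.process_single(dict(s)) for s in samples]
        keys = list(samples[0]) if samples else []
        batch = {k: [s[k] for s in samples] for k in keys}
        result = self.process_batched(batch)
        return [{k: v[i] for k, v in result.items()} for i in range(len(samples))]


class LowercaseMapper(ToyMapper):
    """Single-sample mode: one dict in, one dict out."""

    def process_single(self, sample: Sample) -> Sample:
        sample["text"] = sample["text"].lower()
        return sample


class StripBatchedMapper(ToyMapper):
    """Batched mode: operates on whole columns at once."""

    batched = True

    def process_batched(self, batch: Batch) -> Batch:
        batch["text"] = [t.strip() for t in batch["text"]]
        return batch
```

Calling `LowercaseMapper().run(samples)` or `StripBatchedMapper().run(samples)` on a list of `{"text": ...}` dictionaries goes through the same entry point; only the overridden method differs.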

Common Transformation Categories

Text Cleaning -- Remove unwanted content: HTML tags, bibliographies, comments, headers, copyright notices, emails, IPs, links, specific characters, table text, long words, incorrect substrings
Text Normalization -- Standardize text: Unicode fixing, whitespace normalization, punctuation normalization, Chinese character conversion
Text Restructuring -- Sentence splitting, text chunking, macro expansion, content replacement
Data Augmentation -- NLP augmentation (NlpaugEn, NlpcdaZh), sentence augmentation
Custom Processing -- User-defined Python functions via PythonFileMapper/PythonLambdaMapper
Multimodal Processing -- Image blur/segmentation/tagging, video frame extraction/splitting/captioning, audio noise addition, pose estimation, depth estimation, object detection, face blurring, watermark removal
File Operations -- Download/upload files, S3 integration
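As a rough illustration of the first two categories, the sketch below chains a cleaning step and a normalization step using only the Python standard library. The function names and regular expressions are assumptions made for this example; the real mappers are more thorough and are not reproduced here.

```python
import re
import unicodedata
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]

# Deliberately simple patterns; illustrative assumptions, not library code.
TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_text(sample: Sample) -> Sample:
    """Text cleaning: drop HTML-like tags and email addresses."""
    text = TAG_RE.sub("", sample["text"])
    text = EMAIL_RE.sub("", text)
    return {**sample, "text": text}


def normalize_text(sample: Sample) -> Sample:
    """Text normalization: NFKC Unicode form, then collapse whitespace."""
    text = unicodedata.normalize("NFKC", sample["text"])
    text = re.sub(r"\s+", " ", text).strip()
    return {**sample, "text": text}


def apply_mappers(sample: Sample, mappers: List[Callable[[Sample], Sample]]) -> Sample:
    """Apply each per-sample transformation in order."""
    for mapper in mappers:
        sample = mapper(sample)
    return sample
```

Each function takes one sample and returns a new one, so the order of the steps is explicit in the list passed to `apply_mappers` and any step can be tested in isolation.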

Key Characteristics

Per-sample operation: each sample processed independently (parallelizable)
Two processing modes: single-sample and batched for flexibility and performance
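Mostly deterministic, rule-based transformations rather than statistical or LLM-dependent ones; the augmentation and model-backed multimodal mappers (e.g. captioning, depth estimation) are exceptions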
Modality-agnostic base pattern applied across text, image, audio, and video
Registered by name via the @OPERATORS.register_module() decorator and configured through YAML
Many mappers have optional external library dependencies managed via lazy loading
Supports operator fusion for consecutive mappers that share loaded resources
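The registration and lazy-loading characteristics can be sketched as follows. This is a simplified, hypothetical reimplementation: the `Registry` and `LazyModule` classes, the `toy_*` operator names, and `build_pipeline` are all assumptions for illustration, and the configuration is shown as an already-parsed list of single-key dictionaries, which is assumed to be roughly what a YAML operator list looks like once loaded.

```python
import importlib
from typing import Any, Callable, Dict, List, Optional


class Registry:
    """Toy name-to-class registry (hypothetical, not Data-Juicer's code)."""

    def __init__(self) -> None:
        self._modules: Dict[str, type] = {}

    def register_module(self, name: str) -> Callable[[type], type]:
        def decorator(cls: type) -> type:
            if name in self._modules:
                raise KeyError(f"operator already registered: {name}")
            self._modules[name] = cls
            return cls
        return decorator

    def get(self, name: str) -> type:
        return self._modules[name]


class LazyModule:
    """Defers importing a module until one of its attributes is accessed."""

    def __init__(self, module_name: str) -> None:
        self._module_name = module_name
        self._module = None

    def __getattr__(self, attr: str) -> Any:
        # Only reached for attributes not set in __init__.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, attr)


OPERATORS = Registry()
unicodedata = LazyModule("unicodedata")  # stands in for a heavy optional dependency


@OPERATORS.register_module("toy_whitespace_mapper")
class ToyWhitespaceMapper:
    def process_single(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        text = unicodedata.normalize("NFKC", sample["text"])  # import happens here
        sample["text"] = " ".join(text.split())
        return sample


@OPERATORS.register_module("toy_truncate_mapper")
class ToyTruncateMapper:
    def __init__(self, max_chars: int = 10) -> None:
        self.max_chars = max_chars

    def process_single(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        sample["text"] = sample["text"][: self.max_chars]
        return sample


def build_pipeline(process_cfg: List[Dict[str, Optional[Dict[str, Any]]]]) -> List[Any]:
    """Instantiate operators from a parsed config: [{op_name: {param: value}}, ...]."""
    ops = []
    for entry in process_cfg:
        ((name, kwargs),) = entry.items()
        ops.append(OPERATORS.get(name)(**(kwargs or {})))
    return ops
```

Because operators are looked up by string name, a pipeline can be described entirely in configuration, and the lazy wrapper keeps an optional dependency from being imported unless a mapper that needs it actually runs.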
