Principle:Datajuicer Data juicer Data Mapping Transformation
| Domains | Data_Processing, Data_Cleaning |
|---|---|
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A per-sample data transformation pattern where Mapper operators apply deterministic, rule-based modifications to individual data samples, covering text cleaning, normalization, augmentation, and multimodal processing across text, image, audio, and video modalities.
Pattern
Mapper operators extend the Mapper base class and implement one of two processing interfaces:
1. Single-Sample Processing (process_single) -- Receives one sample dictionary, applies transformations, and returns the modified dictionary. This is the default mode for most mappers.
2. Batched Processing (process_batched) -- Receives a batch of samples (dict of lists), applies transformations in bulk, and returns the modified batch. Used for efficiency when vectorized operations are beneficial.
The Mapper base class handles the dispatch between single and batched modes, dataset iteration, and integration with the operator pipeline. Concrete mappers focus solely on implementing the transformation logic.
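The split between the two interfaces can be illustrated with a minimal, self-contained sketch. It does not reproduce Data-Juicer's actual Mapper base class: SketchMapper, the batched flag, the run method, and the helpers to_batch and to_samples are hypothetical names chosen for illustration. Only the method names process_single and process_batched come from the description above.

```python
from typing import Any, Dict, List

Sample = Dict[str, Any]        # one sample: field name -> value
Batch = Dict[str, List[Any]]   # one batch: field name -> list of values


def to_batch(samples: List[Sample]) -> Batch:
    """Turn a list of sample dicts into a dict of lists."""
    keys = list(samples[0].keys()) if samples else []
    return {k: [s[k] for s in samples] for k in keys}


def to_samples(batch: Batch) -> List[Sample]:
    """Turn a dict of lists back into a list of sample dicts."""
    keys = list(batch.keys())
    size = len(batch[keys[0]]) if keys else 0
    return [{k: batch[k][i] for k in keys} for i in range(size)]


class SketchMapper:
    """Hypothetical base class; a stand-in for the real Mapper base."""

    batched = False  # assumed flag: subclasses set True to use process_batched

    def process_single(self, sample: Sample) -> Sample:
        raise NotImplementedError

    def process_batched(self, batch: Batch) -> Batch:
        raise NotImplementedError

    def run(self, samples: List[Sample]) -> List[Sample]:
        """Dispatch to whichever interface the subclass implements."""
        if self.batched:
            return to_samples(self.process_batched(to_batch(samples)))
        # Copy each sample so the caller's input is left untouched.
        return [self.process_single(dict(s)) for s in samples]


class LowercaseMapper(SketchMapper):
    """Single-sample mode: one dict in, one dict out."""

    def process_single(self, sample: Sample) -> Sample:
        sample["text"] = sample["text"].lower()
        return sample


class StripBatchedMapper(SketchMapper):
    """Batched mode: operates on the whole 'text' column at once."""

    batched = True

    def process_batched(self, batch: Batch) -> Batch:
        batch["text"] = [t.strip() for t in batch["text"]]
        return batch


data = [{"text": "  Hello World "}, {"text": "DATA  "}]
lowered = LowercaseMapper().run(data)
stripped = StripBatchedMapper().run(lowered)
print(stripped)
```

In this sketch the concrete classes contain only transformation logic, while the base class decides how samples are fed to them, which mirrors the division of responsibility described above.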
Common Transformation Categories
- Text Cleaning -- Remove unwanted content: HTML tags, bibliographies, comments, headers, copyright notices, emails, IPs, links, specific characters, table text, long words, incorrect substrings
- Text Normalization -- Standardize text: Unicode fixing, whitespace normalization, punctuation normalization, Chinese character conversion
- Text Restructuring -- Sentence splitting, text chunking, macro expansion, content replacement
- Data Augmentation -- NLP augmentation (NlpaugEn, NlpcdaZh), sentence augmentation
- Custom Processing -- User-defined Python functions via PythonFileMapper/PythonLambdaMapper
- Multimodal Processing -- Image blur/segmentation/tagging, video frame extraction/splitting/captioning, audio noise addition, pose estimation, depth estimation, object detection, face blurring, watermark removal
- File Operations -- Download/upload files, S3 integration
Key Characteristics
- Per-sample operation: each sample processed independently (parallelizable)
- Two processing modes: single-sample and batched for flexibility and performance
- Deterministic, rule-based transformations (not statistical or LLM-dependent)
- Modality-agnostic base pattern applied across text, image, audio, and video
- Registered via @OPERATORS.register_module() with YAML configuration
- Many mappers have optional external library dependencies managed via lazy loading
- Supports operator fusion for consecutive mappers that share loaded resources
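The registration and configuration characteristics can be sketched as follows. This is an assumption-laden illustration, not Data-Juicer's registry: the Registry class, its build method, the two toy mappers, and their operator names are hypothetical, and process_list is a plain Python structure standing in for what a parsed YAML operator list might look like.

```python
import re
from typing import Any, Dict, List


class Registry:
    """Hypothetical name-to-class registry (not Data-Juicer's implementation)."""

    def __init__(self) -> None:
        self._modules: Dict[str, type] = {}

    def register_module(self, name: str):
        """Decorator that records a class under the given operator name."""
        def decorator(cls: type) -> type:
            self._modules[name] = cls
            return cls
        return decorator

    def build(self, name: str, **kwargs: Any):
        """Instantiate a registered operator from its name and arguments."""
        return self._modules[name](**kwargs)


OPERATORS = Registry()


@OPERATORS.register_module("mask_email_mapper")  # hypothetical operator name
class MaskEmailMapper:
    def __init__(self, text_key: str = "text", repl: str = "") -> None:
        self.text_key = text_key
        self.repl = repl
        # Deliberately simple pattern; real email matching is more involved.
        self.pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def process_single(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        sample[self.text_key] = self.pattern.sub(self.repl, sample[self.text_key])
        return sample


@OPERATORS.register_module("collapse_whitespace_mapper")  # hypothetical name
class CollapseWhitespaceMapper:
    def __init__(self, text_key: str = "text") -> None:
        self.text_key = text_key

    def process_single(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        text = re.sub(r"\s+", " ", sample[self.text_key])
        sample[self.text_key] = text.strip()
        return sample


# Stand-in for a parsed YAML list: each entry maps an operator name to its args.
process_list: List[Dict[str, Dict[str, Any]]] = [
    {"mask_email_mapper": {"repl": "<EMAIL>"}},
    {"collapse_whitespace_mapper": {}},
]


def build_pipeline(config: List[Dict[str, Dict[str, Any]]]) -> list:
    """Look up each configured operator by name and instantiate it."""
    return [OPERATORS.build(name, **(args or {}))
            for entry in config for name, args in entry.items()]


def apply_pipeline(ops: list, sample: Dict[str, Any]) -> Dict[str, Any]:
    """Run the mappers in order; each sample is handled independently."""
    for op in ops:
        sample = op.process_single(sample)
    return sample


result = apply_pipeline(build_pipeline(process_list),
                        {"text": "mail  me at  a.b@example.com  now"})
print(result)  # {'text': 'mail me at <EMAIL> now'}
```

Because operators are looked up by name, a pipeline can be changed by editing configuration alone, without touching the code that runs it.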
Implementations
- Implementation:Datajuicer_Data_juicer_CleanHtmlMapper
- Implementation:Datajuicer_Data_juicer_CleanCopyrightMapper
- Implementation:Datajuicer_Data_juicer_CleanEmailMapper
- Implementation:Datajuicer_Data_juicer_CleanIpMapper
- Implementation:Datajuicer_Data_juicer_CleanLinksMapper
- Implementation:Datajuicer_Data_juicer_FixUnicodeMapper
- Implementation:Datajuicer_Data_juicer_WhitespaceNormalizationMapper
- Implementation:Datajuicer_Data_juicer_PunctuationNormalizationMapper
- Implementation:Datajuicer_Data_juicer_ChineseConvertMapper
- Implementation:Datajuicer_Data_juicer_RemoveBibliographyMapper
- Implementation:Datajuicer_Data_juicer_RemoveCommentsMapper
- Implementation:Datajuicer_Data_juicer_RemoveHeaderMapper
- Implementation:Datajuicer_Data_juicer_RemoveLongWordsMapper
- Implementation:Datajuicer_Data_juicer_RemoveNonChineseCharacterlMapper
- Implementation:Datajuicer_Data_juicer_RemoveRepeatSentencesMapper
- Implementation:Datajuicer_Data_juicer_RemoveSpecificCharsMapper
- Implementation:Datajuicer_Data_juicer_RemoveTableTextMapper
- Implementation:Datajuicer_Data_juicer_RemoveWordsWithIncorrectSubstringsMapper
- Implementation:Datajuicer_Data_juicer_ReplaceContentMapper
- Implementation:Datajuicer_Data_juicer_ExpandMacroMapper
- Implementation:Datajuicer_Data_juicer_SentenceSplitMapper
- Implementation:Datajuicer_Data_juicer_TextChunkMapper
- Implementation:Datajuicer_Data_juicer_ExtractTablesFromHtmlMapper
- Implementation:Datajuicer_Data_juicer_PythonFileMapper
- Implementation:Datajuicer_Data_juicer_PythonLambdaMapper
- Implementation:Datajuicer_Data_juicer_NlpaugEnMapper
- Implementation:Datajuicer_Data_juicer_NlpcdaZhMapper
- Implementation:Datajuicer_Data_juicer_SentenceAugmentationMapper
- Implementation:Datajuicer_Data_juicer_AudioAddGaussianNoiseMapper
- Implementation:Datajuicer_Data_juicer_AudioFFmpegWrappedMapper
- Implementation:Datajuicer_Data_juicer_ImageBlurMapper
- Implementation:Datajuicer_Data_juicer_ImageFaceBlurMapper
- Implementation:Datajuicer_Data_juicer_ImageSegmentMapper
- Implementation:Datajuicer_Data_juicer_ImageRemoveBackgroundMapper
- Implementation:Datajuicer_Data_juicer_ImageTaggingMapper
- Implementation:Datajuicer_Data_juicer_ImageDetectionYoloMapper
- Implementation:Datajuicer_Data_juicer_ImageDiffusionMapper
- Implementation:Datajuicer_Data_juicer_ImageMMPoseMapper
- Implementation:Datajuicer_Data_juicer_ImageSAM3DBodyMapper
- Implementation:Datajuicer_Data_juicer_VideoExtractFramesMapper
- Implementation:Datajuicer_Data_juicer_VideoFFmpegWrappedMapper
- Implementation:Datajuicer_Data_juicer_VideoFaceBlurMapper
- Implementation:Datajuicer_Data_juicer_VideoRemoveWatermarkMapper
- Implementation:Datajuicer_Data_juicer_VideoResizeAspectRatioMapper
- Implementation:Datajuicer_Data_juicer_VideoResizeResolutionMapper
- Implementation:Datajuicer_Data_juicer_VideoSplitByDurationMapper
- Implementation:Datajuicer_Data_juicer_VideoSplitByKeyFrameMapper
- Implementation:Datajuicer_Data_juicer_VideoSplitBySceneMapper
- Implementation:Datajuicer_Data_juicer_VideoDepthEstimationMapper
- Implementation:Datajuicer_Data_juicer_VideoObjectSegmentingMapper
- Implementation:Datajuicer_Data_juicer_VideoUndistortMapper
- Implementation:Datajuicer_Data_juicer_VideoWholeBodyPoseEstimationMapper
- Implementation:Datajuicer_Data_juicer_VideoTaggingFromFramesMapper
- Implementation:Datajuicer_Data_juicer_VideoTaggingFromAudioMapper
- Implementation:Datajuicer_Data_juicer_VideoCameraPoseMapper
- Implementation:Datajuicer_Data_juicer_VideoCameraCalibrationStaticDeepcalibMapper
- Implementation:Datajuicer_Data_juicer_VideoCameraCalibrationStaticMogeMapper
- Implementation:Datajuicer_Data_juicer_VideoHandReconstructionMapper
- Implementation:Datajuicer_Data_juicer_VideoHandReconstructionHaworMapper
- Implementation:Datajuicer_Data_juicer_DownloadFileMapper
- Implementation:Datajuicer_Data_juicer_S3DownloadFileMapper
- Implementation:Datajuicer_Data_juicer_S3UploadFileMapper
- Implementation:Datajuicer_Data_juicer_SDXLPrompt2PromptMapper
- Implementation:Datajuicer_Data_juicer_VggtMapper