Implementation:Datajuicer Data juicer ReplaceContentMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for regex-based content replacement in text samples.
Description
ReplaceContentMapper is a mapper operator that performs regex-based find-and-replace operations on text samples. It accepts either a single pattern-replacement pair or lists of patterns and replacements. Patterns are pre-compiled with the re.DOTALL flag, so the dot (.) also matches newline characters and a single pattern can span multiple lines. Raw string notation wrapped around a pattern is stripped automatically before compilation. The pairs are applied to each text sample sequentially, so each pattern operates on the output of the previous one. A ValueError is raised if the lengths of patterns and replacements do not match. The operator runs in batched mode.
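The behaviour described above can be approximated with the standard re module alone. The following is a minimal sketch, not the operator's actual source. The helper names prepare_pattern and apply_replacements are hypothetical, and three details are assumptions made for illustration: that "raw string notation" means a literal r'...' or r"..." wrapper in the pattern text, that a pattern of None leaves the text unchanged, and that a single replacement string is shared by all patterns.

```python
import re
from typing import List, Union


def prepare_pattern(pattern: str) -> re.Pattern:
    """Compile a pattern with re.DOTALL, dropping a literal r'...' wrapper."""
    # Assumption: the pattern text itself may be wrapped as r'...' or r"...",
    # for example when it was copied from Python code into a config file.
    if len(pattern) > 2 and (
        (pattern.startswith("r'") and pattern.endswith("'"))
        or (pattern.startswith('r"') and pattern.endswith('"'))
    ):
        pattern = pattern[2:-1]
    return re.compile(pattern, flags=re.DOTALL)


def apply_replacements(
    text: str,
    pattern: Union[str, List[str], None] = None,
    repl: Union[str, List[str]] = "",
) -> str:
    """Apply each pattern-replacement pair to the text, in order."""
    if pattern is None:
        return text  # assumption: no pattern means no change
    patterns = [pattern] if isinstance(pattern, str) else list(pattern)
    if isinstance(repl, list):
        if len(repl) != len(patterns):
            raise ValueError("pattern and repl must have the same length")
        repls = repl
    else:
        # Assumption: one replacement string is reused for every pattern.
        repls = [repl] * len(patterns)
    for compiled, replacement in zip(map(prepare_pattern, patterns), repls):
        # Each substitution works on the result of the previous one.
        text = compiled.sub(replacement, text)
    return text


cleaned = apply_replacements(
    "<b>bold\ntext</b>  and   more",
    pattern=[r"<b>.*?</b>", r"\s+"],
    repl=["[removed]", " "],
)
print(cleaned)  # [removed] and more
```

The first pattern only matches because re.DOTALL lets the dot cross the newline inside the tag. The second pattern then collapses the whitespace left behind, which shows why the order of the pairs matters.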
Usage
Use when you need pattern-based content cleaning, redaction, or reformatting of text within the data processing pipeline.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/replace_content_mapper.py
Signature
@OPERATORS.register_module("replace_content_mapper")
class ReplaceContentMapper(Mapper):
    def __init__(self, pattern: Union[str, List[str], None] = None, repl: Union[str, List[str]] = "", *args, **kwargs):
Import
from data_juicer.ops.mapper.replace_content_mapper import ReplaceContentMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pattern | Union[str, List[str], None] | No | Regular expression pattern(s) to search for within text (default: None) |
| repl | Union[str, List[str]] | No | Replacement string(s) (default: empty string) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with content replaced |
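To make the I/O contract more concrete, the sketch below mimics how a batched mapper might receive and return samples. It assumes that a batch is a dict mapping field names to lists of values and that the text is stored under a "text" key. Both are assumptions for illustration, as is the helper name process_batch_sketch. The code uses only the re module and does not call Data-Juicer.

```python
import re
from typing import Dict, List


def process_batch_sketch(
    samples: Dict[str, List],
    pattern: str,
    repl: str,
    text_key: str = "text",  # assumption: name of the text field
) -> Dict[str, List]:
    """Replace matches in every text of the batch; other fields are untouched."""
    compiled = re.compile(pattern, flags=re.DOTALL)
    samples[text_key] = [compiled.sub(repl, t) for t in samples[text_key]]
    return samples


batch = {"text": ["call 555-1234 now", "no digits here"], "id": [1, 2]}
result = process_batch_sketch(batch, pattern=r"\d{3}-\d{4}", repl="[PHONE]")
print(result["text"])  # ['call [PHONE] now', 'no digits here']
```

The returned dict has the same keys and the same number of samples as the input; only the text values change.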
Usage Examples
process:
  - replace_content_mapper:
      pattern: '\s+'
      repl: ' '
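For intuition, the effect of this configuration on a single text can be reproduced with plain Python. This is a sketch of the equivalent regex substitution, not a call into Data-Juicer, and the sample string is made up for illustration.

```python
import re

# Same pattern and replacement as in the configuration above.
whitespace = re.compile(r"\s+", flags=re.DOTALL)

raw = "  Hello \t world\n\nagain "
collapsed = whitespace.sub(" ", raw)
print(repr(collapsed))  # ' Hello world again '
```

Each run of spaces, tabs, or newlines becomes one space. Leading and trailing whitespace is reduced to a single space rather than removed, so a separate trimming step is needed if that matters.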