Implementation:Datajuicer Data juicer CleanEmailMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for removing or replacing email addresses in text samples.
Description
CleanEmailMapper is a mapper operator that removes or replaces email addresses in text samples using regular expression matching. It extends the Mapper base class and runs in batched mode for efficiency. For each text in a batch, it applies a regex pattern and substitutes every match with a configurable replacement string. The default pattern matches standard email formats such as user@domain.tld, and the default replacement is the empty string, which removes the addresses. A custom regex pattern can be supplied for specialized matching needs.
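The behavior described above can be approximated with a short standalone sketch. It is not the operator's source code: the function name `clean_emails_batched`, the `text_key` argument, the column-oriented batch layout, and the regex `ASSUMED_EMAIL_PATTERN` are all assumptions made for illustration, and the operator's real default pattern may differ.

```python
import re
from typing import Dict, List, Optional

# Illustrative pattern only (an assumption); the operator's actual default may differ.
ASSUMED_EMAIL_PATTERN = r"[A-Za-z0-9._+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]+"


def clean_emails_batched(
    samples: Dict[str, List[str]],
    pattern: Optional[str] = None,
    repl: str = "",
    text_key: str = "text",  # hypothetical name for the text column
) -> Dict[str, List[str]]:
    """Replace every regex match in each text of a column-oriented batch."""
    # Fall back to the illustrative pattern when no custom one is given.
    regex = re.compile(pattern or ASSUMED_EMAIL_PATTERN)
    cleaned = dict(samples)  # shallow copy so the input batch is not mutated
    cleaned[text_key] = [regex.sub(repl, text) for text in samples[text_key]]
    return cleaned


batch = {"text": ["Contact alice@example.com for details.", "No address here."]}
result = clean_emails_batched(batch, repl="<EMAIL>")
print(result["text"])
```

Compiling the pattern once per batch, rather than once per text, is the main saving that batched processing gives in a sketch like this.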
Usage
Import when you need to strip personally identifiable email addresses from training data for privacy compliance.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/clean_email_mapper.py
Signature
@OPERATORS.register_module("clean_email_mapper")
class CleanEmailMapper(Mapper):
    def __init__(self,
                 pattern: Optional[str] = None,
                 repl: str = "",
                 *args, **kwargs):
Import
from data_juicer.ops.mapper.clean_email_mapper import CleanEmailMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pattern | Optional[str] | No | Regular expression pattern to search for within text. Default: None, which selects the standard email pattern |
| repl | str | No | Replacement string for matched patterns. Default: "" (removes emails) |
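To illustrate how the two parameters interact, the sketch below uses Python's `re.sub` directly rather than the operator itself. The pattern `custom_pattern` and the sample text are hypothetical and chosen only to show that a custom pattern can narrow matching to a single domain, while `repl` decides whether matches are deleted or masked.

```python
import re

texts = ["Write to ops@corp.example or jane@mail.example."]

# Hypothetical custom pattern: match only addresses at corp.example.
custom_pattern = r"[A-Za-z0-9._+\-]+@corp\.example\b"

# An empty replacement deletes the match; surrounding whitespace is left as-is.
removed = [re.sub(custom_pattern, "", t) for t in texts]

# A non-empty replacement masks the match with a placeholder token.
masked = [re.sub(custom_pattern, "[REDACTED]", t) for t in texts]

print(removed)
print(masked)
```

Because deletion leaves the neighboring spaces in place, a placeholder replacement is often easier to read in the cleaned text than an empty one.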
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with email addresses removed or replaced in text |
Usage Examples
YAML Configuration
process:
  - clean_email_mapper:
      repl: "<EMAIL>"