Implementation:Datajuicer Data juicer CleanIpMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for removing or replacing IPv4 and IPv6 addresses in text samples.
Description
CleanIpMapper is a mapper operator that removes or replaces IP addresses in text samples using regular expression matching. It applies a regex pattern that matches both IPv4 addresses (e.g., 192.168.1.1) and IPv6 addresses (colon-separated hex groups) to each text in a batch and replaces every match with a configurable replacement string. The replacement is empty by default, which effectively removes the addresses. A custom regex pattern can be provided for specialized needs. The operator extends the Mapper base class and runs in batched mode for efficiency.
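The matching-and-replacement step described above can be sketched in plain Python. This is a minimal illustration, not the operator's source: the names `IP_PATTERN` and `clean_ips` are hypothetical, and the regex is a simplified stand-in that is assumed to differ from Data-Juicer's built-in default (for example, it only covers the full, uncompressed IPv6 form).

```python
import re

# Hypothetical, simplified patterns for illustration only;
# this is NOT Data-Juicer's actual default regex.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"              # 0-255
IPV4 = rf"(?:{_OCTET}\.){{3}}{_OCTET}"                        # four dotted octets
IPV6 = r"(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}"            # eight hex groups, uncompressed
IP_PATTERN = re.compile(rf"(?:{IPV6})|(?:{IPV4})")


def clean_ips(texts, pattern=IP_PATTERN, repl=""):
    """Replace every pattern match in each text of a batch with `repl`."""
    return [pattern.sub(repl, text) for text in texts]


print(clean_ips(["ping 192.168.1.1 now"], repl="<IP>"))  # ['ping <IP> now']
```

With the default `repl=""` the matched address is simply dropped, which can leave doubled spaces or stray punctuation behind; a placeholder such as `<IP>` keeps the sentence structure visible.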
Usage
Import when you need to anonymize datasets by removing IP addresses that could identify users or systems.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/clean_ip_mapper.py
Signature
@OPERATORS.register_module("clean_ip_mapper")
class CleanIpMapper(Mapper):
    def __init__(self,
                 pattern: Optional[str] = None,
                 repl: str = "",
                 *args, **kwargs):
Import
from data_juicer.ops.mapper.clean_ip_mapper import CleanIpMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pattern | Optional[str] | No | Regular expression to search for within the text. Default: None, which selects a built-in pattern matching IPv4 and IPv6 addresses |
| repl | str | No | Replacement string for matched patterns. Default: "" (removes IP addresses) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with IP addresses removed or replaced in text |
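The batched input and output shape in the tables above can be illustrated as follows. This is a sketch under assumptions: it treats a batch as a dict of column lists with the text stored under a `"text"` key, the function name `process_batched_sketch` is hypothetical, and the IPv4-only regex is deliberately crude rather than the operator's default.

```python
import re

# Deliberately simple IPv4-only pattern, used here only to show the data flow.
SIMPLE_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def process_batched_sketch(samples, text_key="text", repl=""):
    """Return a new batch dict whose text column has matches replaced."""
    out = dict(samples)  # shallow copy: other columns pass through untouched
    out[text_key] = [SIMPLE_IPV4.sub(repl, text) for text in samples[text_key]]
    return out


batch = {"text": ["host 10.0.0.1 is down", "nothing to clean"], "id": [1, 2]}
cleaned = process_batched_sketch(batch, repl="<IP>")
```

Only the text column is rewritten; every other field in the batch is returned as it came in.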
Usage Examples
YAML Configuration
process:
  - clean_ip_mapper:
      repl: "<IP>"