Implementation:Huggingface Datatrove PIIFormatter
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove | Privacy, Data_Cleaning | 2026-02-14 |
Overview
Formatter pipeline step that detects and replaces email addresses and IP addresses in document text with safe placeholder values.
Description
PIIFormatter extends BaseFormatter and uses two PIIReplacer instances -- one for emails and one for IP addresses. Each PIIReplacer wraps a compiled regex pattern and a tuple of replacement strings that are cycled through in round-robin fashion via a _replace_i counter. The format method applies email replacement first (if enabled), then IP replacement (if enabled), and returns the modified text.
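The round-robin cycling described above can be sketched with the standard library alone. This is a minimal illustration, not datatrove's code: the class name RoundRobinReplacer and the simplified SIMPLE_EMAIL pattern are assumptions made for the example, and only the _replace_i counter name is taken from the description.

```python
import re
from typing import Callable, Optional, Tuple, Union


class RoundRobinReplacer:
    """Hypothetical stand-in for a PIIReplacer-like helper (not the library's code)."""

    def __init__(
        self,
        pattern: str,
        replacements: Union[Tuple[str, ...], str],
        validator: Optional[Callable[[str], bool]] = None,
    ):
        self.regex = re.compile(pattern)
        # A single string is treated as a one-element tuple.
        self.replacements = (replacements,) if isinstance(replacements, str) else tuple(replacements)
        self.validator = validator
        self._replace_i = 0  # index of the next replacement to hand out

    def replace(self, text: str) -> str:
        def pick(match: "re.Match[str]") -> str:
            # Leave the match untouched when a validator rejects it.
            if self.validator is not None and not self.validator(match.group(0)):
                return match.group(0)
            value = self.replacements[self._replace_i]
            # Advance the counter and wrap around at the end of the tuple.
            self._replace_i = (self._replace_i + 1) % len(self.replacements)
            return value

        return self.regex.sub(pick, text)


# Deliberately simplified email pattern, for illustration only.
SIMPLE_EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

emails = RoundRobinReplacer(SIMPLE_EMAIL, ("email@example.com", "firstname.lastname@example.org"))
result = emails.replace("Contact alice@corp.io, bob@corp.io or carol@corp.io.")
print(result)
```

With two replacements and three matches, the third match wraps around to the first replacement. Because the counter lives on the instance, the cycle continues across successive calls to replace rather than restarting for each text.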
The email regex accepts local parts that contain special characters (!#$%&'*+/=?^_`{|}~-) and dot-separated segments, as well as bracketed IP-literal domains. The IP regex matches standard IPv4 dotted-decimal notation. PIIReplacer also accepts an optional validator function (used for IPs): a match is replaced only if the validator accepts it, and is otherwise left unchanged. The public_ip_validator uses ipaddress.ip_address().is_global to distinguish public IPs from private/reserved ones, which is how only_remove_public_ips leaves private and reserved addresses untouched.
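The validator idea can be illustrated with the standard ipaddress module. The sketch below is an assumption-laden example rather than the library's implementation: the IPV4 pattern is a common dotted-decimal regex, and the names is_public_ip, redact_public_ips and the "[IP]" placeholder are hypothetical.

```python
import ipaddress
import re

# A common IPv4 dotted-decimal pattern (assumed here, not copied from datatrove).
IPV4 = re.compile(
    r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
)


def is_public_ip(candidate: str) -> bool:
    """Return True only for globally routable addresses."""
    try:
        return ipaddress.ip_address(candidate).is_global
    except ValueError:
        # Strings the ipaddress module refuses to parse are never replaced.
        return False


def redact_public_ips(text: str, placeholder: str = "[IP]") -> str:
    # Replace a match only when the validator accepts it.
    return IPV4.sub(
        lambda m: placeholder if is_public_ip(m.group(0)) else m.group(0),
        text,
    )


sample = "Gateway 192.168.1.1, loopback 127.0.0.1, resolver 8.8.8.8."
redacted = redact_public_ips(sample)
print(redacted)
```

Here the private and loopback addresses pass through unchanged, while the globally routable resolver address is replaced.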
The class inherits from BaseFormatter, which provides the pipeline integration: iterating over documents, applying format to each document's text, and yielding updated documents.
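The pipeline integration described above amounts to a generator that rewrites each document's text and yields the document onward. The following is a rough sketch under stated assumptions: Doc and run_formatter are hypothetical stand-ins and do not reproduce datatrove's Document or BaseFormatter classes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Doc:
    """Hypothetical minimal document: just an id and a text field."""
    text: str
    id: str = ""


def run_formatter(docs: Iterable[Doc], format_fn: Callable[[str], str]) -> Iterator[Doc]:
    # Apply the formatting function to each document's text, then pass it on.
    for doc in docs:
        doc.text = format_fn(doc.text)
        yield doc


docs = [Doc("  hello  ", "a"), Doc("world ", "b")]
out = list(run_formatter(docs, str.strip))
print([d.text for d in out])
```

Any text-to-text function can stand in for format_fn; a PII formatter would supply a function that performs the email and IP replacements.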
Usage
Add to a pipeline as a formatting step. Both email and IP replacement are enabled by default, and by default only public IPs are replaced. Disable either replacement by passing remove_emails=False or remove_ips=False.
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/formatters/pii.py (L42-94)
Signature:
class PIIFormatter(BaseFormatter):
    def __init__(
        self,
        remove_emails: bool = True,
        remove_ips: bool = True,
        only_remove_public_ips: bool = True,
        email_replacement: tuple[str, ...] | str = (
            "email@example.com",
            "firstname.lastname@example.org",
        ),
        ip_replacement: tuple[str, ...] | str = (
            "22.214.171.124",
            "126.96.36.199",
            "188.8.131.52",
            "184.108.40.206",
            "220.127.116.11",
            "18.104.22.168",
        ),
    ):
Import:
from datatrove.pipeline.formatters import PIIFormatter
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| remove_emails | bool | No | Replace email addresses in text (default True) |
| remove_ips | bool | No | Replace IP addresses in text (default True) |
| only_remove_public_ips | bool | No | Only replace public (globally routable) IPs; skip private/reserved ranges (default True) |
| email_replacement | tuple[str, ...] or str | No | Replacement strings cycled for emails (default uses example.com and example.org addresses) |
| ip_replacement | tuple[str, ...] or str | No | Replacement strings cycled for IPs (default uses 6 non-responsive addresses) |
Pipeline I/O:
- Input: Document objects with text that may contain email addresses and IP addresses
- Output: Document objects with detected PII replaced by placeholder values
Usage Examples
Example 1 -- Default PII removal:
from datatrove.pipeline.formatters import PIIFormatter
formatter = PIIFormatter()
# Replaces both emails and public IPs with the default placeholder values
Example 2 -- Only remove emails with custom replacements:
from datatrove.pipeline.formatters import PIIFormatter
formatter = PIIFormatter(
    remove_emails=True,
    remove_ips=False,
    email_replacement=("redacted@example.com",),
)
Example 3 -- Remove all IPs including private ranges:
from datatrove.pipeline.formatters import PIIFormatter
formatter = PIIFormatter(
    remove_emails=False,
    remove_ips=True,
    only_remove_public_ips=False,
)