Implementation:Huggingface Datatrove PIIFormatter

From Leeroopedia
Sources: Huggingface Datatrove
Domains: Privacy, Data_Cleaning
Last Updated: 2026-02-14

Overview

Formatter pipeline step that detects and replaces email addresses and IP addresses in document text with safe placeholder values.

Description

PIIFormatter extends BaseFormatter and uses two PIIReplacer instances, one for emails and one for IP addresses. Each PIIReplacer wraps a compiled regex pattern and a tuple of replacement strings, which it cycles through round-robin using a _replace_i counter. The format method applies email replacement first (if enabled), then IP replacement (if enabled), and returns the modified text.
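The round-robin behaviour described above can be sketched with the standard library alone. This is a simplified stand-in, not the datatrove source: the class name RoundRobinReplacer and the deliberately naive email pattern are assumptions made for illustration, and the real PIIReplacer uses a much stricter regex.

```python
import re
from typing import Callable, Optional


class RoundRobinReplacer:
    """Simplified stand-in for PIIReplacer (illustrative, not the library code)."""

    def __init__(self, pattern: str, replacements, validator: Optional[Callable[[str], bool]] = None):
        self.regex = re.compile(pattern)
        # Accept either a single string or an iterable of strings.
        self.replacements = (replacements,) if isinstance(replacements, str) else tuple(replacements)
        self.validator = validator
        self._replace_i = 0  # index of the next replacement to hand out

    def replace(self, text: str) -> str:
        def pick(match: re.Match) -> str:
            # Leave the match untouched when the validator rejects it.
            if self.validator is not None and not self.validator(match.group(0)):
                return match.group(0)
            replacement = self.replacements[self._replace_i]
            self._replace_i = (self._replace_i + 1) % len(self.replacements)
            return replacement

        return self.regex.sub(pick, text)


# Naive email pattern, for demonstration only (assumption, not the real regex).
emails = RoundRobinReplacer(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    ("email@example.com", "firstname.lastname@example.org"),
)
result = emails.replace("Contact alice@corp.io, bob@corp.io or carol@corp.io.")
print(result)
```

Because the counter lives on the replacer instance, it carries over between calls: in this sketch the next document starts with whichever replacement follows the last one used.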

The email regex handles complex local parts containing special characters (!#$%&'*+/=?^_`{|}~-), multiple dots, and bracketed IP-literal domains. The IP regex matches standard IPv4 dotted-decimal notation. PIIReplacer accepts an optional validator function (used here for IPs) that lets individual matches be skipped: public_ip_validator uses ipaddress.ip_address().is_global to distinguish public IPs from private or reserved ones.
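A minimal sketch of the validator idea, using only the standard ipaddress and re modules. The helper names is_public_ip and replace_public_ips, the loose IPv4 pattern, and the "<IP>" placeholder are hypothetical and chosen for illustration. Only the use of ipaddress.ip_address(...).is_global comes from the description above.

```python
import ipaddress
import re


def is_public_ip(candidate: str, public_only: bool = True) -> bool:
    """Return True when the matched string should be replaced."""
    try:
        address = ipaddress.ip_address(candidate)
    except ValueError:
        # Looks like dotted-decimal but is not a valid address, e.g. "999.1.1.1".
        return False
    return address.is_global if public_only else True


# Deliberately loose IPv4 pattern (assumption): validity is left to the validator.
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def replace_public_ips(text: str, placeholder: str = "<IP>") -> str:
    # Keep the original match whenever the validator says it is not public.
    return IPV4.sub(lambda m: placeholder if is_public_ip(m.group(0)) else m.group(0), text)


redacted = replace_public_ips("gateway 192.168.1.1, resolver 8.8.8.8, loopback 127.0.0.1")
print(redacted)
```

Here the private and loopback addresses survive unchanged, and only the globally routable one is swapped out, which mirrors the effect of only_remove_public_ips=True.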

The class inherits from BaseFormatter, which provides the pipeline integration: iterating over documents, applying format to each document's text, and yielding updated documents.
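The division of labour between the base class and a concrete formatter can be pictured roughly as follows. Doc, SketchFormatter, and UpperCaseFormatter are hypothetical names, and the run signature is simplified. The sketch only shows the pattern in which the base class owns the document loop and a subclass supplies format.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline document."""
    text: str
    id: str = ""


class SketchFormatter:
    """Hypothetical base class: subclasses only implement format()."""

    def format(self, text: str) -> str:
        raise NotImplementedError

    def run(self, docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            doc.text = self.format(doc.text)  # rewrite the text in place
            yield doc  # pass the updated document downstream


class UpperCaseFormatter(SketchFormatter):
    """Toy subclass standing in for a real formatter such as PIIFormatter."""

    def format(self, text: str) -> str:
        return text.upper()


out = list(UpperCaseFormatter().run([Doc("hello"), Doc("world")]))
print([d.text for d in out])
```

Yielding documents one at a time keeps the step streaming, so a formatter never needs the whole dataset in memory.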

Usage

Add to a pipeline as a formatting step. Both email and IP removal are enabled by default. Disable either by passing remove_emails=False or remove_ips=False.

Code Reference

Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/formatters/pii.py (L42-94)

Signature:

class PIIFormatter(BaseFormatter):
    def __init__(
        self,
        remove_emails: bool = True,
        remove_ips: bool = True,
        only_remove_public_ips: bool = True,
        email_replacement: tuple[str, ...] | str = (
            "email@example.com",
            "firstname.lastname@example.org",
        ),
        ip_replacement: tuple[str, ...] | str = (
            "22.214.171.124",
            "126.96.36.199",
            "188.8.131.52",
            "184.108.40.206",
            "220.127.116.11",
            "18.104.22.168",
        ),
    ):

Import:

from datatrove.pipeline.formatters import PIIFormatter

I/O Contract

Inputs:

| Parameter | Type | Required | Description |
|---|---|---|---|
| remove_emails | bool | No | Replace email addresses in text (default True) |
| remove_ips | bool | No | Replace IP addresses in text (default True) |
| only_remove_public_ips | bool | No | Only replace public (globally routable) IPs; skip private/reserved ranges (default True) |
| email_replacement | tuple[str, ...] or str | No | Replacement strings cycled for emails (default uses example.com and example.org addresses) |
| ip_replacement | tuple[str, ...] or str | No | Replacement strings cycled for IPs (default uses 6 non-responsive addresses) |

Pipeline I/O:

  • Input: Document objects with text that may contain email addresses and IP addresses
  • Output: Document objects with detected PII replaced by placeholder values

Usage Examples

Example 1 -- Default PII removal:

from datatrove.pipeline.formatters import PIIFormatter

formatter = PIIFormatter()
# Replaces both emails and public IPs with the default placeholder values

Example 2 -- Only remove emails with custom replacements:

from datatrove.pipeline.formatters import PIIFormatter

formatter = PIIFormatter(
    remove_emails=True,
    remove_ips=False,
    email_replacement=("redacted@example.com",),
)

Example 3 -- Remove all IPs including private ranges:

from datatrove.pipeline.formatters import PIIFormatter

formatter = PIIFormatter(
    remove_emails=False,
    remove_ips=True,
    only_remove_public_ips=False,
)
