Principle:Huggingface Datatrove PII Removal

From Leeroopedia
Sources: Huggingface Datatrove
Domains: Privacy, Data_Cleaning
Last Updated: 2026-02-14

Overview

Detecting and replacing personally identifiable information in document text to protect privacy.

Description

PII removal identifies email addresses and IP addresses in document text using regex patterns and replaces them with safe placeholder values. The approach uses two distinct detection strategies:

  • Email detection -- a comprehensive regex pattern matching RFC-compliant email addresses, including local parts with special characters and domain names with subdomains
  • IP address detection -- a regex matching IPv4 addresses in dotted-decimal notation, with an optional validation step using Python's ipaddress module to distinguish public (PII) from private/reserved IPs
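The two detection strategies can be sketched with simplified patterns. The regexes below are illustrative assumptions, much less thorough than a production email pattern, and are not the patterns shipped with Datatrove; the names EMAIL_RE, IPV4_RE, and find_pii_candidates are hypothetical.

```python
import re

# Simplified illustrative patterns (assumptions, not Datatrove's own regexes).
# Email: a local part with common special characters, then a domain with
# one or more dot-separated labels (so subdomains are covered).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# IPv4: four dotted-decimal octets in the range 0-255. The lookarounds stop
# the pattern from matching inside longer digit or dot runs.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4_RE = re.compile(rf"(?<![\d.])(?:{_OCTET}\.){{3}}{_OCTET}(?!\d|\.\d)")


def find_pii_candidates(text: str) -> dict:
    """Return every email-like and IPv4-like substring found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "ips": IPV4_RE.findall(text),
    }
```

Matches from the IP pattern are only candidates at this stage; whether a candidate counts as PII is decided by the separate validation step.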

Replacements use circular replacement lists to provide variety in placeholder values. For example, the default email replacements cycle through email@example.com and firstname.lastname@example.org (using the reserved example.com and example.org domains). IP replacements cycle through a list of randomly generated addresses that were verified to not respond to ping requests at the time the list was created.
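A minimal sketch of circular replacement, assuming a simplified email regex and a hypothetical helper named make_cycling_replacer. In this sketch the cycle position persists across calls; whether the real implementation behaves the same way is not stated here.

```python
import itertools
import re

# Simplified email pattern used only for illustration (an assumption).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def make_cycling_replacer(pattern, replacements):
    """Build a function that swaps each match for the next placeholder."""
    pool = itertools.cycle(replacements)  # endless round-robin iterator

    def replace(text: str) -> str:
        # re.sub calls the lambda once per match, left to right.
        return pattern.sub(lambda _match: next(pool), text)

    return replace


replace_emails = make_cycling_replacer(
    EMAIL_RE,
    ("email@example.com", "firstname.lastname@example.org"),
)
```

With two placeholders, the first and third matches in a text receive the same value and the second receives the other.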

The separation of detection (regex + validation) from replacement (circular list) allows each PII type to be independently configured or disabled.
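One way to picture this separation is a small replacer object that owns a pattern, an optional validator, and its own placeholder cycle. The class PatternReplacer, the function scrub, the simplified regexes, the toy validator, and the placeholder 192.0.2.1 (from a documentation-reserved range) are all hypothetical choices for this sketch, not Datatrove's API.

```python
import itertools
import re


class PatternReplacer:
    """Detect with a regex plus optional validator; replace from a cycle."""

    def __init__(self, pattern, replacements, validator=None):
        self.pattern = re.compile(pattern)
        self.validator = validator
        self._pool = itertools.cycle(replacements)

    def replace(self, text: str) -> str:
        def pick(match):
            value = match.group(0)
            # Leave the match untouched if the validator rejects it.
            if self.validator is not None and not self.validator(value):
                return value
            return next(self._pool)

        return self.pattern.sub(pick, text)


def scrub(text: str, replacers) -> str:
    """Apply each enabled replacer in turn; omit one to disable that type."""
    for replacer in replacers:
        text = replacer.replace(text)
    return text


email_replacer = PatternReplacer(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", ["email@example.com"]
)
ip_replacer = PatternReplacer(
    r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    ["192.0.2.1"],
    validator=lambda ip: not ip.startswith("10."),  # toy check for the demo
)
```

Passing only email_replacer to scrub leaves IP addresses alone, which mirrors disabling IP removal without touching the email logic.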

Usage

Apply PII removal as a formatting step before publishing a dataset, after deduplication and other text processing. It is typically placed near the end of the pipeline, immediately before the final writer stage.

Theoretical Basis

PII detection through pattern matching is a practical approach for structured identifiers (emails, IPs) that follow well-defined syntactic rules. The use of the ipaddress module for IP validation ensures that private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and other reserved ranges are not unnecessarily replaced, as these do not constitute personally identifiable information.
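The validation step can be sketched with the standard-library ipaddress module. The function name is_public_ip is hypothetical, and using the is_global attribute as the public/private test is an assumption about how such a validator might be written.

```python
import ipaddress


def is_public_ip(candidate: str) -> bool:
    """Return True only for syntactically valid, globally routable IPs."""
    try:
        address = ipaddress.ip_address(candidate)
    except ValueError:
        # Not a valid IP at all (for example, an octet above 255).
        return False
    # is_global is False for private, loopback, and other reserved ranges.
    return address.is_global
```

A replacer would call such a validator on each regex match and substitute a placeholder only when it returns True, so addresses like 192.168.1.1 pass through unchanged.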

Related Pages
