Principle: ProtectAI LLM Guard PII Anonymization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Privacy, Named_Entity_Recognition |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
A multi-layered entity recognition and replacement technique that detects personally identifiable information in text using NER models, regex patterns, and rule-based recognizers, then substitutes detected entities with reversible placeholders.
Description
PII anonymization combines three complementary detection methods to maximize recall:
- NER-based detection: Transformer models (e.g., DeBERTa fine-tuned on Ai4Privacy dataset) identify named entities like persons, organizations, and locations.
- Regex-based detection: Predefined patterns match structured PII such as credit card numbers, SSNs, email addresses, phone numbers, and UUIDs.
- Rule-based recognizers: Custom recognizers handle locale-specific patterns (e.g., Chinese phone numbers, cryptocurrency addresses).
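The regex layer can be sketched with a few illustrative patterns. These patterns and the `detect_with_regex` helper are hypothetical simplifications; production recognizers use far more thorough patterns plus checksum validation (e.g., Luhn for credit cards).

```python
import re

# Hypothetical minimal patterns for illustration only.
PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "UUID": re.compile(
        r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
        re.IGNORECASE,
    ),
}

def detect_with_regex(text: str) -> list[dict]:
    """Return one hit dict per pattern match, sorted by start offset."""
    hits = []
    for entity_type, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append({
                "start": m.start(), "end": m.end(),
                "type": entity_type, "value": m.group(),
            })
    return sorted(hits, key=lambda h: h["start"])
```

The hit dicts carry character offsets so that downstream conflict resolution can compare spans across detection methods.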
Detected entities are replaced with indexed placeholders (e.g., [REDACTED_PERSON_1]) and the original values are stored in a Vault for later deanonymization. Optionally, fake data can be generated instead of placeholders using the Faker library.
The system resolves conflicts between overlapping entity detections by merging adjacent entities of the same type and selecting the highest-confidence detection when multiple recognizers identify overlapping spans.
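The two resolution steps can be sketched as follows; the function names and the dict-based entity shape (`start`, `end`, `type`, `score`) are assumptions for illustration, not the library's internal API.

```python
def resolve_overlaps(entities: list[dict]) -> list[dict]:
    """Keep only the highest-scoring detection among overlapping spans."""
    kept: list[dict] = []
    for ent in sorted(entities, key=lambda e: -e["score"]):
        # Accept an entity only if it overlaps nothing already kept.
        if all(ent["end"] <= k["start"] or ent["start"] >= k["end"] for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])

def merge_adjacent(entities: list[dict], text: str) -> list[dict]:
    """Merge same-type entities separated only by whitespace.

    Assumes `entities` is sorted by start offset (as resolve_overlaps returns).
    """
    merged: list[dict] = []
    for ent in entities:
        if (merged
                and ent["type"] == merged[-1]["type"]
                and text[merged[-1]["end"]:ent["start"]].isspace()):
            merged[-1]["end"] = ent["end"]
            merged[-1]["score"] = max(merged[-1]["score"], ent["score"])
        else:
            merged.append(dict(ent))
    return merged
```

For example, a low-confidence ORG span overlapping a high-confidence PERSON span is dropped, and "John" plus "Smith" detected separately collapse into one PERSON entity.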
Usage
Use this principle when user prompts may contain PII that should not be sent to external LLM APIs. It is the first step in a reversible anonymization pipeline, paired with deanonymization on the output side, and is essential for meeting GDPR, HIPAA, and other data-privacy compliance requirements.
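The output side of the pipeline reverses the substitution by looking each placeholder up in the vault. A minimal sketch, assuming the vault is a list of (placeholder, original) pairs as produced during anonymization:

```python
def deanonymize(text: str, vault: list[tuple[str, str]]) -> str:
    """Restore original values by replacing each vault placeholder."""
    for placeholder, original in vault:
        text = text.replace(placeholder, original)
    return text
```

This works because the indexed placeholders (e.g., [REDACTED_PERSON_1]) are unique strings that are unlikely to occur naturally in LLM output.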
Theoretical Basis
The anonymization pipeline follows a three-stage process:
```python
# Sketch of the PII anonymization pipeline; each entity is a dict
# with "start", "end", "type", "value", and "score" keys.

# Stage 1: Entity detection (multi-method)
entities = []
entities += ner_model.detect(text)         # transformer NER
entities += regex_patterns.match(text)     # regex patterns
entities += rule_recognizers.detect(text)  # custom rules

# Stage 2: Conflict resolution
entities = resolve_overlaps(entities)  # keep highest score on overlap
entities = merge_adjacent(entities)    # merge whitespace-separated same-type

# Stage 3: Replacement with vault storage
# Replace from the end of the text so earlier character offsets stay valid.
vault = []
for index, entity in enumerate(
        sorted(entities, key=lambda e: e["start"], reverse=True), start=1):
    placeholder = f"[REDACTED_{entity['type']}_{index}]"
    vault.append((placeholder, entity["value"]))
    text = text[:entity["start"]] + placeholder + text[entity["end"]:]
```
Entity detection confidence scores are used to calculate risk scores, normalized against a configurable threshold.
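One plausible normalization, shown here purely for illustration (the actual formula is configurable and not specified above), maps confidences at or below the threshold to 0.0 and scales the excess into (0, 1]:

```python
def risk_score(entities: list[dict], threshold: float = 0.5) -> float:
    """Normalize the highest detection confidence against a threshold.

    Hypothetical formula: scores <= threshold yield 0.0; the excess
    above the threshold is rescaled into the (0, 1] range.
    """
    if not entities:
        return 0.0
    top = max(e["score"] for e in entities)
    if top <= threshold:
        return 0.0
    return round((top - threshold) / (1.0 - threshold), 2)
```

A caller can then compare the normalized score against a policy cutoff to decide whether to block the prompt or merely log the detection.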