Workflow:Protectai Llm guard PII Anonymization Deanonymization
| Knowledge Sources | |
|---|---|
| Domains | LLM_Security, PII_Protection, Data_Privacy |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
End-to-end process for protecting personally identifiable information (PII) in LLM interactions by anonymizing sensitive entities before the LLM call and restoring them in the output afterward.
Description
This workflow implements the Anonymize-Deanonymize pattern, a cross-cutting concern that spans the entire LLM call boundary. The Anonymize input scanner detects PII entities (names, emails, phone numbers, credit cards, IP addresses, and more) using NER models, regex patterns, and the Presidio framework. Detected entities are replaced with consistent placeholders or fake data. A shared Vault object stores the mapping between placeholders and original values. After the LLM produces a response using the anonymized prompt, the Deanonymize output scanner restores original values by reversing the placeholder mappings from the Vault.
Usage
Execute this workflow when your LLM application processes user data containing personally identifiable information that must not be exposed to the LLM provider. This is critical for compliance with privacy regulations (GDPR, HIPAA, CCPA) and for preventing PII leakage through model responses.
Execution Steps
Step 1: Initialize the Vault
Create a Vault instance that will store the bidirectional mapping between original PII entities and their placeholder replacements. This single Vault instance must be shared between the Anonymize input scanner and the Deanonymize output scanner.
Key considerations:
- The Vault is an in-memory store; its mappings exist only as long as the instance does, so create a fresh Vault per request (or per conversation) rather than reusing one across users
- One Vault instance must be shared across the Anonymize and Deanonymize scanner pair
- The Vault maps placeholder strings back to their original values for later restoration
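As a rough mental model of the shared store, the sketch below uses a hypothetical `ToyVault` class. It is a stand-in written for illustration, not LLM Guard's own Vault implementation, and the placeholder string is made up. It only shows that one object is written during anonymization and read during deanonymization.

```python
from typing import List, Tuple


class ToyVault:
    """Illustrative stand-in for a shared placeholder store (hypothetical class)."""

    def __init__(self) -> None:
        # Each entry is a (placeholder, original_value) pair.
        self._entries: List[Tuple[str, str]] = []

    def append(self, placeholder: str, original: str) -> None:
        self._entries.append((placeholder, original))

    def get(self) -> List[Tuple[str, str]]:
        # Return a copy so callers cannot mutate the store by accident.
        return list(self._entries)


# One instance per request, handed to both sides of the LLM call.
vault = ToyVault()
vault.append("[REDACTED_PERSON_1]", "John Doe")        # written while anonymizing
restored = dict(vault.get())["[REDACTED_PERSON_1]"]    # read while deanonymizing
```

If the two scanners were given different instances, the second one would find no entries and nothing could be restored.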
Step 2: Configure the Anonymize scanner
Instantiate the Anonymize input scanner with the shared Vault and configure its detection capabilities. The scanner supports multiple detection methods: transformer-based NER models, regex pattern matching, and Presidio-based entity recognition. Choose between placeholder replacement (e.g., replacing "John Doe" with "[REDACTED_PERSON_1]") and faker-based replacement (e.g., replacing it with a realistic fake name).
Key considerations:
- Set use_faker to True for realistic fake data replacement, False for bracket-style placeholders
- Configure the NER model: the default model handles common entities; specialized models like the AI4Privacy DeBERTa model provide broader coverage
- Set the recognition threshold to balance recall (catching all PII) against precision (avoiding false positives): a lower threshold catches more PII but flags more harmless text
- Supported entity types include PERSON, EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, IP_ADDRESS, LOCATION, and many more
- Language-specific recognizers are available for Chinese text (phone, email, IP, crypto addresses)
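The threshold trade-off can be illustrated with a small sketch. The `Detection` record, the `keep_confident` helper, and the scores below are all hypothetical and invented for this example; they are not LLM Guard's internal types, only a way to see how one cut-off value changes which candidates get anonymized.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: entity type, matched text, confidence score."""
    entity_type: str
    text: str
    score: float


def keep_confident(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections whose score meets the threshold."""
    return [d for d in detections if d.score >= threshold]


# Made-up scores, for illustration only.
candidates = [
    Detection("PERSON", "John Doe", 0.95),
    Detection("LOCATION", "Paris", 0.62),
    Detection("PERSON", "May", 0.30),  # plausibly a false positive (month name)
]

strict = keep_confident(candidates, threshold=0.8)   # favors precision
lenient = keep_confident(candidates, threshold=0.5)  # favors recall
```

With the strict cut-off only "John Doe" is replaced; the lenient one also replaces "Paris", and neither picks up the low-scoring "May".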
Step 3: Configure the Deanonymize scanner
Instantiate the Deanonymize output scanner with the same Vault instance used by the Anonymize scanner. Configure the matching strategy that determines how placeholders in the LLM output are mapped back to original values.
Key considerations:
- The matching_strategy parameter controls how aggressively the scanner searches for placeholders
- exact matching looks for exact placeholder strings in the output
- case_insensitive matching handles cases where the LLM changed the case of placeholders
- fuzzy matching handles cases where the LLM slightly modified placeholder text
- combined matching tries exact, then case-insensitive, then fuzzy matching in sequence
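To make the difference between the strategies concrete, here is a minimal standard-library sketch. The function names, the `min_ratio` cut-off, and the use of `difflib` are assumptions chosen for illustration; LLM Guard's own matching code may work differently.

```python
import re
from difflib import SequenceMatcher
from typing import Dict


def restore_exact(text: str, mapping: Dict[str, str]) -> str:
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


def restore_case_insensitive(text: str, mapping: Dict[str, str]) -> str:
    for placeholder, original in mapping.items():
        # A callable replacement keeps any backslashes in `original` literal.
        text = re.sub(re.escape(placeholder), lambda _m, o=original: o,
                      text, flags=re.IGNORECASE)
    return text


def restore_fuzzy(text: str, mapping: Dict[str, str], min_ratio: float = 0.85) -> str:
    if not mapping:
        return text

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def swap(match):
        token = match.group(0)
        # Pick the closest known placeholder for this bracketed token.
        best = max(mapping, key=lambda p: similarity(token, p))
        return mapping[best] if similarity(token, best) >= min_ratio else token

    return re.sub(r"\[[^\[\]]+\]", swap, text)


def restore_combined(text: str, mapping: Dict[str, str]) -> str:
    for strategy in (restore_exact, restore_case_insensitive, restore_fuzzy):
        text = strategy(text, mapping)
    return text


mapping = {
    "[REDACTED_PERSON_1]": "John Doe",
    "[REDACTED_EMAIL_ADDRESS_1]": "john@example.com",
}
restored = restore_combined(
    "Mail [Redacted_Email_Address_1] for [REDACTED PERSON 1]", mapping
)
```

In this sketch the fuzzy step is the most permissive: a bracketed token that merely resembles a known placeholder (for example one with a different trailing number) can clear the similarity cut-off, so looser matching trades safety for coverage.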
Step 4: Scan and anonymize the prompt
Run the user prompt through the Anonymize scanner. The scanner detects PII entities, stores the original-to-placeholder mappings in the Vault, and returns the anonymized prompt. The anonymized prompt can then be sent to the LLM because every detected entity has been replaced; anything the detectors miss passes through unchanged.
What happens:
- NER models identify named entities (persons, organizations, locations)
- Regex patterns detect structured data (emails, phone numbers, credit cards, IP addresses)
- Presidio analyzers provide additional entity recognition with configurable confidence scores
- Each detected entity is mapped to a unique, consistent placeholder
- Identical entity values receive the same placeholder for consistency
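The consistent-placeholder behavior described above can be sketched with regex detection alone. The two patterns, the `anonymize` helper, and the placeholder format are simplified assumptions for illustration; the NER and Presidio stages are left out, and real detectors are far more thorough.

```python
import re
from typing import Dict, List, Tuple

# Deliberately simple patterns, for illustration only.
PATTERNS: Dict[str, str] = {
    "EMAIL_ADDRESS": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "IP_ADDRESS": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
}


def anonymize(prompt: str, vault: List[Tuple[str, str]]) -> str:
    """Replace regex matches with numbered placeholders and record them in `vault`."""
    for entity_type, pattern in PATTERNS.items():
        counter = 0
        seen: Dict[str, str] = {}  # original value -> placeholder, per entity type

        def substitute(match):
            nonlocal counter
            value = match.group(0)
            if value not in seen:
                # First time this value appears: allocate the next placeholder.
                counter += 1
                seen[value] = f"[REDACTED_{entity_type}_{counter}]"
                vault.append((seen[value], value))
            return seen[value]

        prompt = re.sub(pattern, substitute, prompt)
    return prompt


vault: List[Tuple[str, str]] = []
safe_prompt = anonymize(
    "Email alice@example.com and bob@example.org; cc alice@example.com from 10.0.0.5",
    vault,
)
```

Both occurrences of the first address map to the same placeholder, so the model can still tell that they refer to the same entity.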
Step 5: Send anonymized prompt to the LLM
Pass the anonymized prompt to the LLM API. The model processes the prompt without ever seeing the original PII, generating a response that may contain the placeholder tokens.
Key considerations:
- The LLM sees only placeholders or fake data, never the real PII
- Response quality may differ from a run on real data, because placeholders remove context the model might otherwise use (for example, what kind of name or place was mentioned)
- Faker-based replacement generally produces more natural model responses than bracket-style placeholders
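One optional safeguard at this step is to confirm that none of the recorded original values are still present before the prompt leaves the application. The sketch below is an assumption-laden illustration: `send_if_clean` and `fake_llm` are hypothetical helpers, and `fake_llm` stands in for whatever LLM client is actually used.

```python
from typing import Callable, Dict


def send_if_clean(prompt: str, mapping: Dict[str, str], llm: Callable[[str], str]) -> str:
    """Call `llm` only if no original value from `mapping` is still in the prompt."""
    leaked = [original for original in mapping.values() if original in prompt]
    if leaked:
        # Report a count rather than the values, to keep PII out of logs.
        raise ValueError(f"{len(leaked)} original value(s) still present in prompt")
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; greets the last token of the prompt."""
    return "Hello " + prompt.split()[-1] + ", welcome!"


mapping = {"[REDACTED_PERSON_1]": "John Doe"}
reply = send_if_clean("Write a greeting for [REDACTED_PERSON_1]", mapping, fake_llm)
```

The reply still contains the placeholder, which is what the deanonymization step expects to find.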
Step 6: Deanonymize the output
Run the LLM's response through the Deanonymize output scanner. The scanner locates placeholder tokens in the response text and replaces them with the original PII values from the Vault, producing a response that contains the correct original information.
Key considerations:
- The Deanonymize scanner is called with both the prompt and the LLM output; the replacement itself is driven by the placeholder entries stored in the Vault
- If the LLM invents placeholder-like strings that have no Vault entry, they are left unchanged in the output
- The restored output should be treated as sensitive and handled according to your data retention policies
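The restoration step, including the case of an invented placeholder, can be sketched as follows. The `deanonymize` helper uses plain exact matching and a list of pairs as the vault; both are simplifications for illustration and do not reproduce LLM Guard's scanner.

```python
from typing import List, Tuple


def deanonymize(output: str, vault: List[Tuple[str, str]]) -> str:
    """Replace each known placeholder with its original value (exact matching)."""
    for placeholder, original in vault:
        output = output.replace(placeholder, original)
    return output


vault = [
    ("[REDACTED_PERSON_1]", "John Doe"),
    ("[REDACTED_EMAIL_ADDRESS_1]", "john@example.com"),
]
# The phone placeholder below was never issued, so it has no vault entry.
llm_output = (
    "Contact [REDACTED_PERSON_1] at [REDACTED_EMAIL_ADDRESS_1] "
    "or [REDACTED_PHONE_NUMBER_1]."
)
restored = deanonymize(llm_output, vault)
```

The unknown placeholder survives in the restored text, which makes it easy to spot and handle separately before showing the response to a user.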