Principle:Protectai Llm guard Regex Pattern Matching
| Knowledge Sources | |
|---|---|
| Domains | Pattern_Matching, Content_Filtering |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Detecting or validating text content using user-defined regular expression patterns with configurable matching strategies.
Description
Regex Pattern Matching is a content filtering principle that provides user-defined, pattern-based text scanning using regular expressions. Unlike fixed-purpose scanners, this principle offers a general-purpose mechanism for enforcing arbitrary text policies expressed as regex patterns. It serves as a flexible building block for custom validation rules that do not fit neatly into other specialized scanners.
The principle supports three matching modes that control how patterns are applied to text. Search mode finds the first occurrence of the pattern anywhere in the text, useful for detecting forbidden content. Fullmatch mode requires the entire text to match the pattern, useful for validating that input conforms to an expected format. All mode finds every occurrence of the pattern throughout the text, useful for comprehensive scanning and counting.
Two policy modes determine how matches are interpreted. In blocklist mode, a pattern match indicates that the text contains forbidden content and should be flagged. In allowlist mode, a pattern match indicates that the text contains required content, and the absence of a match triggers flagging. This duality allows regex patterns to serve both restrictive and permissive policy functions.
When violations are detected, the system supports optional redaction of matched content, replacing matched substrings with configurable placeholder text rather than rejecting the entire input.
Usage
Use this principle when you need to enforce custom text policies that can be expressed as regular expressions but are not covered by other specialized scanners. Common applications include detecting and redacting phone numbers, email addresses, or custom identifiers; validating that input follows a required format; blocking specific URL patterns or domains; enforcing naming conventions; and catching domain-specific patterns unique to your organization. This principle is ideal for rapid policy deployment since adding a new rule requires only defining a regex pattern, with no model training or data collection.
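As an illustration of two of the applications above, the following minimal sketch uses only Python's standard `re` module. The pattern strings, the `TICKET_ID` format, and the function names are hypothetical and deliberately simplified; they are not taken from any particular scanner implementation.

```python
import re

# Hypothetical, simplified patterns; production policies usually need stricter ones.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TICKET_ID = re.compile(r"[A-Z]{3}-\d{4}")  # assumed internal identifier format


def contains_email(text: str) -> bool:
    """Detection: True if an email-like substring appears anywhere in the text."""
    return EMAIL.search(text) is not None


def is_valid_ticket_id(text: str) -> bool:
    """Format validation: True only if the whole input is a ticket identifier."""
    return TICKET_ID.fullmatch(text) is not None
```

Detection uses `search` because the forbidden content may sit anywhere in the text, while format validation uses `fullmatch` so that surrounding text causes a rejection.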
Theoretical Basis
The pattern matching algorithm operates as follows:
Pattern Compilation:
- Compile each user-defined regex pattern into a reusable pattern object (for example, with `re.compile`)
- Patterns are compiled once at initialization and reused across scans
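The compile-once step might look like the sketch below. The class name `CompiledPatterns` and its `any_match` method are hypothetical names chosen for illustration; only `re.compile` and the pattern object's `search` method come from the standard library.

```python
import re
from typing import List, Pattern


class CompiledPatterns:
    """Hypothetical holder that compiles every pattern once, at construction time."""

    def __init__(self, patterns: List[str]) -> None:
        if not patterns:
            raise ValueError("at least one pattern is required")
        # re.compile raises re.error on invalid syntax, so a bad pattern
        # fails here rather than during a later scan.
        self.compiled: List[Pattern[str]] = [re.compile(p) for p in patterns]

    def any_match(self, text: str) -> bool:
        """Reuse the precompiled patterns; no recompilation per scan."""
        return any(p.search(text) for p in self.compiled)
```

Compiling in the constructor also surfaces malformed patterns early, before any text is scanned.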
Matching Modes:
- Search: Apply `re.search(pattern, text)` to find the first match anywhere in the text
- Fullmatch: Apply `re.fullmatch(pattern, text)` to check if the entire text matches the pattern
- All: Apply `re.findall(pattern, text)` or `re.finditer(pattern, text)` to locate every occurrence
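The three modes can be compared side by side with a small dispatcher. This is a sketch under stated assumptions: the function name `find_matches` and the mode strings `"search"`, `"fullmatch"`, and `"all"` are hypothetical, not the identifiers of any specific library.

```python
import re
from typing import List


def find_matches(pattern: str, text: str, mode: str = "search") -> List[str]:
    """Return the matched substrings under the given (hypothetical) mode name."""
    compiled = re.compile(pattern)
    if mode == "search":
        m = compiled.search(text)
        return [m.group(0)] if m else []
    if mode == "fullmatch":
        m = compiled.fullmatch(text)
        return [m.group(0)] if m else []
    if mode == "all":
        # finditer with group(0) yields whole matches even when the
        # pattern contains capture groups.
        return [m.group(0) for m in compiled.finditer(text)]
    raise ValueError(f"unknown mode: {mode}")
```

For the text `"a1 b22 c333"` and the pattern `\d+`, search mode yields `["1"]`, all mode yields `["1", "22", "333"]`, and fullmatch mode yields an empty list because the text is not made up of digits only. The sketch prefers `finditer` over `findall` because `findall` returns the captured groups rather than the whole match when the pattern contains groups.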
Policy Evaluation:
- In blocklist mode: if any pattern produces a match, the text is flagged as containing forbidden content
- In allowlist mode: if no pattern produces a match, the text is flagged as missing required content
- Multiple patterns can be specified and are evaluated independently
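The policy rules above reduce to a single boolean inversion, sketched here. The function name `is_flagged` and the `blocklist` parameter are hypothetical, and search-style matching is assumed for simplicity.

```python
import re
from typing import Iterable


def is_flagged(patterns: Iterable[str], text: str, blocklist: bool = True) -> bool:
    """Blocklist: flag when any pattern matches. Allowlist: flag when none match."""
    matched = any(re.search(p, text) for p in patterns)
    return matched if blocklist else not matched
```

With `blocklist=False`, the text `"Ticket #42: printer broken"` passes a required-prefix pattern such as `^Ticket #\d+`, while `"printer broken"` is flagged for missing it.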
Redaction (optional):
- For each match, replace the matched substring with a configurable placeholder
- Use `re.sub(pattern, replacement, text)` to produce the redacted output
- Return the redacted text alongside the validation result
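A minimal redaction sketch follows. It swaps in `re.subn`, the standard-library variant of `re.sub` that also reports how many substitutions were made, so the function can return a match indicator alongside the redacted text. The function name `redact` and the default placeholder `"[REDACTED]"` are assumptions made for illustration.

```python
import re
from typing import List, Tuple


def redact(
    patterns: List[str], text: str, placeholder: str = "[REDACTED]"
) -> Tuple[str, bool]:
    """Replace every match of every pattern; return (redacted_text, any_match_found)."""
    redacted = text
    found = False
    for p in patterns:
        # A callable replacement inserts the placeholder literally, so
        # backslashes or group references inside it are not interpreted.
        redacted, count = re.subn(p, lambda _m: placeholder, redacted)
        found = found or count > 0
    return redacted, found
```

Because the patterns are applied one after another to the already-redacted text, a later pattern could in principle match a placeholder inserted by an earlier one; choosing a placeholder that none of the patterns can match avoids this.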