Principle:Protectai Llm guard Regex Pattern Matching

Knowledge Sources	Protectai_Llm_guard
Domains	Pattern_Matching, Content_Filtering
Last Updated	2026-02-14 12:00 GMT

Overview

Detecting or validating text content using user-defined regular expression patterns with configurable matching strategies.

Description

Regex Pattern Matching is a content filtering principle that scans text against user-defined regular expressions. Unlike fixed-purpose scanners, it offers a general-purpose mechanism for enforcing arbitrary text policies expressed as regex patterns. It serves as a flexible building block for custom validation rules that do not fit into other specialized scanners.

The principle supports three matching modes that control how patterns are applied to text. Search mode finds the first occurrence of the pattern anywhere in the text, useful for detecting forbidden content. Fullmatch mode requires the entire text to match the pattern, useful for validating that input conforms to an expected format. All mode finds every occurrence of the pattern throughout the text, useful for comprehensive scanning and counting.

Two policy modes determine how matches are interpreted. In blocklist mode, a pattern match indicates that the text contains forbidden content and should be flagged. In allowlist mode, a pattern match indicates that the text contains required content, and the absence of a match triggers flagging. This duality allows regex patterns to serve both restrictive and permissive policy functions.

When violations are detected, the system supports optional redaction of matched content, replacing matched substrings with configurable placeholder text rather than rejecting the entire input.

Usage

Use this principle when you need to enforce custom text policies that can be expressed as regular expressions but are not covered by other specialized scanners. Common applications include detecting and redacting phone numbers, email addresses, or custom identifiers; validating that input follows a required format; blocking specific URL patterns or domains; enforcing naming conventions; and catching domain-specific patterns unique to your organization. The principle suits rapid policy deployment: adding a new rule requires only defining a regex pattern, with no model training or data collection.

Theoretical Basis

The pattern matching algorithm operates as follows:

Pattern Compilation:

Compile each user-defined regex pattern into a reusable compiled pattern object (for example, with re.compile)
Patterns are compiled once at initialization and reused across scans
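
A minimal sketch of the compile-once step, using only Python's standard re module. The names RAW_PATTERNS, compile_patterns, and COMPILED, and the two example patterns, are hypothetical and chosen for illustration; they are not taken from the library.

```python
import re

# Hypothetical example patterns (assumptions, not from the library).
RAW_PATTERNS = [
    r"\b\d{3}-\d{4}\b",  # a short phone-like number such as 555-1234
    r"(?i)secret",       # the word "secret", case-insensitive
]


def compile_patterns(raw_patterns):
    """Compile each pattern string once so later scans reuse the objects."""
    return [re.compile(p) for p in raw_patterns]


# Built once at initialization; every later scan reuses this list.
COMPILED = compile_patterns(RAW_PATTERNS)
```

Compiling up front also surfaces an invalid pattern immediately as re.error, rather than at the first scan.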

Matching Modes:

Search: Apply re.search(pattern, text) to find the first match anywhere in the text
Fullmatch: Apply re.fullmatch(pattern, text) to check if the entire text matches the pattern
All: Apply re.findall(pattern, text) or re.finditer(pattern, text) to locate every occurrence
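
The three modes can be sketched as a single dispatch function. The helper name find_matches and the string mode labels are assumptions made for this example; only the re calls themselves come from the description above.

```python
import re


def find_matches(pattern, text, mode):
    """Return a list of match objects for the chosen mode (hypothetical helper)."""
    if mode == "search":
        # First occurrence anywhere in the text, or nothing.
        m = pattern.search(text)
        return [m] if m else []
    if mode == "fullmatch":
        # The whole text must match the pattern.
        m = pattern.fullmatch(text)
        return [m] if m else []
    if mode == "all":
        # Every non-overlapping occurrence, left to right.
        return list(pattern.finditer(text))
    raise ValueError(f"unknown mode: {mode}")


digits = re.compile(r"\d+")
sample = "order 12 and 345"
```

With this sample, search yields only the first number, fullmatch yields nothing because the text contains more than digits, and all yields both numbers. finditer is used instead of findall so that each result carries its position, which is useful if matches are later redacted.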

Policy Evaluation:

In blocklist mode: if any pattern produces a match, the text is flagged as containing forbidden content
In allowlist mode: if no pattern produces a match, the text is flagged as missing required content
Multiple patterns can be specified and are evaluated independently
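
The policy rules above reduce to one boolean decision. The following is a sketch under assumptions: the function name is_flagged and the is_blocked parameter are hypothetical, and search mode is used for simplicity.

```python
import re


def is_flagged(patterns, text, is_blocked=True):
    """Decide whether text violates the policy (hypothetical helper).

    Blocklist (is_blocked=True): flagged if any pattern matches.
    Allowlist (is_blocked=False): flagged if no pattern matches.
    """
    matched = any(p.search(text) for p in patterns)
    return matched if is_blocked else not matched


# Hypothetical policies for illustration.
forbidden = [re.compile(r"password"), re.compile(r"\bssn\b")]
required = [re.compile(r"^TICKET-\d+")]
```

For example, is_flagged(forbidden, "my password is x") returns True, while is_flagged(required, "TICKET-42 is broken", is_blocked=False) returns False because the required prefix is present.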

Redaction (optional):

For each match, replace the matched substring with a configurable placeholder
Use re.sub(pattern, replacement, text) to produce the redacted output
Return the redacted text alongside the validation result
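
A minimal redaction sketch along these lines, assuming a blocklist policy. The function name redact, the default placeholder, and the simplified email pattern are all hypothetical. It uses Pattern.subn, the variant of re.sub that also reports how many replacements were made.

```python
import re


def redact(patterns, text, placeholder="[REDACTED]"):
    """Replace every match with the placeholder (hypothetical helper).

    Returns (redacted_text, flagged), where flagged is True if at
    least one substitution was made.
    """
    flagged = False
    for p in patterns:
        # A callable replacement inserts the placeholder literally, so
        # backslashes in it are not treated as group references.
        text, count = p.subn(lambda _m: placeholder, text)
        flagged = flagged or count > 0
    return text, flagged


# Deliberately simplified email pattern, for illustration only.
email = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
result = redact([email], "mail bob@example.com now")
```

Here result is ("mail [REDACTED] now", True). Because patterns are applied one after another, a later pattern could in principle match text inserted by an earlier replacement, so the placeholder should be chosen so that no pattern matches it.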
