Principle: ProtectAI LLM Guard Prompt Injection Detection
| Knowledge Sources | |
|---|---|
| Domains | NLP, Security, Adversarial_ML |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
A binary text-classification technique that detects adversarial prompt injection attacks by using fine-tuned transformer models to label input text as either legitimate user input or an injection attempt.
Description
Prompt injection is an adversarial attack where a user crafts input that causes an LLM to ignore its system instructions and follow attacker-controlled directives instead. Detection relies on fine-tuned classification models (typically DeBERTa-based) trained on datasets of known injection patterns.
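The classifier's output contract can be sketched as follows. This assumes the model emits `{"label": ..., "score": ...}` dicts in the style of a Hugging Face text-classification pipeline, with `INJECTION`/`SAFE` as the two labels; the helper name is illustrative:

```python
def injection_probability(result: dict) -> float:
    """Normalize a binary classifier output to an injection probability.

    `result` is assumed to look like {"label": "INJECTION" | "SAFE",
    "score": confidence}, so a confident SAFE verdict maps to a low
    injection probability.
    """
    if result["label"] == "INJECTION":
        return result["score"]
    return 1.0 - result["score"]
```

A SAFE verdict at 0.97 confidence thus maps to a 0.03 injection probability, which is what gets compared against the detection threshold.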
The detection supports multiple input segmentation strategies to handle different attack vectors:
- Full text: Classify the entire prompt as one unit.
- Sentence-level: Split into sentences and classify each independently (catches injections embedded in longer text).
- Truncated head-tail: Analyze beginning and end of long prompts (catches tail-end injections).
- Chunked: Split into overlapping character windows for very long inputs.
The highest injection score across all segments is used for the final decision against a configurable threshold (default: 0.92).
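The four segmentation strategies above can be sketched in plain Python. Strategy names, window sizes, and the naive sentence splitter here are illustrative assumptions, not the library's exact defaults:

```python
import re

def segment(prompt: str, match_type: str,
            chunk_size: int = 64, overlap: int = 16) -> list[str]:
    """Split a prompt into segments according to the chosen strategy."""
    if match_type == "full":
        # Classify the entire prompt as one unit.
        return [prompt]
    if match_type == "sentence":
        # Naive split on terminal punctuation followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", prompt) if s]
    if match_type == "truncate_head_tail":
        # Keep only the beginning and end of long prompts.
        if len(prompt) <= 2 * chunk_size:
            return [prompt]
        return [prompt[:chunk_size], prompt[-chunk_size:]]
    if match_type == "chunks":
        # Overlapping character windows for very long inputs.
        step = chunk_size - overlap
        return [prompt[i:i + chunk_size]
                for i in range(0, max(len(prompt) - overlap, 1), step)]
    raise ValueError(f"unknown match_type: {match_type}")
```

Overlap in the chunked strategy matters: an injection string that straddles a window boundary would otherwise be split across two segments and score low in both.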
Usage
Use this principle as a mandatory first-line defense in any LLM-facing application. It should be one of the first scanners in the input pipeline to reject injection attempts before other scanners process the text.
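The ordering requirement can be illustrated with a minimal pipeline sketch. Scanner names and the `(prompt, is_valid)` return shape are assumptions for illustration, not LLM Guard's actual API:

```python
def prompt_injection_scanner(prompt):
    # Stand-in for the classifier-backed injection scanner.
    is_safe = "ignore previous instructions" not in prompt.lower()
    return prompt, is_safe

def length_scanner(prompt):
    # Example of a downstream scanner that should never see rejected input.
    return prompt, len(prompt) <= 4096

def run_pipeline(prompt, scanners):
    """Run scanners in order, stopping at the first rejection."""
    for scan in scanners:
        prompt, ok = scan(prompt)
        if not ok:
            return prompt, False
    return prompt, True

# Injection detection runs first, so a rejected prompt is short-circuited
# before any later scanner processes it.
pipeline = [prompt_injection_scanner, length_scanner]
```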
Theoretical Basis
The detection follows a classify-and-aggregate pattern:
```python
# Pseudocode for prompt injection detection
segments = match_type.get_inputs(prompt)  # Split by strategy
results = classifier(segments)            # Batch classification

highest_score = 0.0
for result in results:
    injection_score = result["score"] if result["label"] == "INJECTION" else 1 - result["score"]
    highest_score = max(highest_score, injection_score)
    if injection_score > threshold:
        return INJECTION_DETECTED
return SAFE
```
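A runnable version of the classify-and-aggregate pattern, with a stub standing in for the fine-tuned model (the stub and the default threshold of 0.92 from above are the only assumptions):

```python
def detect_injection(segments, classifier, threshold=0.92):
    """Return (flagged, highest_score) for a list of prompt segments.

    `classifier` is assumed to return one
    {"label": "INJECTION" | "SAFE", "score": float} dict per segment.
    """
    highest = 0.0
    for result in classifier(segments):
        # Normalize so the score is always "probability of injection".
        score = result["score"] if result["label"] == "INJECTION" else 1.0 - result["score"]
        highest = max(highest, score)
    return highest > threshold, highest

def stub_classifier(segments):
    # Toy stand-in for the fine-tuned transformer classifier.
    return [
        {"label": "INJECTION", "score": 0.98} if "ignore" in s.lower()
        else {"label": "SAFE", "score": 0.99}
        for s in segments
    ]
```

Because the maximum over segments drives the decision, a single malicious sentence buried in an otherwise benign prompt is enough to flag the whole input.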
The model outputs a binary classification (INJECTION vs SAFE) with a confidence score. The score is compared against a threshold to make the final determination.