Principle:ProtectAI LLM Guard Prompt Injection Detection

From Leeroopedia
Knowledge Sources
Domains NLP, Security, Adversarial_ML
Last Updated 2026-02-14 12:00 GMT

Overview

A binary text-classification technique that detects adversarial prompt injection attacks: a fine-tuned transformer model labels input text as either legitimate user input or an injection attempt.

Description

Prompt injection is an adversarial attack where a user crafts input that causes an LLM to ignore its system instructions and follow attacker-controlled directives instead. Detection relies on fine-tuned classification models (typically DeBERTa-based) trained on datasets of known injection patterns.

The detection supports multiple input segmentation strategies to handle different attack vectors:

  • Full text: Classify the entire prompt as one unit.
  • Sentence-level: Split into sentences and classify each independently (catches injections embedded in longer text).
  • Truncated head-tail: Analyze beginning and end of long prompts (catches tail-end injections).
  • Chunked: Split into overlapping character windows for very long inputs.
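The four segmentation strategies above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names, the sentence-splitting regex, and the default window sizes are assumptions chosen for clarity (a real scanner would use a proper sentence splitter and token-aware limits).

```python
import re

def segment_full(text):
    # Strategy 1: classify the whole prompt as a single unit
    return [text]

def segment_sentences(text):
    # Strategy 2: naive sentence split on terminal punctuation
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def segment_head_tail(text, head=256, tail=256):
    # Strategy 3: keep only the beginning and end of a long prompt
    if len(text) <= head + tail:
        return [text]
    return [text[:head], text[-tail:]]

def segment_chunks(text, size=512, overlap=64):
    # Strategy 4: overlapping character windows for very long inputs
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Each strategy trades coverage for cost: full-text is cheapest but can dilute a short injection inside a long prompt, while sentence-level and chunked segmentation isolate it at the price of more classifier calls.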

The highest injection score across all segments is compared against a configurable threshold (default: 0.92) to make the final decision.

Usage

Use this principle as a mandatory first-line defense in any LLM-facing application. Place it among the first scanners in the input pipeline so that injection attempts are rejected before other scanners process the text.
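The fail-fast ordering described above can be sketched as a simple scanner pipeline. Everything here is illustrative: the pipeline function, the keyword-matching stand-in for the real classifier, and the length scanner are hypothetical examples, not LLM Guard's API.

```python
def run_input_pipeline(prompt, scanners):
    # Run scanners in order; reject as soon as any scanner flags the input,
    # so later (often more expensive) scanners never see an injection attempt.
    for scanner in scanners:
        ok, risk = scanner(prompt)
        if not ok:
            return False, scanner.__name__, risk
    return True, None, 0.0

def injection_scanner(prompt, threshold=0.92):
    # Stand-in for the real classifier: flag a well-known jailbreak phrase
    risk = 0.99 if "ignore previous instructions" in prompt.lower() else 0.01
    return risk <= threshold, risk

def length_scanner(prompt, max_chars=4096):
    # Example of a cheaper downstream check that runs only on clean input
    return len(prompt) <= max_chars, float(len(prompt) > max_chars)

# Injection detection deliberately comes first in the scanner list:
# pipeline = [injection_scanner, length_scanner]
```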

Theoretical Basis

The detection follows a classify-and-aggregate pattern:

# Pseudocode for prompt injection detection
def detect_injection(prompt, classifier, match_type, threshold=0.92):
    segments = match_type.get_inputs(prompt)  # split by the chosen strategy
    results = classifier(segments)            # batch classification

    highest_score = 0.0
    for result in results:
        # Normalize so the score always measures injection likelihood
        if result["label"] == "INJECTION":
            injection_score = result["score"]
        else:
            injection_score = 1.0 - result["score"]
        highest_score = max(highest_score, injection_score)
        if injection_score > threshold:
            return "INJECTION_DETECTED"

    return "SAFE"

The model outputs a binary classification (INJECTION vs SAFE) with a confidence score. The score is compared against a threshold to make the final determination.
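The score normalization step deserves a concrete illustration: when the classifier's top label is SAFE, the injection likelihood is the complement of its confidence. The dictionary shape below mirrors a typical Hugging Face text-classification pipeline output, which is an assumption about the classifier, not a guarantee.

```python
def injection_score(result):
    # Convert a (label, score) pair into a score that always measures
    # injection likelihood: a confident SAFE verdict maps to a low score.
    if result["label"] == "INJECTION":
        return result["score"]
    return 1.0 - result["score"]

def decide(results, threshold=0.92):
    # Aggregate per-segment scores by taking the maximum, then threshold it
    highest = max(injection_score(r) for r in results)
    verdict = "INJECTION_DETECTED" if highest > threshold else "SAFE"
    return verdict, highest
```

For example, segments scored [{"label": "SAFE", "score": 0.97}, {"label": "INJECTION", "score": 0.95}] normalize to injection scores of roughly 0.03 and 0.95; the maximum (0.95) exceeds the 0.92 default threshold, so the whole prompt is rejected even though one segment looked safe.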

Related Pages

Implemented By
