Principle:Lakeraai Pint benchmark Prompt Injection Detection
| Knowledge Sources | |
|---|---|
| Domains | NLP, Security, Prompt_Injection |
| Last Updated | 2026-02-14 14:00 GMT |
Overview
A text classification technique that determines whether a given input prompt contains an injection attack attempting to override or manipulate an LLM's intended behavior.
Description
Prompt injection detection is a binary classification task: given an input string, determine whether it contains malicious instructions designed to hijack a language model's behavior. Detection methods range from rule-based heuristics to fine-tuned transformer classifiers.
In the PINT Benchmark context, detection is performed by passing individual text samples through a classifier and checking whether the output label matches the known injection label. For models with limited context windows, the input is chunked with overlapping strides, and an any-positive aggregation strategy is used: if any chunk is classified as injection, the entire input is flagged.
This approach addresses two challenges:
- Long input handling: Real-world prompts may exceed a model's token limit. Chunking with 25% overlap ensures injections near boundaries are captured.
- Architecture heterogeneity: Standard HuggingFace pipelines return label dictionaries, while SetFit models return integer predictions. The detection logic normalizes both output formats into a boolean result.
Usage
Use this technique when evaluating a prompt injection detection model's accuracy on individual text samples. It is the core inference step in the PINT Benchmark's Hugging Face evaluation workflow, invoked once per dataset row during benchmark execution.
Theoretical Basis
The detection follows a chunked binary classification with any-positive aggregation:
# Abstract algorithm (NOT real implementation)
chunks = chunk_with_overlap(prompt, max_length, stride=max_length//4)
predictions = [classify(chunk) for chunk in chunks]
is_injection = any(pred == INJECTION_LABEL for pred in predictions)
The any-positive aggregation is chosen because prompt injection payloads are typically localized within the input text, and a single positive detection in any chunk is sufficient evidence to flag the input.
For standard HuggingFace models:
- The pipeline returns
[{"label": "INJECTION", "score": 0.99}] - Detection checks:
label == injection_label
For SetFit models:
- The predictor returns an integer (0 or 1)
- Detection checks:
prediction == 1