
Heuristic:Liu00222 Open Prompt Injection Defense Strategy Selection

From Leeroopedia
Knowledge Sources
Domains Security, NLP, Optimization
Last Updated 2026-02-14 15:30 GMT

Overview

Decision framework for selecting among 12+ prompt injection defense strategies based on resource availability, task type, and acceptable overhead.

Description

The Open-Prompt-Injection toolkit implements a wide range of defense strategies against prompt injection attacks. These defenses fall into several categories: prompt engineering defenses (sandwich, instructional, delimiters, XML, random_seq), pre-processing defenses (paraphrasing, retokenization), detection-based defenses (PPL, known-answer, LLM-based), response filtering (response-based), and specialized model-based defenses (DataSentinel, PromptLocate). Each defense has different resource requirements, compatibility constraints, and effectiveness trade-offs.
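The grouping above can be sketched as a simple lookup table. The defense names match the strings used throughout this page; the dict and helper below are a reader aid, not a structure taken from the toolkit:

```python
# Illustrative grouping of Open-Prompt-Injection defenses by category.
# This mapping is an assumption for exposition, not toolkit code.
DEFENSE_CATEGORIES = {
    "prompt_engineering": ["sandwich", "instructional", "delimiters", "xml", "random_seq"],
    "preprocessing": ["paraphrasing", "retokenization"],
    "detection": ["ppl", "known-answer", "llm-based"],
    "response_filtering": ["response-based"],
    "model_based": ["DataSentinel", "PromptLocate"],
}

def category_of(defense: str) -> str:
    """Return the category a given defense name belongs to."""
    for category, defenses in DEFENSE_CATEGORIES.items():
        if defense in defenses:
            return category
    raise KeyError(f"unknown defense: {defense}")
```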

Usage

Use this heuristic when choosing which defense strategy to deploy in an LLM-integrated application. The choice depends on: whether you have GPU resources, whether the task is classification or generation, how much latency overhead is acceptable, and whether you need to detect or localize injections.

The Insight (Rule of Thumb)

  • Zero-cost defenses (no extra model calls): `sandwich`, `instructional`, `delimiters`, `xml`, `random_seq` — these modify the prompt template only. Use as a baseline.
  • One extra model call: `paraphrasing` (preprocesses user data), `llm-based` (pre-query detection), `known-answer` (echo test with secret token "DGDSGNH").
  • Requires surrogate GPU model: `ppl` (perplexity-based detection) loads Vicuna-7B-v1.3 as a surrogate, needing CUDA + ~9GiB VRAM.
  • Requires fine-tuned model: DataSentinel and PromptLocate require a fine-tuned QLoRA checkpoint on Mistral-7B.
  • Response-based filtering: Only works for classification tasks (sst2, mrpc, rte, sms_spam, hsol). Not applicable to generation tasks (gigaword, jfleg).
  • Retokenization: Requires the BPE merge table file at `./data/subword_nmt.voc`. Uses dropout rate 0.1 with up to 10 retries.
  • Trade-off: Prompt engineering defenses add zero latency but offer limited effectiveness. Model-based defenses (PPL, DataSentinel) are more robust but require a GPU and add inference latency.
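The rule of thumb above can be condensed into a small decision helper. The function name, signature, and return convention are hypothetical; the real toolkit selects a defense via a string passed to `Application`:

```python
def suggest_defense(has_gpu: bool, has_finetuned_ckpt: bool,
                    is_classification: bool, latency_budget_calls: int) -> list:
    """Suggest candidate defenses following the rule of thumb above.

    latency_budget_calls: how many extra model calls per query are acceptable.
    Hypothetical helper for illustration; not part of Open-Prompt-Injection.
    """
    # Zero-cost prompt-engineering defenses are always a reasonable baseline.
    candidates = ["sandwich", "instructional", "delimiters", "xml", "random_seq"]
    if latency_budget_calls >= 1:
        candidates += ["paraphrasing", "llm-based", "known-answer"]
    if is_classification and latency_budget_calls >= 1:
        candidates.append("response-based")  # classification tasks only
    if has_gpu:
        candidates.append("ppl")  # needs Vicuna-7B surrogate, ~9GiB VRAM
    if has_gpu and has_finetuned_ckpt:
        candidates += ["DataSentinel", "PromptLocate"]  # QLoRA ckpt on Mistral-7B
    return candidates
```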

Reasoning

The defense strategy selection is encoded in the `Application.__defense_preparation()` method (`apps/Application.py:59-100`) which branches on the defense string. The response-based filter explicitly excludes generative tasks:

From `apps/Application.py:60`:

if self.defense == 'response-based' and self.task.dataset not in ['gigaword', 'jfleg']:
    self.response_based_filter = {
        'sst2':eval_sst2,
        'mrpc':eval_mrpc,
        'rte':eval_rte,
        'sms_spam':eval_spam,
        'hsol':eval_hsol
    }[self.task.dataset]
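How the selected filter is then applied can be sketched as follows. The checker here is a stand-in for the toolkit's `eval_sst2`-style evaluators, and the wrapper function is an illustrative assumption:

```python
def eval_sst2_stub(response: str) -> bool:
    """Stand-in for eval_sst2: accept only valid SST-2 labels."""
    return response.strip().lower() in {"positive", "negative"}

def filter_response(response: str, checker) -> str:
    """Response-based defense: if the model's output is not a valid label
    for the task, treat it as evidence of a hijacked prompt and block it.
    Hypothetical wrapper for illustration."""
    return response if checker(response) else "[blocked: invalid task output]"
```

This also makes clear why the defense cannot cover gigaword or jfleg: for open-ended summarization or grammar correction there is no closed label set to validate against.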

The PPL defense requires loading a full 7B model as surrogate:

From `apps/Application.py:84-93`:

self.surrogate_backbone, self.surrogate_tokenizer = load_model(
    'lmsys/vicuna-7b-v1.3',
    "cuda",
    8,
    "9GiB",
    ...
)
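The detection principle is that injected instructions tend to have higher perplexity under the surrogate model than clean task data. The thresholding step can be sketched self-contained, with token log-probabilities standing in for the surrogate's output (all names and the threshold convention here are illustrative assumptions):

```python
import math

def perplexity(token_logprobs: list) -> float:
    """Perplexity = exp(-mean token log-probability) over the sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_injection(token_logprobs: list, threshold: float) -> bool:
    """Flag the data segment as suspicious if its perplexity exceeds a
    threshold calibrated on clean data. Illustrative only; the real
    defense scores text with Vicuna-7B-v1.3 as the surrogate."""
    return perplexity(token_logprobs) > threshold
```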

The known-answer defense uses a hardcoded secret token with a TODO noting it should be dynamically generated:

From `apps/Application.py:148-149`:

# TODO: replace hard-coded secret data with one generated on-the-fly
secret_data = "DGDSGNH"
