
Heuristic:Liu00222 Open Prompt Injection PPL Threshold Tuning

From Leeroopedia
Knowledge Sources
Domains Security, NLP, Optimization
Last Updated 2026-02-14 15:30 GMT

Overview

Perplexity-based defense threshold and window size configuration for detecting prompt injections via anomalous language patterns.

Description

The PPL (perplexity) defense detects prompt injections by measuring the perplexity of user input with a surrogate language model (Vicuna-7B-v1.3). Injected text typically has higher perplexity than natural text because it contains instruction-like language mixed with data. The defense supports two modes: whole-sequence perplexity (`window_size=all`) and windowed perplexity (`window_size=N`), which checks successive N-token windows of the input. The threshold is the maximum acceptable mean negative log-likelihood (NLL) per token before the input is flagged as injected.
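The relationship between the threshold and perplexity can be made concrete with a small sketch. The per-token NLL values below are hypothetical (in practice they come from the surrogate model), and `mean_nll` is our own helper, not a function from the repository; perplexity is simply the exponential of the mean NLL.

```python
import math

def mean_nll(token_nlls):
    """Mean negative log-likelihood per token (the quantity compared to the threshold)."""
    return sum(token_nlls) / len(token_nlls)

# Hypothetical per-token NLLs from a surrogate LM
natural  = [1.8, 2.1, 2.4, 2.0, 1.9]   # fluent, in-distribution text
injected = [1.9, 2.0, 4.8, 5.2, 4.9]   # instruction-like text mixed into data

threshold = 3.0  # e.g. the "3.0" in "ppl-all-3.0"

for name, nlls in [("natural", natural), ("injected", injected)]:
    m = mean_nll(nlls)
    ppl = math.exp(m)  # perplexity is exp(mean NLL)
    verdict = "pass" if m <= threshold else "flag"
    print(f"{name}: mean NLL={m:.2f}, PPL={ppl:.1f} -> {verdict}")
```

Here the natural text (mean NLL 2.04) passes while the injected text (mean NLL 3.76) is flagged, even though both start with similar tokens.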

Usage

Use this heuristic when configuring the PPL defense for a prompt injection experiment. The defense string format is `ppl-{window_size}-{threshold}` (e.g., `ppl-all-3.0` or `ppl-50-3.0`). Tuning the threshold trades off between false positives (blocking legitimate input) and false negatives (allowing injections through).

The Insight (Rule of Thumb)

  • Action: Set the defense parameter string as `ppl-{window_size}-{threshold}`.
  • Value: `window_size` can be `all` (whole sequence) or a positive integer (e.g., 50, 100). Threshold is a float representing the maximum acceptable mean NLL.
  • Trade-off: Lower threshold = more aggressive detection (higher false positive rate). Higher threshold = more permissive (higher false negative rate).
  • Windowed mode: More effective at detecting injections embedded in longer text, since the injection window will have anomalously high perplexity even if the overall sequence perplexity is normal.
  • Resource cost: Requires loading Vicuna-7B-v1.3 via fastchat with 8 GPUs and a 9 GiB memory allocation. This is a significant overhead.
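The false-positive/false-negative trade-off can be illustrated with hypothetical mean-NLL scores. The scores and the `evaluate` helper below are our own illustration, not data from the experiment; the point is that the legitimate and injected score distributions overlap, so any threshold sacrifices one error type for the other.

```python
# Hypothetical mean-NLL scores: legitimate inputs tend low, injected inputs tend
# high, but the distributions overlap, which makes threshold tuning a trade-off.
legit_scores    = [2.1, 2.6, 3.1, 2.4]   # one legitimate input scores above 3.0
injected_scores = [3.4, 4.2, 2.9, 5.0]   # one injection scores below 3.0

def evaluate(threshold):
    false_positives = sum(s > threshold for s in legit_scores)      # legit input blocked
    false_negatives = sum(s <= threshold for s in injected_scores)  # injection let through
    return false_positives, false_negatives

for t in (2.5, 3.0, 3.5):
    fp, fn = evaluate(t)
    print(f"threshold={t}: {fp} false positives, {fn} false negatives")
```

Sweeping the threshold from 2.5 to 3.5 moves the errors from 2 false positives / 0 false negatives to 0 false positives / 2 false negatives, which is exactly the trade-off the insight describes.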

Reasoning

The PPL defense is configured by parsing the defense string in `apps/Application.py:69-81`:

# The expected format is "ppl-<window_size>-<threshold>"
# For window-based PPL detection, <window_size> is a positive integer
# For non-window-based PPL detection, <window_size> should always be "all"
assert (len(self.defense.split('-')) == 3 and self.defense.split('-')[0] == 'ppl')
self.ppl_window_size = self.defense.split('-')[1]
self.ppl_threshold = float(self.defense.split('-')[2])
if self.ppl_window_size == 'all':
    pass
else:
    self.ppl_window_size = int(self.ppl_window_size)
    assert (self.ppl_window_size > 0)
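The parsing logic above can be mirrored as a standalone helper. The function name `parse_ppl_defense` is ours, not the repository's; the behavior follows the `Application.py` snippet directly.

```python
def parse_ppl_defense(defense: str):
    """Parse a 'ppl-<window_size>-<threshold>' string into (window_size, threshold).

    window_size is 'all' (whole-sequence mode) or a positive int (windowed mode);
    threshold is a float giving the maximum acceptable mean NLL per token.
    """
    parts = defense.split('-')
    assert len(parts) == 3 and parts[0] == 'ppl'
    window_size = parts[1]
    threshold = float(parts[2])
    if window_size != 'all':
        window_size = int(window_size)
        assert window_size > 0
    return window_size, threshold

print(parse_ppl_defense('ppl-all-3.0'))  # whole-sequence mode
print(parse_ppl_defense('ppl-50-3.0'))   # windowed mode, 50-token windows
```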

The PerplexityFilter class (adapted from the baseline-defenses repository) computes mean NLL per token:

From `apps/utils.py:34-47`:

def filter(self, sequences):
    filtered_log_ppl = []
    passed_filter = []
    for sequence in sequences:
        # get_log_prob returns the per-token negative log-likelihoods
        NLL_by_token = self.get_log_prob(sequence)
        filtered_log_ppl.append(NLL_by_token.mean().item())
        # An input passes the filter iff its mean NLL is at or below the threshold
        if NLL_by_token.mean() <= self.threshold:
            passed_filter.append(True)
        else:
            passed_filter.append(False)
    return filtered_log_ppl, passed_filter

The windowed mode iterates over the sequence in chunks (`apps/utils.py:49-80`), flagging the input if any window exceeds the threshold.
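The windowed check can be sketched in plain Python. This is a simplified stand-in for the repository's chunked iteration, not its exact code: it scores non-overlapping windows of per-token NLLs and flags the input if any window's mean exceeds the threshold. The NLL values are hypothetical.

```python
def windowed_flag(token_nlls, window_size, threshold):
    """Return True if any contiguous window of token NLLs has a mean above the threshold."""
    for start in range(0, len(token_nlls), window_size):
        window = token_nlls[start:start + window_size]
        if sum(window) / len(window) > threshold:
            return True   # this window looks anomalous -> flag the whole input
    return False

# A short injected burst inside otherwise-natural text: the whole-sequence mean
# stays under the 3.0 threshold, but the windowed check catches the anomalous chunk.
nlls = [2.0] * 40 + [4.5] * 10 + [2.0] * 40
print(sum(nlls) / len(nlls))          # whole-sequence mean NLL, about 2.28 -> would pass
print(windowed_flag(nlls, 10, 3.0))   # True: the injected window is caught
```

This illustrates the bullet above: a 10-token injection dilutes the whole-sequence mean below the threshold, while the per-window check still isolates it.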
