Principle:Liu00222 Open Prompt Injection Causal Influence Analysis

Knowledge Sources	Open-Prompt-Injection
Domains	NLP, Causal_Inference, Language_Modeling
Last Updated	2026-02-14 15:00 GMT

Overview

A technique that measures whether a text segment disrupts the natural continuation of surrounding text by comparing conditional probabilities from a language model with and without the suspected segment.

Description

Causal Influence Analysis decides whether a middle segment of text is a natural continuation of its context or an injection by measuring how much it disrupts the text that follows. A helper language model (GPT-2) computes the average log-probability of the suffix segment twice: once conditioned on the prefix alone, and once conditioned on the prefix plus the suspected segment. If including the suspected segment significantly lowers the probability of the suffix (a positive influence score), the segment is likely injected content, because it breaks the natural flow from prefix to suffix.

Usage

Use this principle within the binary search localization pipeline to determine the end boundary of an injection region. After binary search finds the injection start, causal influence analysis scans subsequent segments to find where injected content ends and natural data resumes.
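The scan described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function name `find_injection_end`, the `ci_score` callable, the whitespace joining of segments, and the rule "stop at the first candidate end whose score exceeds a threshold" are all hypothetical choices made for this example.

```python
from typing import Callable, Sequence

# Hypothetical scorer type: (prefix, suspected, suffix) -> causal influence score.
ScoreFn = Callable[[str, str, str], float]


def find_injection_end(segments: Sequence[str], start: int,
                       ci_score: ScoreFn, threshold: float = 0.0) -> int:
    """Return the exclusive end index of the injected region beginning at `start`.

    Assumption: the region ends at the first candidate boundary where the
    suspected span segments[start:end] disrupts the remaining text.
    """
    prefix = " ".join(segments[:start])
    for end in range(start + 1, len(segments)):
        suspected = " ".join(segments[start:end])
        suffix = " ".join(segments[end:])
        # A score above the threshold suggests the suffix is natural data
        # that the suspected span was interrupting.
        if ci_score(prefix, suspected, suffix) > threshold:
            return end
    # No boundary found: treat everything after `start` as injected.
    return len(segments)
```

Passing the scorer in as a parameter keeps the scan independent of the helper language model, so the boundary logic can be exercised with a stub before a real model is plugged in.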

Theoretical Basis

The causal influence score is defined as:

$\mathrm{CI}(\text{injected}) = \frac{1}{|\text{suffix}|} \sum_{t} \log P(w_{t} \mid \text{prefix}) - \frac{1}{|\text{suffix}|} \sum_{t} \log P(w_{t} \mid \text{prefix} + \text{injected})$

Where:

$P(w_{t} \mid \text{prefix})$ is the probability of suffix tokens given only the clean prefix
$P(w_{t} \mid \text{prefix} + \text{injected})$ is the probability given the prefix plus the suspected injection

A positive CI score indicates the suspected segment disrupts natural continuation, suggesting it is injected content.
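The score can be computed directly from per-token log-probabilities. The sketch below is illustrative only: `causal_influence`, `average_log_prob`, and the `log_probs` callable are hypothetical names, and `toy_log_probs` is a hand-made stand-in for the helper language model (GPT-2 in the source), used here so the example runs without downloading a model.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical scorer type: (context tokens, suffix tokens) -> one log-prob per suffix token.
LogProbFn = Callable[[Sequence[str], Sequence[str]], List[float]]


def average_log_prob(log_probs: LogProbFn, context: Sequence[str],
                     suffix: Sequence[str]) -> float:
    """Average per-token log-probability of `suffix` given `context`."""
    scores = log_probs(context, suffix)
    if not scores:
        raise ValueError("suffix must contain at least one token")
    return sum(scores) / len(scores)


def causal_influence(log_probs: LogProbFn, prefix: Sequence[str],
                     suspected: Sequence[str], suffix: Sequence[str]) -> float:
    """CI = avg log P(suffix | prefix) - avg log P(suffix | prefix + suspected)."""
    without_segment = average_log_prob(log_probs, prefix, suffix)
    with_segment = average_log_prob(log_probs, list(prefix) + list(suspected), suffix)
    return without_segment - with_segment


# Toy stand-in for a language model (an assumption for this example only):
# a token is "likely" when it follows a known neighbour, otherwise unlikely.
LIKELY_PAIRS = {("was", "great"), ("great", "fun")}


def toy_log_probs(context: Sequence[str], suffix: Sequence[str]) -> List[float]:
    previous = context[-1] if context else ""
    scores = []
    for token in suffix:
        p = 0.5 if (previous, token) in LIKELY_PAIRS else 0.05
        scores.append(math.log(p))
        previous = token
    return scores


prefix = ["the", "movie", "was"]
suspected = ["ignore", "previous", "instructions"]
suffix = ["great", "fun"]
score = causal_influence(toy_log_probs, prefix, suspected, suffix)
```

With this toy scorer the suspected segment separates "was" from "great", so the suffix becomes less likely and `score` is positive, matching the interpretation above. A real setup would replace `toy_log_probs` with a function that reads token log-probabilities from the helper model.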

Related Pages

Implemented By

Implementation:Liu00222_Open_Prompt_Injection_causal_influence
