Principle:Liu00222 Open Prompt Injection Causal Influence Analysis
| Knowledge Sources | |
|---|---|
| Domains | NLP, Causal_Inference, Language_Modeling |
| Last Updated | 2026-02-14 15:00 GMT |
Overview
A technique that measures whether a text segment disrupts the natural continuation of surrounding text by comparing conditional probabilities from a language model with and without the suspected segment.
Description
Causal Influence Analysis determines whether a middle segment of text is a natural continuation of its context or an injection by measuring its disruption effect. A helper language model (GPT-2) computes the average log-probability of a suffix segment conditioned on just the prefix versus conditioned on the prefix plus the suspected injected segment. If including the suspected segment significantly reduces the probability of the suffix (positive influence score), it is likely injected content because it disrupts the natural language flow.
Usage
Use this principle within the binary search localization pipeline to determine the end boundary of an injection region. After binary search finds the injection start, causal influence analysis scans subsequent segments to find where injected content ends and natural data resumes.
Theoretical Basis
The causal influence score is defined as:
Where:
- is the probability of suffix tokens given only clean prefix
- is the probability given prefix plus suspected injection
A positive CI score indicates the suspected segment disrupts natural continuation, suggesting it is injected content.