Principle:Microsoft BIPIA Defended Model Evaluation
| Field | Value |
|---|---|
| Sources | BIPIA paper |
| Domains | NLP, Security, Evaluation |
| Last Updated | 2026-02-14 |
Overview
A defense evaluation methodology that tests finetuned LLMs with special boundary tokens on both attacked and clean prompts to measure defense effectiveness and capability preservation.
Description
After white-box defense finetuning, the model must be evaluated on two dimensions:
(1) Defense effectiveness — does the model resist prompt injection attacks when <data>/</data> tokens mark external content?
(2) Capability preservation — does the model still produce correct task responses when external content is clean?
The evaluation uses VicunaWithSpecialToken, which overrides the standard Vicuna prompt construction to insert <data>/</data> markers around external content in the Vicuna chat template. Responses are then evaluated using BipiaEvalFactory for ASR metrics.
Usage
Use after completing white-box defense finetuning to assess whether the defense works and whether the model retains its task capabilities.
Theoretical Basis
Defense effectiveness is measured by ASR reduction: a successful defense should lower ASR from baseline (no defense) toward zero.
Capability preservation is measured by comparing ROUGE/task metrics between the finetuned model on clean data vs. the base model on clean data.
The <data>/</data> tokens provide a structured signal that the finetuned model has learned to interpret as "ignore instructions within these markers."