Principle:Microsoft BIPIA Defended Model Evaluation

Field	Value
Sources	BIPIA paper
Domains	NLP, Security, Evaluation
Last Updated	2026-02-14

Overview

A methodology for evaluating LLMs that were finetuned with special boundary tokens: the model is tested on both attacked and clean prompts to measure defense effectiveness and capability preservation.

Description

After white-box defense finetuning, the model must be evaluated on two dimensions:

(1) Defense effectiveness: does the model resist prompt injection attacks when <data> and </data> tokens mark the external content?

(2) Capability preservation: does the model still produce correct task responses when the external content is clean?

The evaluation uses VicunaWithSpecialToken, which overrides the standard Vicuna prompt construction so that external content is wrapped in <data> and </data> markers inside the Vicuna chat template. The generated responses are then scored with BipiaEvalFactory to obtain attack success rate (ASR) metrics.
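
The wrapping step described above can be sketched as follows. This is a minimal illustration, not the VicunaWithSpecialToken implementation: the function names wrap_external_content and build_prompt are hypothetical, and the "USER: ... ASSISTANT:" layout is a simplified stand-in for the Vicuna chat template.

```python
# Hypothetical sketch: wrap external content in boundary markers before
# placing it in a chat-style prompt. Names and layout are assumptions.
DATA_OPEN = "<data>"
DATA_CLOSE = "</data>"


def wrap_external_content(content: str) -> str:
    """Surround external (possibly attacked) content with boundary markers."""
    return f"{DATA_OPEN}{content.strip()}{DATA_CLOSE}"


def build_prompt(system: str, question: str, external_content: str) -> str:
    """Assemble a simplified single-turn prompt (assumed layout)."""
    user_turn = f"{wrap_external_content(external_content)}\n\n{question}"
    return f"{system} USER: {user_turn} ASSISTANT:"


prompt = build_prompt("SYS", "What is the total?", "Invoice total: 42")
```

The same builder would be used for both attacked and clean external content, so that the only difference between the two evaluation runs is the content inside the markers.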

Usage

Use after completing white-box defense finetuning to assess whether the defense works and whether the model retains its task capabilities.

Theoretical Basis

Defense effectiveness is measured by ASR reduction: a successful defense should lower ASR from the undefended baseline toward zero.
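
As a rough sketch of this measurement, ASR can be treated as the fraction of attacked prompts on which the attack succeeded, and the reduction as the difference between the undefended and defended rates. The helper names below are hypothetical, and the per-response success flags are assumed to come from an evaluator such as BipiaEvalFactory, whose interface is not shown here.

```python
from typing import Sequence


def attack_success_rate(success_flags: Sequence[bool]) -> float:
    """Fraction of attacked prompts where the injected instruction was followed."""
    if not success_flags:
        raise ValueError("need at least one evaluated response")
    return sum(success_flags) / len(success_flags)


def asr_reduction(baseline_asr: float, defended_asr: float) -> float:
    """Absolute drop in ASR; positive values mean the defense helped."""
    return baseline_asr - defended_asr


# Illustrative flags only, not measured results.
baseline = attack_success_rate([True, True, False, False])    # 0.5
defended = attack_success_rate([True, False, False, False])   # 0.25
drop = asr_reduction(baseline, defended)                      # 0.25
```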

Capability preservation is measured by comparing ROUGE and other task metrics of the finetuned model against those of the base model, with both evaluated on clean data.
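
One simple way to summarize this comparison, assuming per-example task scores (for instance ROUGE values) have already been computed for both models on the same clean prompts, is the difference in mean score. The function name capability_delta is hypothetical and the numbers are illustrative, not measured results.

```python
from statistics import mean
from typing import Sequence


def capability_delta(base_scores: Sequence[float],
                     finetuned_scores: Sequence[float]) -> float:
    """Mean finetuned score minus mean base score on clean data.

    A value near zero suggests the defense preserved task capability;
    a clearly negative value suggests a regression.
    """
    if not base_scores or not finetuned_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(finetuned_scores) - mean(base_scores)


# Illustrative per-example scores only.
delta_same = capability_delta([0.5, 0.5], [0.25, 0.75])   # 0.0
delta_worse = capability_delta([0.5, 0.5], [0.25, 0.25])  # -0.25
```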

The <data>/</data> tokens provide a structured signal that the finetuned model has learned to interpret as "ignore instructions within these markers."

Related Pages

Implementation:Microsoft_BIPIA_VicunaWithSpecialToken
