Principle:Microsoft BIPIA Results Analysis
| Field | Value |
|---|---|
| Sources | BIPIA: Benchmarking Indirect Prompt Injection Attacks, ROUGE: A Package for Automatic Evaluation of Summaries |
| Domains | NLP, Evaluation, Metrics |
| Last Updated | 2026-02-14 |
Overview
A capability evaluation methodology that measures LLM response quality on clean (non-attacked) prompts using ROUGE recall metrics to establish baseline performance.
Description
Results Analysis uses ROUGE recall (not F1) to measure how well model responses capture the content of a reference or ideal response. This capability evaluation is separate from ASR (Attack Success Rate) evaluation: it measures whether the model performs its primary task correctly when no attack is present.
Four ROUGE recall variants are computed:
- ROUGE-1 -- unigram overlap recall
- ROUGE-2 -- bigram overlap recall
- ROUGE-L -- longest common subsequence-based recall
- ROUGE-Lsum -- summary-level longest common subsequence recall
The use of recall (rather than precision or F1) is intentional because it measures information coverage: the proportion of the reference content that is successfully reproduced in the model output. A high recall score indicates that the model response captures most of the important information from the reference, regardless of any additional content the model may have generated.
Usage
Apply this analysis after collecting model responses on clean (non-attacked) data to assess baseline model capability for each task type. This establishes a performance reference point: if a model scores highly on ROUGE recall for clean prompts, degradation observed under attack conditions is more plausibly attributable to the attack than to an inherent inability of the model to perform the task.
Theoretical Basis
ROUGE-N recall is defined as:
ROUGE-N_recall = |overlapping_ngrams| / |reference_ngrams|
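Where |overlapping_ngrams| is the number of n-grams shared between the candidate and the reference, counted with clipping (each distinct n-gram contributes at most as many matches as it has occurrences in the reference), and |reference_ngrams| is the total number of n-grams in the reference, counting repeats. This yields a value in the range [0, 1].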
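The definition above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the scoring code used by BIPIA or the rouge_score library: the helper names `ngrams` and `rouge_n_recall` are hypothetical, and lowercased whitespace tokenization is an assumption (a real scorer may tokenize or stem differently, so scores can differ).

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall; whitespace tokenization is an assumption of this sketch."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0  # reference too short to contain any n-gram
    # Counter & Counter keeps the minimum count per n-gram, which is the clipping step
    overlap = sum((cand & ref).values())
    return overlap / total


reference = "the cat sat on the mat"
short = "the cat sat"
padded = "the cat sat quietly near the door"

print(rouge_n_recall(short, reference, n=1))   # 3 of 6 reference unigrams matched
print(rouge_n_recall(padded, reference, n=1))  # extra words do not lower recall
print(rouge_n_recall(short, reference, n=2))   # 2 of 5 reference bigrams matched
```

Because the denominator depends only on the reference, the padded candidate is not penalized for its additional words, which is the information-coverage behavior that motivates using recall.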
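ROUGE-L uses the length of the longest common subsequence (LCS) between the candidate and reference token sequences. A subsequence preserves token order but need not be contiguous, so ROUGE-L rewards in-order matches without requiring fixed-length n-gram overlap: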
ROUGE-L_recall = LCS(candidate, reference) / length(reference)
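A minimal sketch of this formula, assuming lowercased whitespace tokenization and measuring reference length in tokens. The function names `lcs_length` and `rouge_l_recall` are hypothetical and the code is illustrative rather than the library's implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length (in tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return lcs_length(cand, ref) / len(ref)


# In-order matches "the", "cat", "on", "mat" give an LCS of length 4 out of 6.
print(rouge_l_recall("the cat lay on a mat", "the cat sat on the mat"))

# Reversed word order: every reference token appears, but the LCS has length 1.
print(rouge_l_recall("d c b a", "a b c d"))
```

The second call shows how ROUGE-L differs from ROUGE-1: a candidate containing every reference token in the wrong order would score full unigram recall but only 0.25 here.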
Scores are aggregated across samples with bootstrap resampling to obtain confidence intervals, using the BootstrapAggregator class from the rouge_score library's scoring module (rouge_score.scoring.BootstrapAggregator). For each metric this yields a mid-point estimate together with low and high confidence bounds.
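To illustrate the idea behind bootstrap aggregation, the sketch below computes a percentile bootstrap interval over a list of per-sample recall scores using only the standard library. It is a stand-in for the general technique, not a copy of the library's aggregator: the function name `bootstrap_interval`, the default of 1000 resamples, the 95% confidence level, the fixed seed, and the nearest-rank percentile rule are all assumptions made for this example, and the input scores are made-up values.

```python
import random


def bootstrap_interval(values, n_samples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap of the mean; returns (low, mid, high)."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_samples):
        # Resample with replacement, keeping the original sample size
        resample = [rng.choice(values) for _ in values]
        means.append(sum(resample) / len(resample))
    means.sort()

    def percentile(p):
        # Nearest-rank percentile on the sorted bootstrap means
        idx = min(len(means) - 1, max(0, round(p * (len(means) - 1))))
        return means[idx]

    alpha = (1.0 - confidence) / 2.0
    return percentile(alpha), percentile(0.5), percentile(1.0 - alpha)


# Hypothetical per-sample ROUGE-1 recall scores for one task
recalls = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.69]
low, mid, high = bootstrap_interval(recalls)
print(f"recall: {mid:.3f} (95% CI {low:.3f} to {high:.3f})")
```

A wide interval signals that the baseline estimate rests on too few or too variable samples, which matters when the clean-prompt score is later used as a reference point for results under attack.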