Principle:Microsoft BIPIA Results Analysis

Field	Value
Sources	BIPIA: Benchmarking Indirect Prompt Injection Attacks, ROUGE: A Package for Automatic Evaluation of Summaries
Domains	NLP, Evaluation, Metrics
Last Updated	2026-02-14

Overview

A capability evaluation methodology that measures LLM response quality on clean (non-attacked) prompts using ROUGE recall metrics to establish baseline performance.

Description

Results Analysis uses ROUGE recall (not F1) to measure how well model responses capture the content of a reference or ideal response. This capability evaluation is separate from Attack Success Rate (ASR) evaluation: it measures whether the model performs its primary task correctly when no attack is present.

Four ROUGE recall variants are computed:

ROUGE-1 -- unigram overlap recall
ROUGE-2 -- bigram overlap recall
ROUGE-L -- longest common subsequence-based recall
ROUGE-Lsum -- summary-level longest common subsequence recall

The use of recall (rather than precision or F1) is intentional because it measures information coverage: the proportion of the reference content that is successfully reproduced in the model output. A high recall score indicates that the model response captures most of the important information from the reference, regardless of any additional content the model may have generated.
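
The contrast between recall, precision, and F1 can be illustrated with a minimal sketch. It assumes lowercased whitespace tokenization and unigram matching only; the function name unigram_prf and the example strings are hypothetical and are not taken from BIPIA or the rouge_score library.

```python
from collections import Counter

def unigram_prf(candidate, reference):
    """Return (precision, recall, F1) over lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: the verbose answer contains the full reference
# plus extra conversational text.
reference = "the meeting is on friday at noon"
concise = "the meeting is on friday at noon"
verbose = ("sure here is the answer the meeting is on friday at noon "
           "let me know if you need more")

p_c, r_c, f_c = unigram_prf(concise, reference)
p_v, r_v, f_v = unigram_prf(verbose, reference)
print(f"concise: precision={p_c:.2f} recall={r_c:.2f} f1={f_c:.2f}")
print(f"verbose: precision={p_v:.2f} recall={r_v:.2f} f1={f_v:.2f}")
```

In this sketch both answers reach a recall of 1.0, while the verbose answer's precision and F1 drop because of the extra tokens. Conversational LLM output often looks like the verbose case, which is why recall is the coverage measure of interest here.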

Usage

Use after collecting model responses on clean (non-attacked) data to assess baseline model capability for each task type. This establishes a performance reference point: if a model scores highly on ROUGE recall for clean prompts, degradation observed under attack conditions is more plausibly attributable to the attack than to an inherent inability of the model.

Theoretical Basis

ROUGE-N recall is defined as:

ROUGE-N_recall = |overlapping_ngrams| / |reference_ngrams|

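Where |overlapping_ngrams| is the number of n-grams shared between the candidate and the reference, counted with multiplicity and clipped so that each n-gram is matched at most as many times as it occurs in the reference, and |reference_ngrams| is the total number of n-grams in the reference. This yields a value in the range [0, 1].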
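
A minimal sketch of this formula follows, assuming lowercased whitespace tokenization (the rouge_score library applies its own tokenizer, so its scores can differ). The helper names ngrams and rouge_n_recall are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0  # reference shorter than n tokens: no n-grams to recall
    # Counter intersection takes the minimum count per n-gram.
    overlap = sum((cand & ref).values())
    return overlap / total

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n_recall(candidate, reference, 1)  # 5 of 6 unigrams matched
r2 = rouge_n_recall(candidate, reference, 2)  # 3 of 5 bigrams matched
print(round(r1, 3), round(r2, 3))
```

A single substituted word costs one unigram but two bigrams, so ROUGE-2 recall falls faster than ROUGE-1 recall for the same edit.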

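ROUGE-L uses the longest common subsequence (LCS) between the candidate and reference token sequences. In the formula, LCS(candidate, reference) denotes the length of that subsequence in tokens and length(reference) is the number of reference tokens: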

ROUGE-L_recall = LCS(candidate, reference) / length(reference)
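
The LCS-based recall can be sketched with a standard dynamic-programming table, again assuming lowercased whitespace tokenization. The names lcs_length and rouge_l_recall are hypothetical, and the sketch covers sentence-level ROUGE-L only, not ROUGE-Lsum.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the number of reference tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return lcs_length(cand, ref) / len(ref)

reference = "the cat sat on the mat"
in_order = rouge_l_recall("the cat is on the mat", reference)
reordered = rouge_l_recall("on the mat the cat sat", reference)
print(round(in_order, 3), round(reordered, 3))
```

The reordered candidate contains every reference word, so its unigram recall would be 1.0, but only three tokens can be matched in order and its ROUGE-L recall is 0.5. This is the word-order sensitivity that the n-gram variants lack at n = 1.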

Scores are aggregated across samples with bootstrap resampling to obtain confidence intervals, using the BootstrapAggregator class from the scoring module of the rouge_score library (rouge_score.scoring.BootstrapAggregator). This yields a mid-point estimate along with low and high confidence bounds for each metric.
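
The idea behind the bootstrap aggregation can be sketched with the standard library alone. This is a percentile-bootstrap illustration, not the rouge_score implementation; the function name bootstrap_ci, its default parameters, and the per-sample scores are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_ci(scores, n_samples=1000, confidence=0.95, seed=0):
    """Return (low, mid, high) percentile-bootstrap estimates of the mean."""
    rng = random.Random(seed)
    # Resample the per-example scores with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_samples)
    )
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * (n_samples - 1))]
    mid = means[int(0.5 * (n_samples - 1))]
    high = means[int((1.0 - alpha) * (n_samples - 1))]
    return low, mid, high

# Hypothetical per-sample ROUGE-1 recall values for one task.
recalls = [0.82, 0.91, 0.77, 0.88, 0.95, 0.69, 0.84, 0.90]
low, mid, high = bootstrap_ci(recalls)
print(f"low={low:.3f} mid={mid:.3f} high={high:.3f}")
```

A wide gap between low and high signals that the clean-data baseline rests on too few samples to compare reliably against scores measured under attack.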

Related Pages

Implementation:Microsoft_BIPIA_RougeRecall
