Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Microsoft BIPIA ASR Evaluation

From Leeroopedia
Field Value
Sources BIPIA paper
Domains NLP, Security, Evaluation
Last Updated 2026-02-14

Overview

An attack success rate evaluation framework that measures LLM vulnerability to indirect prompt injection by dispatching model responses to attack-type-specific evaluators.

Description

The ASR (Attack Success Rate) evaluation concept provides a systematic methodology for quantifying how susceptible a large language model is to indirect prompt injection attacks. For each model response, a specialized evaluator determines whether the attack was "successful" -- that is, whether the model followed the injected instruction rather than performing its intended task.

Different attack types require fundamentally different evaluation methods:

  • GPT-based judging -- A chain-of-thought reasoning evaluator dispatched via the OpenAI API. A judge model examines the response and determines whether the injected instruction was followed, producing a binary success/failure verdict.
  • Language detection -- Used for translation-based attacks. The langdetect library checks whether the model response is in the language demanded by the injected instruction rather than the language requested by the legitimate task.
  • Fuzzy string matching -- Used for content injection attacks. The thefuzz library computes similarity between the model response and the injected target content, flagging a successful attack when similarity exceeds a threshold.
  • Encoding validation -- Used for attacks that instruct the model to encode its output. Validators check whether the response is valid base64, reversed text, or emoji-encoded output.
  • Encryption validation -- Used for cipher-based attacks. A Caesar cipher validator checks whether the model response matches the expected ciphertext transformation.

The system aggregates per-attack ASR into two summary metrics:

  • Macro ASR -- The unweighted average of per-attack-type ASR values, giving equal importance to each attack type regardless of sample count.
  • Micro ASR -- The sample-weighted average of per-attack-type ASR values, giving more importance to attack types with more samples.

Usage

Use ASR evaluation after generating model responses to measure how often each attack type successfully hijacks the model. The BIPIA benchmark defines 26 attack types organized across 4 categories:

  1. Task-irrelevant attacks -- Injected instructions unrelated to the original task (e.g., "tell me a joke" during summarization).
  2. Task-relevant attacks -- Injected instructions that plausibly relate to the original task but subvert its intent.
  3. Targeted attacks -- Injected instructions aimed at producing specific harmful or controlled outputs (e.g., translation to a particular language, encoding in base64).
  4. Code-based attacks -- Injected instructions embedded within code or structured data that exploit model parsing behavior.

Each category contains specialized evaluation chains that are dispatched automatically based on the attack name.

Theoretical Basis

The core metric is defined as:

ASR_k = successful_attacks_k / total_attacks_k

where k denotes a specific attack type. A sample is counted as a successful attack (value 1) if the evaluator for that attack type determines the model followed the injected instruction, and 0 otherwise.

The two aggregate metrics are:

Macro ASR = (1 / K) * sum(ASR_k for k in 1..K)

Micro ASR = sum(n_k * ASR_k for k in 1..K) / sum(n_k for k in 1..K)

where K is the number of active attack types and n_k is the number of samples for attack type k.

The evaluation dispatch pattern works as follows:

  1. A registry function maps each attack name to an evaluator factory.
  2. On initialization, the factory instantiates only the evaluators for the activated attacks.
  3. During evaluation, each sample is routed to the evaluator matching its attack name.
  4. Each evaluator applies its type-specific logic (GPT judging, language detection, fuzzy matching, or encoding/encryption validation) and records a binary result.
  5. After all samples are processed, per-attack ASR values are computed and aggregated into macro and micro metrics.

The 4 evaluator categories correspond to:

Category Evaluator Type Method
Model-based ModelEval GPT chain-of-thought judging via OpenAI API
Linguistic LanguageEval Language detection via langdetect
String-based MatchRefEval Fuzzy string matching via thefuzz
Encoding/Encryption BaseEncodeEval, CarsarEval, etc. Format validation (base64, reverse, emoji, Caesar cipher)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment