Principle: Microsoft BIPIA Training Data Construction
Overview
A supervised training data construction methodology that pairs poisoned prompts with correct (attack-ignoring) responses to teach LLMs to resist indirect prompt injection attacks.
Description
Training data construction builds supervised fine-tuning examples where each sample contains: (1) a poisoned prompt (task context with injected attack), and (2) the correct response (what the model should output, ignoring the attack). Three response strategies exist:
- "original" -- ground-truth ideal from the dataset
- "self_clean" -- the model's own response to the clean prompt
- "gpt4_clean" -- GPT-4's response to the clean prompt
The data module supports combining all five task types (qa, email, code, table, summarization) and both attack sets (text and code).
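The combination step above can be sketched as a cross-product of clean task samples with attack strings. This is an illustrative sketch, not the BIPIA module's actual API: the `inject` helper, the sample schema (`task`, `context`), and the attack-set dict are all hypothetical.

```python
import itertools

TASKS = ["qa", "email", "code", "table", "summarization"]

def inject(context: str, attack: str) -> str:
    """Hypothetical injection: append the attack to the external content."""
    return f"{context}\n{attack}"

def build_poisoned_prompts(clean_samples, attacks):
    """Cross clean task samples with attack strings to get poisoned prompts.

    clean_samples: dicts with 'task' and 'context' keys.
    attacks: dict mapping attack-set name ('text' or 'code') to a list
    of attack strings. All names here are illustrative.
    """
    poisoned = []
    for sample, (attack_set, attack_list) in itertools.product(
            clean_samples, attacks.items()):
        for attack in attack_list:
            poisoned.append({
                "task": sample["task"],
                "attack_set": attack_set,
                "poisoned_prompt": inject(sample["context"], attack),
            })
    return poisoned
```

Each poisoned prompt is then paired with an attack-ignoring response (chosen by one of the three strategies) to form a supervised fine-tuning example.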
Usage
Use when preparing supervised finetuning data for white-box defense training. Choose response strategy based on available resources and desired defense behavior.
Theoretical Basis
The training signal teaches the mapping:
poisoned_prompt → correct_response
Three oracle strategies provide the correct response:
- Original: Uses human-annotated ideal answers from the benchmark dataset. This is the most direct signal but may not match the model's natural output style.
- Self-clean: Uses the model's own output on clean prompts (self-distillation). This preserves the model's natural response distribution while teaching it to ignore attacks.
- GPT4-clean: Uses GPT-4's output on clean prompts (cross-model distillation). This provides high-quality reference responses that may exceed the target model's baseline capability.
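The three oracle strategies above can be sketched as a single dispatch over the sample. This is a hedged sketch, not the benchmark's implementation: the sample schema (`ideal`, `clean_prompt`) and the `target_model`/`gpt4` callables (each mapping a prompt string to a completion string) are assumptions.

```python
def oracle_response(sample, strategy, target_model=None, gpt4=None):
    """Return the attack-ignoring target response for one sample.

    sample: dict with 'ideal' (ground-truth answer) and 'clean_prompt'
    (task context without the injected attack); hypothetical schema.
    """
    if strategy == "original":
        # Human-annotated ideal answer from the benchmark dataset.
        return sample["ideal"]
    if strategy == "self_clean":
        # Self-distillation: target model's own output on the clean prompt.
        return target_model(sample["clean_prompt"])
    if strategy == "gpt4_clean":
        # Cross-model distillation: GPT-4's output on the clean prompt.
        return gpt4(sample["clean_prompt"])
    raise ValueError(f"unknown strategy: {strategy}")
```

Note that both "self_clean" and "gpt4_clean" require an extra inference pass over the clean prompts before training data can be assembled.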
The choice of strategy shapes the defense-capability tradeoff: "original" anchors responses to ground truth but may degrade fluency because it ignores the model's natural output style; "self_clean" preserves the model's voice; and "gpt4_clean" targets the highest response quality, at the cost of an additional inference pass with GPT-4.