Principle: Microsoft BIPIA Training Data Construction
Overview
A supervised training data construction methodology that pairs poisoned prompts with correct (attack-ignoring) responses to teach LLMs to resist indirect prompt injection attacks.
Description
Training data construction builds supervised fine-tuning examples where each sample contains: (1) a poisoned prompt (task context with injected attack), and (2) the correct response (what the model should output, ignoring the attack). Three response strategies exist:
- "original" -- ground-truth ideal from the dataset
- "self_clean" -- the model's own response to the clean prompt
- "gpt4_clean" -- GPT-4's response to the clean prompt
The data module supports combining all five task types (qa, email, code, table, summarization) and both attack sets (text and code).
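The combination step above can be sketched as a cross-product of clean task samples with attack strings. This is an illustrative sketch, not the BIPIA module's actual API: the `inject` helper, the sample schema (`task`, `context`), and the attack-set dict are all hypothetical.

```python
import itertools

TASKS = ["qa", "email", "code", "table", "summarization"]

def inject(context: str, attack: str) -> str:
    """Hypothetical injection: append the attack to the external content."""
    return f"{context}\n{attack}"

def build_poisoned_prompts(clean_samples, attacks):
    """Cross clean task samples with attack strings to get poisoned prompts.

    clean_samples: dicts with 'task' and 'context' keys.
    attacks: dict mapping attack-set name ('text' or 'code') to a list
    of attack strings. All names here are illustrative.
    """
    poisoned = []
    for sample, (attack_set, attack_list) in itertools.product(
            clean_samples, attacks.items()):
        for attack in attack_list:
            poisoned.append({
                "task": sample["task"],
                "attack_set": attack_set,
                "poisoned_prompt": inject(sample["context"], attack),
            })
    return poisoned
```

Each poisoned prompt is then paired with an attack-ignoring response (chosen by one of the three strategies) to form a supervised fine-tuning example.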
Usage
Use when preparing supervised finetuning data for white-box defense training. Choose response strategy based on available resources and desired defense behavior.
Theoretical Basis
The training signal teaches the mapping:
poisoned_prompt → correct_response
Three oracle strategies provide the correct response:
- Original: Uses human-annotated ideal answers from the benchmark dataset. This is the most direct signal but may not match the model's natural output style.
- Self-clean: Uses the model's own output on clean prompts (self-distillation). This preserves the model's natural response distribution while teaching it to ignore attacks.
- GPT4-clean: Uses GPT-4's output on clean prompts (cross-model distillation). This provides high-quality reference responses that may exceed the target model's baseline capability.
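The three oracle strategies above can be sketched as a single dispatch over the sample. This is a hedged sketch, not the benchmark's implementation: the sample schema (`ideal`, `clean_prompt`) and the `target_model`/`gpt4` callables (each mapping a prompt string to a completion string) are assumptions.

```python
def oracle_response(sample, strategy, target_model=None, gpt4=None):
    """Return the attack-ignoring target response for one sample.

    sample: dict with 'ideal' (ground-truth answer) and 'clean_prompt'
    (task context without the injected attack); hypothetical schema.
    """
    if strategy == "original":
        # Human-annotated ideal answer from the benchmark dataset.
        return sample["ideal"]
    if strategy == "self_clean":
        # Self-distillation: target model's own output on the clean prompt.
        return target_model(sample["clean_prompt"])
    if strategy == "gpt4_clean":
        # Cross-model distillation: GPT-4's output on the clean prompt.
        return gpt4(sample["clean_prompt"])
    raise ValueError(f"unknown strategy: {strategy}")
```

Note that both "self_clean" and "gpt4_clean" require an extra inference pass over the clean prompts before training data can be assembled.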
The choice of strategy shapes the defense-capability tradeoff: "original" anchors responses to ground truth but may degrade fluency because it ignores the model's natural output style; "self_clean" preserves the model's voice; and "gpt4_clean" targets the highest response quality, at the cost of an additional inference pass with GPT-4.