Principle:Microsoft BIPIA Training Data Construction

From Leeroopedia

Overview

A supervised training data construction methodology that pairs poisoned prompts with correct (attack-ignoring) responses to teach LLMs to resist indirect prompt injection attacks.

Description

Training data construction builds supervised fine-tuning examples where each sample contains: (1) a poisoned prompt (task context with injected attack), and (2) the correct response (what the model should output, ignoring the attack). Three response strategies exist:

  • "original" -- ground-truth ideal from the dataset
  • "self_clean" -- the model's own response to the clean prompt
  • "gpt4_clean" -- GPT-4's response to the clean prompt

The data module supports combining all 5 task types (qa, email, code, table, summarization) and both text and code attack sets.
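As a minimal sketch of how clean samples and attack sets might be combined into poisoned prompts, the block below crosses each sample with one attack set. The names `PoisonedPrompt`, `inject`, and `build_poisoned_prompts` are hypothetical, and two choices are assumptions rather than documented behavior: code attacks are paired only with the code task, and the attack is appended to the end of the context.

```python
from dataclasses import dataclass

# Task names come from the description above.
# Assumption: the "code" task uses the code attack set, all other tasks use text attacks.
TEXT_TASKS = ("qa", "email", "table", "summarization")
CODE_TASKS = ("code",)


@dataclass(frozen=True)
class PoisonedPrompt:
    task: str      # one of the five task types
    context: str   # external content with the attack injected
    question: str  # the user's actual request
    attack: str    # the injected instruction, kept for bookkeeping


def inject(context: str, attack: str) -> str:
    """Append the attack to the end of the context (a hypothetical placement)."""
    return f"{context}\n{attack}"


def build_poisoned_prompts(samples, text_attacks, code_attacks):
    """Cross every clean sample with the attack set assumed to match its task."""
    prompts = []
    for sample in samples:
        attacks = code_attacks if sample["task"] in CODE_TASKS else text_attacks
        for attack in attacks:
            prompts.append(
                PoisonedPrompt(
                    task=sample["task"],
                    context=inject(sample["context"], attack),
                    question=sample["question"],
                    attack=attack,
                )
            )
    return prompts
```

Each resulting `PoisonedPrompt` still needs a correct response attached before it becomes a training example; the response comes from one of the three strategies listed above.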

Usage

Use this principle when preparing supervised fine-tuning data for white-box defense training. Choose the response strategy based on available resources and the desired defense behavior.

Theoretical Basis

The training signal teaches:

given(poisoned_prompt) → correct_response
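One common way to realize this signal in supervised fine-tuning is to compute the loss only on response tokens, so the model learns to produce the correct response without being trained to reproduce the poisoned prompt. The sketch below assumes that convention; whether the actual training code masks prompt tokens is not stated here. The helper `build_labels` is hypothetical, and `-100` is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response token ids and mask the prompt in the labels.

    Assumption: only response tokens contribute to the loss.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels
```

With this layout, positions that belong to the poisoned prompt (including the injected attack) are skipped by the loss, and only the attack-ignoring response is supervised.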

Three oracle strategies provide the correct response:

  1. Original: Uses human-annotated ideal answers from the benchmark dataset. This is the most direct signal but may not match the model's natural output style.
  2. Self-clean: Uses the model's own output on clean prompts (self-distillation). This preserves the model's natural response distribution while teaching it to ignore attacks.
  3. GPT4-clean: Uses GPT-4's output on clean prompts (cross-model distillation). This provides high-quality reference responses that may exceed the target model's baseline capability.
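The three strategies above differ only in where the target response comes from, which can be sketched as a small dispatch function. This is an illustration under stated assumptions, not the benchmark's implementation: `select_response`, `make_training_pair`, and the `generators` mapping are hypothetical names, and each generator is assumed to be a callable that maps a clean prompt to a model response.

```python
from typing import Callable, Dict

Generator = Callable[[str], str]  # hypothetical: clean prompt -> model response


def select_response(strategy: str, ideal: str, clean_prompt: str,
                    generators: Dict[str, Generator]) -> str:
    """Return the target response for one sample under the chosen strategy."""
    if strategy == "original":
        # Ground-truth ideal answer shipped with the dataset.
        return ideal
    if strategy in ("self_clean", "gpt4_clean"):
        # Distillation: query a model on the clean (attack-free) prompt.
        return generators[strategy](clean_prompt)
    raise ValueError(f"unknown response strategy: {strategy!r}")


def make_training_pair(poisoned_prompt: str, clean_prompt: str, ideal: str,
                       strategy: str, generators: Dict[str, Generator]) -> dict:
    """Pair the poisoned prompt with a response derived from the clean prompt."""
    response = select_response(strategy, ideal, clean_prompt, generators)
    return {"prompt": poisoned_prompt, "response": response}
```

In every case the response is derived without the attack in view, while the prompt stored in the training pair is the poisoned one. That asymmetry is what teaches the model to ignore the injected instruction.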

The choice of strategy affects the trade-off between defense and capability: "original" anchors to the ground truth but may degrade fluency; "self_clean" preserves the model's voice but requires an inference pass with the target model over the clean prompts; and "gpt4_clean" aims for the highest response quality at the cost of an additional inference pass with GPT-4.
