
Principle:Microsoft BIPIA Clean Response Collection

From Leeroopedia

Overview

A baseline data collection methodology that generates LLM responses on clean (attack-free) prompts to establish ground-truth reference outputs for defense training and capability evaluation.

Description

Clean response collection runs the target LLM on prompts that contain the original task context without any injected attacks, using the no_insert() function to bypass attack insertion. The resulting clean responses serve two purposes: (1) providing alternative training targets for the "self_clean" response strategy in white-box defense finetuning (where the model learns to produce the same output it would produce without attacks), and (2) establishing capability baselines for ROUGE evaluation.
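The collection step described above can be sketched as follows. This is an illustrative reconstruction, not the actual BIPIA code: `no_insert`, `collect_clean_responses`, the sample fields, and the prompt template are all assumed names for the purpose of the example.

```python
def no_insert(context: str, attack: str) -> str:
    """Return the task context unchanged, bypassing attack insertion.

    Structural stand-in for the attack-insertion hook: same signature,
    but the attack payload is dropped.
    """
    return context

def collect_clean_responses(model, samples):
    """Run the model on attack-free prompts to get reference outputs."""
    clean_responses = []
    for sample in samples:
        # Same prompt construction as the attacked pipeline, except the
        # attack payload never enters the context.
        context = no_insert(sample["context"], sample["attack"])
        prompt = f"{sample['instruction']}\n\n{context}"
        clean_responses.append(model(prompt))
    return clean_responses
```

Because `no_insert` shares the signature of the attack-insertion function, the same collection loop can produce clean or attacked datasets by swapping one hook.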

Usage

Use before white-box defense finetuning when using the "self_clean" or "gpt4_clean" response strategy, or when evaluating model capability on clean data.

Theoretical Basis

The clean response acts as a "what the model should say" oracle. Because no_insert() returns the context unchanged, the clean dataset has the same structure as the attacked datasets but contains no malicious content. This enables:

clean_response = model(clean_prompt)

Then training proceeds as:

model(attacked_prompt) → clean_response

Because no_insert() simply returns its input unmodified, the prompt seen by the model during clean collection is structurally identical to an attacked prompt minus the injected payload. This structural parity ensures that any difference between clean and attacked outputs is attributable solely to the attack content, making the clean response a reliable oracle for defense training.
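The pairing of attacked prompts with clean targets can be sketched as below. This is a minimal sketch under assumed names: `insert_attack`, `no_insert`, `build_training_pairs`, and the sample fields are hypothetical, and the single append-style insertion position stands in for whatever insertion positions the real pipeline uses.

```python
def insert_attack(context: str, attack: str) -> str:
    """Inject the attack payload into the context (one simple position)."""
    return context + "\n" + attack

def no_insert(context: str, attack: str) -> str:
    """Structural twin of insert_attack that drops the payload."""
    return context

def build_training_pairs(model, samples,
                         template="{instruction}\n\n{context}"):
    """Build (attacked_prompt -> clean_response) pairs for "self_clean"."""
    pairs = []
    for s in samples:
        clean_prompt = template.format(
            instruction=s["instruction"],
            context=no_insert(s["context"], s["attack"]))
        attacked_prompt = template.format(
            instruction=s["instruction"],
            context=insert_attack(s["context"], s["attack"]))
        # Target: what the model says when the attack is absent.
        pairs.append((attacked_prompt, model(clean_prompt)))
    return pairs
```

Since both prompts come from the same template and differ only in the payload, any divergence between the model's outputs on them is attributable to the attack content alone.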
