Principle: Microsoft BIPIA Clean Response Collection
Overview
A baseline data collection methodology that generates LLM responses on clean (attack-free) prompts to establish ground-truth reference outputs for defense training and capability evaluation.
Description
Clean response collection runs the target LLM on prompts that contain the original task context without any injected attacks, using the no_insert() function to bypass attack insertion. The resulting clean responses serve two purposes: (1) providing alternative training targets for the "self_clean" response strategy in white-box defense finetuning (where the model learns to produce the same output it would produce without attacks), and (2) establishing capability baselines for ROUGE evaluation.
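The collection step described above can be sketched as follows. This is a minimal illustration, not BIPIA's actual code: the `collect_clean_responses` helper, the prompt template, and the `model` callable are assumptions; only the behavior of `no_insert()` (returning the context unchanged) comes from the source.

```python
def no_insert(context: str, attack: str) -> str:
    """Bypass attack insertion: return the task context unchanged."""
    return context

def collect_clean_responses(model, samples):
    """Run the model on attack-free prompts to obtain reference outputs.

    `model` is any callable mapping a prompt string to a response string;
    `samples` is an illustrative list of dicts with 'instruction',
    'context', and 'attack' fields (field names are assumptions).
    """
    clean_responses = []
    for sample in samples:
        # Same prompt-assembly path as attacked data, except the attack
        # string is never inserted into the context.
        context = no_insert(sample["context"], sample["attack"])
        prompt = f"{sample['instruction']}\n\nContext:\n{context}"
        clean_responses.append(model(prompt))
    return clean_responses
```

The resulting responses can then be stored alongside each sample as the training target for the "self_clean" strategy, or scored with ROUGE against gold answers to measure clean-data capability.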
Usage
Run clean response collection before white-box defense finetuning whenever the "self_clean" or "gpt4_clean" response strategy is used, or when evaluating model capability on clean data.
Theoretical Basis
The clean response acts as a "what the model should say" oracle. By using no_insert() (which returns context unchanged), the dataset has the same structure as attacked datasets but without malicious content. This enables:
clean_response = model(clean_prompt)
Then training proceeds as:
model(attacked_prompt) → clean_response
Because no_insert() simply returns its input unmodified, the prompt seen by the model during clean collection is structurally identical to an attacked prompt minus the injected payload. This structural parity ensures that any difference between clean and attacked outputs is attributable solely to the attack content, making the clean response a reliable oracle for defense training.
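The structural-parity argument can be made concrete with a small sketch. The prompt template and the `insert_attack` helper below are illustrative assumptions (BIPIA supports multiple insertion positions; a simple end-of-context injection is shown here), chosen only to demonstrate that clean and attacked prompts differ solely by the payload.

```python
# Illustrative template; the real BIPIA prompt format may differ.
TEMPLATE = "{instruction}\n\nContext:\n{context}"

def insert_attack(context: str, attack: str) -> str:
    """Append the attack payload to the context (one possible position)."""
    return f"{context}\n{attack}"

def build_pair(instruction: str, context: str, attack: str):
    """Build a (clean_prompt, attacked_prompt) pair from the same sample.

    Both prompts go through the identical template, so the attacked
    prompt is the clean prompt plus the injected payload and nothing else.
    """
    clean_prompt = TEMPLATE.format(instruction=instruction, context=context)
    attacked_prompt = TEMPLATE.format(
        instruction=instruction, context=insert_attack(context, attack)
    )
    return clean_prompt, attacked_prompt
```

Because the two prompts share every token except the payload, training `model(attacked_prompt) → clean_response` isolates the attack's effect: any divergence from the clean response must come from the injected content.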