Workflow: Microsoft BIPIA White-Box Defense Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLM_Security, Prompt_Injection, Fine_Tuning, Defense |
| Last Updated | 2026-02-14 15:00 GMT |
Overview
End-to-end process for finetuning an LLM with special data boundary tokens to teach it to ignore indirect prompt injection attacks embedded in external content.
Description
This workflow implements a white-box defense against indirect prompt injection by adding special start-of-data and end-of-data boundary tokens to the model's vocabulary and finetuning it to treat content within those markers as untrusted data. The approach uses supervised finetuning on attack-injected prompts paired with clean (non-attacked) responses, teaching the model to produce correct outputs even when malicious instructions are present in the external content. Three response construction strategies are supported: using BIPIA ground-truth labels, using the original model's clean responses, or using GPT-4-generated clean responses. Training uses DeepSpeed ZeRO Stage 3 for distributed multi-GPU finetuning.
Usage
Execute this workflow when you have access to the model weights and want to permanently improve an LLM's robustness against indirect prompt injection attacks through finetuning. You need sufficient GPU resources (8 V100 GPUs recommended for the training phase), the BIPIA training data across all five task types, and optionally clean response files generated by the target model or GPT-4.
Execution Steps
Step 1: Clean Response Collection
Generate clean (non-attacked) responses from the target LLM or GPT-4 on prompts without injected attacks. This step is needed for the "self_clean" and "gpt4_clean" response strategies. The clean response collector loads the BIPIA dataset without attack injection (using the no_insert function), runs inference through the selected model, and saves responses as a JSONL file. Each response is keyed by its prompt content for later lookup during training data construction.
Key considerations:
- Required only for "self_clean" and "gpt4_clean" strategies; "original" strategy uses BIPIA labels directly
- Must be run separately for each of the five task types (email, qa, abstract, table, code)
- The target model's own responses ("self_clean") may be less accurate but more stylistically consistent
- GPT-4 responses ("gpt4_clean") are typically higher quality but require API access
- Clean responses are matched to attacked prompts by stripping the attack from the context
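The keyed-JSONL lookup described above can be sketched as follows. This is a minimal illustration, assuming one JSON object per line with `message` and `response` fields; the function names and field names are hypothetical, not BIPIA's actual API.

```python
import json

def save_clean_responses(prompts, responses, path):
    """Write clean responses keyed by prompt content, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in zip(prompts, responses):
            f.write(json.dumps({"message": prompt, "response": response}) + "\n")

def load_clean_responses(path):
    """Rebuild the prompt -> clean-response lookup used in Step 2."""
    lookup = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            lookup[record["message"]] = record["response"]
    return lookup
```

Keying by prompt content (rather than by index) keeps the lookup stable even if the attacked and clean datasets are iterated in different orders.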
Step 2: Training Data Construction
Build the supervised finetuning dataset by combining attack-injected prompts with clean target responses. For each sample, the AutoPIABuilder constructs the attack-injected context, then the response function selects the appropriate clean response based on the chosen strategy. The training data spans all five task types when dataset_name is set to "all", creating a comprehensive defense training set. Each sample is structured as a (user_prompt, clean_response) conversation pair.
Key considerations:
- All five task types can be combined into a single training dataset
- Attack insertion is performed normally (attacks are present in the training prompts)
- The target response is always the clean answer (ignoring the attack)
- This teaches the model to produce correct outputs despite attack presence
- Ignore guidance instructions can optionally be included in prompts
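The strategy selection and sample structure above can be sketched in a few lines. This is an illustrative sketch, not the repo's actual builder code; the function names and the `conversations` layout are assumptions.

```python
def select_target_response(strategy, label, prompt, clean_lookup):
    """Pick the clean target response for one attack-injected training sample.

    strategy: "original" (BIPIA ground-truth label), "self_clean", or "gpt4_clean"
    clean_lookup: prompt -> clean-response mapping from Step 1 (unused for "original")
    """
    if strategy == "original":
        return label
    if strategy in ("self_clean", "gpt4_clean"):
        return clean_lookup[prompt]
    raise ValueError(f"unknown response strategy: {strategy}")

def build_sample(user_prompt_with_attack, target_response):
    """A training sample is a (user_prompt, clean_response) conversation pair."""
    return {"conversations": [
        {"role": "user", "content": user_prompt_with_attack},
        {"role": "assistant", "content": target_response},
    ]}
```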
Step 3: Tokenizer and Model Preparation
Load the base model and tokenizer from the YAML configuration, then add the special start-of-data and end-of-data boundary tokens to the vocabulary. The tokenizer is extended with these new tokens, and the model's embedding layers are resized accordingly. New token embeddings are initialized to the mean of existing embeddings for stable training. The model is configured with gradient checkpointing enabled to reduce memory usage.
Key considerations:
- Special tokens mark the boundaries of external (untrusted) content
- Embedding resize initializes new tokens to the mean of existing embeddings
- Padding token is set to the unknown token with right-side padding
- Model cache is disabled for training compatibility
- The Vicuna-style chat template (USER/ASSISTANT format) is used for prompt structure
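In HuggingFace terms, this step corresponds to `tokenizer.add_special_tokens` followed by `model.resize_token_embeddings`. The mean-initialization rule for the new rows can be illustrated framework-free with plain lists (a sketch of the arithmetic only, not the actual tensor code):

```python
def resize_with_mean_init(embeddings, num_new_tokens):
    """Append one row per new token, each initialized to the column-wise
    mean of the existing rows, which keeps early training stable."""
    dim = len(embeddings[0])
    n = len(embeddings)
    mean_row = [sum(row[d] for row in embeddings) / n for d in range(dim)]
    return embeddings + [list(mean_row) for _ in range(num_new_tokens)]
```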
Step 4: Training Data Tokenization
Tokenize each training sample by splitting the user prompt around the external context, inserting the special boundary tokens, and constructing the full input sequence. The sequence follows the format: BOS + system + USER: + pre_context + start-of-data token + context + end-of-data token + post_context + ASSISTANT: + response + EOS. Labels are masked (set to IGNORE_TOKEN_ID) for all tokens up to and including the ASSISTANT prefix, so the model is only trained to predict the clean response portion.
Key considerations:
- The start-of-data and end-of-data tokens explicitly mark external content boundaries
- Label masking ensures the model only learns to generate the response, not the prompt
- Sequences exceeding the maximum length (default 2048 tokens) are filtered out
- Tokenization is done at the sub-word level using the model's tokenizer
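The masking and length-filter logic can be sketched on raw token-id lists. IGNORE_TOKEN_ID of -100 follows the HuggingFace convention (positions with that label are skipped by the cross-entropy loss); the function name and return shape are illustrative.

```python
IGNORE_TOKEN_ID = -100  # HuggingFace convention: skipped by the loss

def build_example(prompt_ids, response_ids, eos_id, max_len=2048):
    """Concatenate prompt + response + EOS, masking the whole prompt
    (everything up to and including the ASSISTANT prefix) so only the
    clean response is learned. Returns None for over-length sequences,
    which are filtered out of the dataset."""
    input_ids = prompt_ids + response_ids + [eos_id]
    if len(input_ids) > max_len:
        return None
    labels = [IGNORE_TOKEN_ID] * len(prompt_ids) + response_ids + [eos_id]
    return {"input_ids": input_ids, "labels": labels}
```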
Step 5: Distributed Finetuning
Train the model using the HuggingFace Trainer with DeepSpeed ZeRO Stage 3 for distributed training across multiple GPUs. Training uses AdamW optimizer with cosine learning rate scheduling, warmup, and mixed-precision (FP16) training. The training loop supports checkpoint resumption and periodic saving. Training metrics can be tracked via Weights & Biases integration.
Key considerations:
- DeepSpeed ZeRO Stage 3 shards optimizer states, gradients, and parameters across GPUs
- Recommended: 8 V100 GPUs with FP16 training and gradient accumulation (steps=4)
- Default training: 1000 steps with learning rate 2e-5, batch size 4 per device
- Gradient checkpointing reduces memory at the cost of recomputation
- Model is saved via state dict collection to handle DeepSpeed sharding
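A minimal ZeRO Stage 3 configuration in the shape DeepSpeed expects, matching the defaults listed above; the exact values here mirror this document's recommendations and are assumptions, not the repo's actual config file.

```python
import json

ds_config = {
    "fp16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {
        "stage": 3,                     # shard optimizer states, gradients, and parameters
        "overlap_comm": True,
        # Gather a full 16-bit state dict on save, so checkpoints are
        # usable despite DeepSpeed's parameter sharding.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The resulting `ds_config.json` would be passed to the HuggingFace Trainer via the `deepspeed` training argument.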
Step 6: Defended Model Evaluation
Run inference with the finetuned model on the BIPIA test set to generate defended responses. The evaluation script loads the finetuned checkpoint with the extended vocabulary (including the start-of-data and end-of-data boundary tokens), constructs test prompts with boundary markers, and generates responses. The VicunaWithSpecialToken wrapper automatically inserts the special tokens around external content in the prompt.
Key considerations:
- The finetuned model path replaces the standard model configuration
- Special context tokens are automatically inserted during prompt construction
- The same inference pipeline (batched DataLoader, resume support) is used
- Output format is compatible with the standard ASR evaluation pipeline
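The automatic token insertion can be sketched as a prompt-assembly helper. The boundary token strings are passed in as parameters because their exact names come from the Step 3 tokenizer extension; this function and its signature are illustrative, not the actual VicunaWithSpecialToken implementation.

```python
def wrap_external_content(system, pre_context, context, post_context,
                          start_tok, end_tok):
    """Build a Vicuna-style (USER/ASSISTANT) prompt with the external,
    untrusted context wrapped in the boundary tokens added in Step 3."""
    user = f"{pre_context}{start_tok}{context}{end_tok}{post_context}"
    return f"{system} USER: {user} ASSISTANT:"
```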
Step 7: ASR Evaluation
Compute the Attack Success Rate on the finetuned model's responses using the standard BipiaEvalFactory evaluation pipeline. Compare results against the undefended baseline to measure the effectiveness of the white-box defense. The evaluation uses the same run.py evaluate mode, enabling direct comparison.
Key considerations:
- Uses the standard evaluation pipeline from the main benchmark workflow
- Results are directly comparable with both undefended and black-box defended baselines
- Per-attack-type analysis reveals which attacks the finetuning mitigates
- ROUGE evaluation on clean responses can verify the model retains task performance
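The defended-vs-undefended comparison reduces to simple rate arithmetic, sketched below over per-prompt attack-success flags (0/1); the helper names are illustrative, not BipiaEvalFactory's API.

```python
def attack_success_rate(attack_succeeded):
    """ASR = fraction of attacked test prompts whose response carried out the attack."""
    return sum(attack_succeeded) / len(attack_succeeded)

def asr_reduction(undefended_flags, defended_flags):
    """Relative drop in ASR achieved by the white-box defense."""
    base = attack_success_rate(undefended_flags)
    return (base - attack_success_rate(defended_flags)) / base
```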