Workflow: Microsoft BIPIA White-Box Defense Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLM_Security, Prompt_Injection, Fine_Tuning, Defense |
| Last Updated | 2026-02-14 15:00 GMT |
Overview
End-to-end process for finetuning an LLM with special data boundary tokens to teach it to ignore indirect prompt injection attacks embedded in external content.
Description
This workflow implements a white-box defense against indirect prompt injection by adding special start-of-data and end-of-data boundary tokens to the model's vocabulary and finetuning it to treat content within those markers as untrusted data. The approach uses supervised finetuning on attack-injected prompts paired with clean (non-attacked) responses, teaching the model to produce correct outputs even when malicious instructions are present in the external content. Three response construction strategies are supported: using BIPIA ground-truth labels, using the original model's clean responses, or using GPT-4-generated clean responses. Training uses DeepSpeed ZeRO Stage 3 for distributed multi-GPU finetuning.
Usage
Execute this workflow when you have access to the model weights and want to permanently improve an LLM's robustness against indirect prompt injection attacks through finetuning. You need sufficient GPU resources (8 V100 GPUs recommended for the training phase), the BIPIA training data across all five task types, and optionally clean response files generated by the target model or GPT-4.
Execution Steps
Step 1: Clean Response Collection
Generate clean (non-attacked) responses from the target LLM or GPT-4 on prompts without injected attacks. This step is needed for the "self_clean" and "gpt4_clean" response strategies. The clean response collector loads the BIPIA dataset without attack injection (using the no_insert function), runs inference through the selected model, and saves responses as a JSONL file. Each response is keyed by its prompt content for later lookup during training data construction.
Key considerations:
- Required only for "self_clean" and "gpt4_clean" strategies; "original" strategy uses BIPIA labels directly
- Must be run separately for each of the five task types (email, qa, abstract, table, code)
- The target model's own responses ("self_clean") may be less accurate but more stylistically consistent
- GPT-4 responses ("gpt4_clean") are typically higher quality but require API access
- Clean responses are matched to attacked prompts by stripping the attack from the context
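The keyed-JSONL lookup described above can be sketched as follows. This is a minimal illustration, assuming one JSON object per line with `message` and `response` fields; the function names and field names are hypothetical, not BIPIA's actual API.

```python
import json

def save_clean_responses(prompts, responses, path):
    """Write clean responses keyed by prompt content, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in zip(prompts, responses):
            f.write(json.dumps({"message": prompt, "response": response}) + "\n")

def load_clean_responses(path):
    """Rebuild the prompt -> clean-response lookup used in Step 2."""
    lookup = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            lookup[record["message"]] = record["response"]
    return lookup
```

Keying by prompt content (rather than by index) keeps the lookup stable even if the attacked and clean datasets are iterated in different orders.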
Step 2: Training Data Construction
Build the supervised finetuning dataset by combining attack-injected prompts with clean target responses. For each sample, the AutoPIABuilder constructs the attack-injected context, then the response function selects the appropriate clean response based on the chosen strategy. The training data spans all five task types when dataset_name is set to "all", creating a comprehensive defense training set. Each sample is structured as a (user_prompt, clean_response) conversation pair.
Key considerations:
- All five task types can be combined into a single training dataset
- Attack insertion is performed normally (attacks are present in the training prompts)
- The target response is always the clean answer (ignoring the attack)
- This teaches the model to produce correct outputs despite attack presence
- Ignore guidance instructions can optionally be included in prompts
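The strategy selection and sample structure above can be sketched in a few lines. This is an illustrative sketch, not the repo's actual builder code; the function names and the `conversations` layout are assumptions.

```python
def select_target_response(strategy, label, prompt, clean_lookup):
    """Pick the clean target response for one attack-injected training sample.

    strategy: "original" (BIPIA ground-truth label), "self_clean", or "gpt4_clean"
    clean_lookup: prompt -> clean-response mapping from Step 1 (unused for "original")
    """
    if strategy == "original":
        return label
    if strategy in ("self_clean", "gpt4_clean"):
        return clean_lookup[prompt]
    raise ValueError(f"unknown response strategy: {strategy}")

def build_sample(user_prompt_with_attack, target_response):
    """A training sample is a (user_prompt, clean_response) conversation pair."""
    return {"conversations": [
        {"role": "user", "content": user_prompt_with_attack},
        {"role": "assistant", "content": target_response},
    ]}
```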
Step 3: Tokenizer and Model Preparation
Load the base model and tokenizer from the YAML configuration, then add the special start-of-data and end-of-data boundary tokens to the vocabulary. The tokenizer is extended with these new tokens, and the model's embedding layers are resized accordingly. New token embeddings are initialized to the mean of existing embeddings for stable training. The model is configured with gradient checkpointing enabled to reduce memory usage.
Key considerations:
- Special tokens mark the boundaries of external (untrusted) content
- Embedding resize initializes new tokens to the mean of existing embeddings
- Padding token is set to the unknown token with right-side padding
- Model cache is disabled for training compatibility
- The Vicuna-style chat template (USER/ASSISTANT format) is used for prompt structure
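In HuggingFace terms, this step corresponds to `tokenizer.add_special_tokens` followed by `model.resize_token_embeddings`. The mean-initialization rule for the new rows can be illustrated framework-free with plain lists (a sketch of the arithmetic only, not the actual tensor code):

```python
def resize_with_mean_init(embeddings, num_new_tokens):
    """Append one row per new token, each initialized to the column-wise
    mean of the existing rows, which keeps early training stable."""
    dim = len(embeddings[0])
    n = len(embeddings)
    mean_row = [sum(row[d] for row in embeddings) / n for d in range(dim)]
    return embeddings + [list(mean_row) for _ in range(num_new_tokens)]
```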
Step 4: Training Data Tokenization
Tokenize each training sample by splitting the user prompt around the external context, inserting the special boundary tokens, and constructing the full input sequence. The sequence follows the format: BOS + system + USER: + pre_context + start-of-data token + context + end-of-data token + post_context + ASSISTANT: + response + EOS. Labels are masked (set to IGNORE_TOKEN_ID) for all tokens up to and including the ASSISTANT prefix, so the model is only trained to predict the clean response portion.
Key considerations:
- The start-of-data and end-of-data tokens explicitly mark external content boundaries
- Label masking ensures the model only learns to generate the response, not the prompt
- Sequences exceeding the maximum length (default 2048 tokens) are filtered out
- Tokenization is done at the sub-word level using the model's tokenizer
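The masking and length-filter logic can be sketched on raw token-id lists. IGNORE_TOKEN_ID of -100 follows the HuggingFace convention (positions with that label are skipped by the cross-entropy loss); the function name and return shape are illustrative.

```python
IGNORE_TOKEN_ID = -100  # HuggingFace convention: skipped by the loss

def build_example(prompt_ids, response_ids, eos_id, max_len=2048):
    """Concatenate prompt + response + EOS, masking the whole prompt
    (everything up to and including the ASSISTANT prefix) so only the
    clean response is learned. Returns None for over-length sequences,
    which are filtered out of the dataset."""
    input_ids = prompt_ids + response_ids + [eos_id]
    if len(input_ids) > max_len:
        return None
    labels = [IGNORE_TOKEN_ID] * len(prompt_ids) + response_ids + [eos_id]
    return {"input_ids": input_ids, "labels": labels}
```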
Step 5: Distributed Finetuning
Train the model using the HuggingFace Trainer with DeepSpeed ZeRO Stage 3 for distributed training across multiple GPUs. Training uses AdamW optimizer with cosine learning rate scheduling, warmup, and mixed-precision (FP16) training. The training loop supports checkpoint resumption and periodic saving. Training metrics can be tracked via Weights & Biases integration.
Key considerations:
- DeepSpeed ZeRO Stage 3 shards optimizer states, gradients, and parameters across GPUs
- Recommended: 8 V100 GPUs with FP16 training and gradient accumulation (steps=4)
- Default training: 1000 steps with learning rate 2e-5, batch size 4 per device
- Gradient checkpointing reduces memory at the cost of recomputation
- Model is saved via state dict collection to handle DeepSpeed sharding
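A minimal ZeRO Stage 3 configuration in the shape DeepSpeed expects, matching the defaults listed above; the exact values here mirror this document's recommendations and are assumptions, not the repo's actual config file.

```python
import json

ds_config = {
    "fp16": {"enabled": True},          # mixed-precision training
    "zero_optimization": {
        "stage": 3,                     # shard optimizer states, gradients, and parameters
        "overlap_comm": True,
        # Gather a full 16-bit state dict on save, so checkpoints are
        # usable despite DeepSpeed's parameter sharding.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The resulting `ds_config.json` would be passed to the HuggingFace Trainer via the `deepspeed` training argument.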
Step 6: Defended Model Evaluation
Run inference with the finetuned model on the BIPIA test set to generate defended responses. The evaluation script loads the finetuned checkpoint with the extended vocabulary (including the start-of-data and end-of-data boundary tokens), constructs test prompts with boundary markers, and generates responses. The VicunaWithSpecialToken wrapper automatically inserts the special tokens around external content in the prompt.
Key considerations:
- The finetuned model path replaces the standard model configuration
- Special context tokens are automatically inserted during prompt construction
- The same inference pipeline (batched DataLoader, resume support) is used
- Output format is compatible with the standard ASR evaluation pipeline
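The automatic token insertion can be sketched as a prompt-assembly helper. The boundary token strings are passed in as parameters because their exact names come from the Step 3 tokenizer extension; this function and its signature are illustrative, not the actual VicunaWithSpecialToken implementation.

```python
def wrap_external_content(system, pre_context, context, post_context,
                          start_tok, end_tok):
    """Build a Vicuna-style (USER/ASSISTANT) prompt with the external,
    untrusted context wrapped in the boundary tokens added in Step 3."""
    user = f"{pre_context}{start_tok}{context}{end_tok}{post_context}"
    return f"{system} USER: {user} ASSISTANT:"
```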
Step 7: ASR Evaluation
Compute the Attack Success Rate on the finetuned model's responses using the standard BipiaEvalFactory evaluation pipeline. Compare results against the undefended baseline to measure the effectiveness of the white-box defense. The evaluation uses the same run.py evaluate mode, enabling direct comparison.
Key considerations:
- Uses the standard evaluation pipeline from the main benchmark workflow
- Results are directly comparable with both undefended and black-box defended baselines
- Per-attack-type analysis reveals which attacks the finetuning mitigates
- ROUGE evaluation on clean responses can verify the model retains task performance
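The defended-vs-undefended comparison reduces to simple rate arithmetic, sketched below over per-prompt attack-success flags (0/1); the helper names are illustrative, not BipiaEvalFactory's API.

```python
def attack_success_rate(attack_succeeded):
    """ASR = fraction of attacked test prompts whose response carried out the attack."""
    return sum(attack_succeeded) / len(attack_succeeded)

def asr_reduction(undefended_flags, defended_flags):
    """Relative drop in ASR achieved by the white-box defense."""
    base = attack_success_rate(undefended_flags)
    return (base - attack_success_rate(defended_flags)) / base
```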