Principle:Microsoft BIPIA Dataset Preparation
| Field | Value |
|---|---|
| Sources | BIPIA: Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models |
| Domains | NLP, Security, Benchmarking |
| Last Updated | 2026-02-14 |
Overview
A data construction methodology that systematically combines clean task contexts with adversarial prompt injection attacks to create evaluation datasets for LLM robustness benchmarking.
Description
Dataset Preparation in the BIPIA framework refers to the process of constructing poisoned datasets for evaluating large language model robustness against indirect prompt injection attacks. The core technique takes clean context data drawn from realistic sources -- including emails, web pages, code snippets, tables, and academic abstracts -- and injects adversarial attack strings at configurable positions (start, middle, or end of the context) to produce evaluation samples.
The construction follows a cross-product design: every combination of context, attack, and insertion position yields a distinct sample. This ensures comprehensive coverage across all interaction patterns between benign content and adversarial payloads. The clean contexts are sourced from established datasets and curated to represent the kinds of external content that LLMs encounter during retrieval-augmented generation or tool-assisted workflows. The attack strings are drawn from a taxonomy of 26 distinct attack types that span categories such as direct instruction injection, context manipulation, and role-playing exploits.
By separating the context data, attack data, and insertion logic into independent components, the methodology supports modular experimentation. Researchers can swap in new attack types, add new task domains, or adjust insertion strategies without restructuring the entire pipeline.
Usage
Use this principle when building evaluation benchmarks for LLM robustness against indirect prompt injection. The approach supports 5 task types (code completion, question answering, table reasoning, email processing, and abstract summarization) and 26 attack types. It is appropriate whenever a systematic, combinatorial evaluation of model behavior under adversarial conditions is needed, particularly when the goal is to measure how reliably a model can ignore injected instructions embedded within otherwise legitimate external content.
Theoretical Basis
The dataset construction rests on a combinatorial design. Given N contexts, M attacks, and P insertion positions, the resulting dataset contains N x M x P samples. Each sample preserves the original task metadata -- including the ideal answer and, for QA tasks, the associated question -- while injecting the attack string at the specified position within the context.
The insertion logic operates as follows:
- Start insertion -- the attack string is prepended to the context.
- End insertion -- the attack string is appended to the context.
- Middle insertion -- the context is segmented into sentences using the NLTK PunktSentenceTokenizer, and the attack string is inserted at a sentence boundary near the midpoint. This approach ensures that the injected content appears at a syntactically natural break point rather than splitting a sentence arbitrarily.
This combinatorial framework guarantees that every attack is tested against every context in every position, eliminating sampling bias and enabling fine-grained analysis of which attack-context-position combinations are most effective at misleading the target model.
The design also supports an optional stealth mode in which attack strings are base64-encoded before insertion, testing whether models can be exploited through obfuscated payloads that may bypass surface-level filtering.