Principle:Microsoft BIPIA Black Box Defense Configuration
| Field | Value |
|---|---|
| Sources | BIPIA: Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models |
| Domains | NLP, Security, Defense |
| Last Updated | 2026-02-14 |
Overview
A meta-prompting defense strategy that combines border string delimiters and in-context learning examples to teach API-based LLMs to distinguish between trusted user instructions and untrusted external content.
Description
The black-box defense is a prompt-level strategy designed to protect large language models against indirect prompt injection attacks without requiring access to model weights. The approach operates entirely through the prompting interface exposed by commercial APIs (e.g., OpenAI's GPT-3.5 and GPT-4), making it applicable in settings where fine-tuning or weight modification is not possible.
The defense has two complementary components:
(1) Border strings act as visual delimiters that wrap external (untrusted) content within the prompt. By surrounding retrieved context with explicit boundary markers, the model receives a structural cue that separates trusted user instructions from potentially adversarial external text. This is analogous to how HTML tags demarcate content regions in web documents. For example, a border string might wrap a retrieved email body with === markers above and below it, signalling to the model that everything inside the markers is external data and should not be interpreted as instructions.
(2) Few-shot in-context learning provides the model with example conversations that demonstrate the correct behavior when faced with injected attacks. These examples show the model receiving a prompt that contains an embedded injection attempt within the external content, and then responding by ignoring the injected instruction and answering only the legitimate user query. By observing these demonstrations, the model learns the behavioral pattern of treating bordered external content as data rather than directives.
The combined approach leverages both structural cues (border strings) and behavioral cues (few-shot examples) to harden the model against indirect prompt injection. The defense wraps GPT-3.5 and GPT-4 API models and modifies only the prompting strategy; no model weights are changed.
Usage
Use the black-box defense configuration when defending API-based LLMs (GPT-3.5, GPT-4) against indirect prompt injection attacks and you do not have access to model weights. The approach is purely prompt-based and requires no fine-tuning. It is suitable for any pipeline that retrieves external content (emails, web pages, documents) and incorporates it into the model's context window.
Theoretical Basis
The meta-prompting defense rests on two theoretical pillars:
Border strings create visual separators that help models identify content boundaries within a prompt. Just as HTML tags in web documents demarcate regions of content with distinct semantics, border strings demarcate the boundary between trusted instructions and untrusted external data. The BIPIA benchmark evaluates four border types:
| Border Type | Marker | Description |
|---|---|---|
empty |
(none) | No border is applied; the external content is inserted directly into the prompt without any delimiter. |
= |
====== |
A line of equals signs is placed above and below the external content. |
- |
------ |
A line of dashes is placed above and below the external content. |
code |
``` |
Triple backticks wrap the external content, mimicking a code fence. |
Few-shot examples provide demonstrations of correct behavior. Each example consists of a user prompt containing an embedded injection attempt within bordered external content, paired with a model response that ignores the injected attack and answers only the legitimate query. By observing these demonstrations, the model learns to treat bordered external content as inert data.
The combined strategy leverages both structural (borders) and behavioral (examples) cues. Borders provide a static, syntactic signal of content boundaries, while few-shot examples provide a dynamic, semantic signal of expected behavior. Together, they reinforce the model's ability to distinguish trusted instructions from untrusted content without any modification to the model itself.