Principle:Microsoft BIPIA Tokenizer And Model Preparation
| Field | Value |
|---|---|
| Sources | BIPIA paper |
| Domains | NLP, Model_Architecture, Defense |
| Last Updated | 2026-02-14 |
Overview
A model preparation methodology that extends a pretrained LLM's vocabulary with special boundary tokens and resizes its embeddings to enable content-aware prompt processing for defense against indirect prompt injection.
Description
White-box defense against indirect prompt injection requires the model to distinguish between trusted instructions issued by the user and untrusted external content retrieved from third-party sources. This principle is realized by adding special <data> and </data> tokens to the tokenizer vocabulary. These tokens do not exist in the original pretrained vocabulary, so the model's embedding layers (both input and output) must be resized to accommodate the new vocabulary size.
New token embeddings are initialized to the average of all existing embeddings. This provides a reasonable starting point for finetuning: the new vectors sit near the centroid of the existing embedding space rather than at random or zero-valued positions. As a result, the model does not suffer catastrophic degradation during the early steps of finetuning.
This approach gives the model an explicit, structured signal for content boundaries rather than relying on implicit prompt formatting conventions (such as triple quotes or natural-language delimiters) that an attacker can easily mimic or subvert.
Usage
Use this principle when preparing a pretrained LLM for white-box defense finetuning. The special boundary tokens create a structured signal that the model can learn to attend to during the finetuning phase. By encoding content boundaries directly into the tokenizer vocabulary, the defense mechanism becomes part of the model's representational capacity rather than an external post-processing step.
Theoretical Basis
Vocabulary extension adds new token IDs to the tokenizer. If the original vocabulary size is V, adding k special tokens yields a new vocabulary of size V + k.
Embedding resizing extends the weight matrices of both the input embedding layer and the output projection (language model head) from dimensions (V, d) to (V + k, d), where d is the hidden dimension of the model.
Average initialization computes the new embedding vectors as:
new_emb = mean(existing_embeddings, dim=0)
This ensures the new tokens start in a reasonable region of the embedding space (near the centroid) rather than at random values. Starting from random initialization could cause large, unpredictable gradients in the early finetuning steps, potentially destabilizing training.
The <data> and </data> tokens act as structured markers analogous to HTML tags, providing explicit content boundaries that the model can learn to recognize. Unlike natural-language delimiters, these tokens occupy unique positions in the vocabulary and cannot be confused with ordinary text produced by an attacker.