Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Microsoft BIPIA Tokenizer And Model Preparation

From Leeroopedia
Field Value
Sources BIPIA paper
Domains NLP, Model_Architecture, Defense
Last Updated 2026-02-14

Overview

A model preparation methodology that extends a pretrained LLM's vocabulary with special boundary tokens and resizes its embeddings to enable content-aware prompt processing for defense against indirect prompt injection.

Description

White-box defense against indirect prompt injection requires the model to distinguish between trusted instructions issued by the user and untrusted external content retrieved from third-party sources. This principle is realized by adding special <data> and </data> tokens to the tokenizer vocabulary. These tokens do not exist in the original pretrained vocabulary, so the model's embedding layers (both input and output) must be resized to accommodate the new vocabulary size.

New token embeddings are initialized to the average of all existing embeddings. This provides a reasonable starting point for finetuning: the new vectors sit near the centroid of the existing embedding space rather than at random or zero-valued positions. As a result, the model does not suffer catastrophic degradation during the early steps of finetuning.

This approach gives the model an explicit, structured signal for content boundaries rather than relying on implicit prompt formatting conventions (such as triple quotes or natural-language delimiters) that an attacker can easily mimic or subvert.

Usage

Use this principle when preparing a pretrained LLM for white-box defense finetuning. The special boundary tokens create a structured signal that the model can learn to attend to during the finetuning phase. By encoding content boundaries directly into the tokenizer vocabulary, the defense mechanism becomes part of the model's representational capacity rather than an external post-processing step.

Theoretical Basis

Vocabulary extension adds new token IDs to the tokenizer. If the original vocabulary size is V, adding k special tokens yields a new vocabulary of size V + k.

Embedding resizing extends the weight matrices of both the input embedding layer and the output projection (language model head) from dimensions (V, d) to (V + k, d), where d is the hidden dimension of the model.

Average initialization computes the new embedding vectors as:

new_emb = mean(existing_embeddings, dim=0)

This ensures the new tokens start in a reasonable region of the embedding space (near the centroid) rather than at random values. Starting from random initialization could cause large, unpredictable gradients in the early finetuning steps, potentially destabilizing training.

The <data> and </data> tokens act as structured markers analogous to HTML tags, providing explicit content boundaries that the model can learn to recognize. Unlike natural-language delimiters, these tokens occupy unique positions in the vocabulary and cannot be confused with ordinary text produced by an attacker.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment