Implementation: Microsoft BIPIA Smart Tokenizer and Embedding Resize
| Field | Value |
|---|---|
| Sources | BIPIA repository |
| Domains | NLP, Model_Architecture, Defense |
| Last Updated | 2026-02-14 |
Overview
Concrete tool for extending tokenizer vocabulary and resizing model embeddings with averaged initialization provided by the BIPIA defense module.
Description
The smart_tokenizer_and_embedding_resize() function performs three steps in sequence:
- Adds special tokens to the tokenizer via tokenizer.add_tokens(), which assigns a new unique token ID to each provided special token string.
- Resizes the model's token embeddings via model.resize_token_embeddings(), expanding both the input embedding matrix and the output language model head to accommodate the enlarged vocabulary.
- Initializes the new embeddings to the mean of all existing embeddings, for both the input embedding layer and the output embedding layer. This averaged initialization ensures the new token representations start in a reasonable region of the embedding space.
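The three steps above can be sketched with the standard transformers APIs. This is an illustrative reconstruction of the described behavior, not the verbatim BIPIA implementation:

```python
from typing import List

import transformers


def smart_tokenizer_and_embedding_resize(
    special_tokens_list: List[str],
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
) -> None:
    # Step 1: add the special tokens; add_tokens returns how many were new.
    num_new_tokens = tokenizer.add_tokens(special_tokens_list)
    # Step 2: grow both the input embedding matrix and the LM head.
    model.resize_token_embeddings(len(tokenizer))
    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data
        # Step 3: initialize the new rows to the mean of the original rows.
        input_embeddings[-num_new_tokens:] = input_embeddings[
            :-num_new_tokens
        ].mean(dim=0, keepdim=True)
        output_embeddings[-num_new_tokens:] = output_embeddings[
            :-num_new_tokens
        ].mean(dim=0, keepdim=True)
```

Note that the mean is taken over the original rows only (the slice excludes the freshly appended rows), so the initialization is independent of whatever values resize_token_embeddings placed in the new positions.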
The broader setup also handles tokenizer and model loading. The tokenizer is loaded via AutoTokenizer.from_pretrained() with right padding configured (padding_side="right") and slow tokenizer mode (use_fast=False). The model is loaded via AutoModelForCausalLM.from_pretrained() with an optional cache directory.
Usage
Called during white-box defense setup, after loading the base pretrained model and tokenizer but before creating the training dataset. The special tokens must be added before any data is tokenized so that the tokenizer correctly encodes the boundary markers during dataset preparation.
Code Reference
Source: BIPIA repository, file defense/white_box/finetune.py
- Lines 171-195: smart_tokenizer_and_embedding_resize() function definition
- Lines 497-519: tokenizer and model loading block
Function signature:
```python
def smart_tokenizer_and_embedding_resize(
    special_tokens_list: List,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
) -> None:
```
Tokenizer and model loading:
```python
transformers.AutoTokenizer.from_pretrained(
    model_name,
    model_max_length=...,
    padding_side="right",
    use_fast=False,
)
transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=...,
)
```
Import: smart_tokenizer_and_embedding_resize is an internal function defined in defense/white_box/finetune.py. For the transformers APIs:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
```
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| special_tokens_list | List[str] | Yes | Tokens to add to the vocabulary, e.g. ["<data>", "</data>"] |
| tokenizer | PreTrainedTokenizer | Yes | The tokenizer instance to extend with new tokens |
| model | PreTrainedModel | Yes | The model whose embedding layers will be resized |
Outputs:
None. The function modifies the tokenizer and model in-place:
- The tokenizer gains new token IDs for each added special token.
- The model embedding layers are resized from (V, d) to (V+k, d), where V is the original vocabulary size, k is the number of added tokens, and d is the hidden dimension.
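The resize from (V, d) to (V+k, d) with averaged initialization can be illustrated with a plain torch embedding. The sizes here are hypothetical and the snippet is independent of the BIPIA code:

```python
import torch

# Hypothetical sizes: original vocab V=10, hidden dim d=8, k=2 new tokens.
V, d, k = 10, 8, 2
old = torch.nn.Embedding(V, d)

# Grow the matrix from (V, d) to (V+k, d), keeping the original rows.
new = torch.nn.Embedding(V + k, d)
new.weight.data[:V] = old.weight.data

# Averaged initialization: the k new rows start at the mean of the V old rows.
new.weight.data[V:] = old.weight.data.mean(dim=0, keepdim=True)
```

In the real function, resize_token_embeddings performs the copy-and-grow step for both the input embeddings and the LM head in one call.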
Usage Examples
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Step 1: Load the pretrained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    model_max_length=512,
    padding_side="right",
    use_fast=False,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    cache_dir="/path/to/cache",
)

# Step 2: Add boundary tokens and resize embeddings
smart_tokenizer_and_embedding_resize(
    special_tokens_list=["<data>", "</data>"],
    tokenizer=tokenizer,
    model=model,
)

# After this call:
# - len(tokenizer) has increased by 2 (note: tokenizer.vocab_size does not
#   count added tokens, so use len(tokenizer) for the full vocabulary size)
# - tokenizer.convert_tokens_to_ids("<data>") returns a valid single token ID
# - model input/output embedding dimensions match the new vocabulary size
# - New embeddings are initialized to the mean of existing embeddings
```