Implementation: Microsoft BIPIA Smart Tokenizer and Embedding Resize
| Field | Value |
|---|---|
| Sources | BIPIA repository |
| Domains | NLP, Model_Architecture, Defense |
| Last Updated | 2026-02-14 |
Overview
Concrete tool for extending tokenizer vocabulary and resizing model embeddings with averaged initialization provided by the BIPIA defense module.
Description
The smart_tokenizer_and_embedding_resize() function performs three steps in sequence:
- Adds special tokens to the tokenizer via tokenizer.add_tokens(), which assigns a new unique token ID to each provided special token string.
- Resizes the model's token embeddings via model.resize_token_embeddings(), expanding both the input embedding matrix and the output language model head to accommodate the enlarged vocabulary.
- Initializes the new embeddings to the mean of all existing embeddings, for both the input embedding layer and the output embedding layer. This averaged initialization ensures the new token representations start in a reasonable region of the embedding space.
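The three steps above can be sketched with the standard transformers APIs. This is an illustrative reconstruction of the described behavior, not the verbatim BIPIA implementation:

```python
from typing import List

import transformers


def smart_tokenizer_and_embedding_resize(
    special_tokens_list: List[str],
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
) -> None:
    # Step 1: add the special tokens; add_tokens returns how many were new.
    num_new_tokens = tokenizer.add_tokens(special_tokens_list)
    # Step 2: grow both the input embedding matrix and the LM head.
    model.resize_token_embeddings(len(tokenizer))
    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data
        # Step 3: initialize the new rows to the mean of the original rows.
        input_embeddings[-num_new_tokens:] = input_embeddings[
            :-num_new_tokens
        ].mean(dim=0, keepdim=True)
        output_embeddings[-num_new_tokens:] = output_embeddings[
            :-num_new_tokens
        ].mean(dim=0, keepdim=True)
```

Note that the mean is taken over the original rows only (the slice excludes the freshly appended rows), so the initialization is independent of whatever values resize_token_embeddings placed in the new positions.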
The broader setup also handles tokenizer and model loading. The tokenizer is loaded via AutoTokenizer.from_pretrained() with right padding configured (padding_side="right") and slow tokenizer mode (use_fast=False). The model is loaded via AutoModelForCausalLM.from_pretrained() with an optional cache directory.
Usage
Called during white-box defense setup, after loading the base pretrained model and tokenizer but before creating the training dataset. The special tokens must be added before any data is tokenized so that the tokenizer correctly encodes the boundary markers during dataset preparation.
Code Reference
Source: BIPIA repository, file defense/white_box/finetune.py
- Lines 171-195: smart_tokenizer_and_embedding_resize() function definition
- Lines 497-519: tokenizer and model loading block
Function signature:
```python
def smart_tokenizer_and_embedding_resize(
    special_tokens_list: List,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
) -> None:
```
Tokenizer and model loading:
```python
transformers.AutoTokenizer.from_pretrained(
    model_name,
    model_max_length=...,
    padding_side="right",
    use_fast=False,
)
transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=...,
)
```
Import: smart_tokenizer_and_embedding_resize is an internal function defined in defense/white_box/finetune.py. For the transformers APIs:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
```
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| special_tokens_list | List[str] | Yes | Tokens to add to the vocabulary, e.g. ["<data>", "</data>"] |
| tokenizer | PreTrainedTokenizer | Yes | The tokenizer instance to extend with new tokens |
| model | PreTrainedModel | Yes | The model whose embedding layers will be resized |
Outputs:
None. The function modifies the tokenizer and model in-place:
- The tokenizer gains new token IDs for each added special token.
- The model embedding layers are resized from (V, d) to (V+k, d), where V is the original vocabulary size, k is the number of added tokens, and d is the hidden dimension.
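The resize from (V, d) to (V+k, d) with averaged initialization can be illustrated with a plain torch embedding. The sizes here are hypothetical and the snippet is independent of the BIPIA code:

```python
import torch

# Hypothetical sizes: original vocab V=10, hidden dim d=8, k=2 new tokens.
V, d, k = 10, 8, 2
old = torch.nn.Embedding(V, d)

# Grow the matrix from (V, d) to (V+k, d), keeping the original rows.
new = torch.nn.Embedding(V + k, d)
new.weight.data[:V] = old.weight.data

# Averaged initialization: the k new rows start at the mean of the V old rows.
new.weight.data[V:] = old.weight.data.mean(dim=0, keepdim=True)
```

In the real function, resize_token_embeddings performs the copy-and-grow step for both the input embeddings and the LM head in one call.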
Usage Examples
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Step 1: Load the pretrained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    model_max_length=512,
    padding_side="right",
    use_fast=False,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    cache_dir="/path/to/cache",
)

# Step 2: Add boundary tokens and resize embeddings
smart_tokenizer_and_embedding_resize(
    special_tokens_list=["<data>", "</data>"],
    tokenizer=tokenizer,
    model=model,
)

# After this call:
# - len(tokenizer) has increased by 2 (note: tokenizer.vocab_size does not
#   count added tokens, so use len(tokenizer) for the full vocabulary size)
# - tokenizer.convert_tokens_to_ids("<data>") returns a valid single token ID
# - model input/output embedding dimensions match the new vocabulary size
# - New embeddings are initialized to the mean of existing embeddings
```