Principle:Hpcaitech ColossalAI Tokenizer Vocabulary Expansion

Knowledge Sources	ColossalAI SentencePiece
Domains	NLP, Data_Engineering
Last Updated	2026-02-09 00:00 GMT

Overview

A tokenizer augmentation technique that adds new tokens to a pretrained LLaMA tokenizer's vocabulary so that multilingual or domain-specific text is tokenized more efficiently.

Description

When continually pretraining a LLaMA model on a new language (e.g., Chinese) or domain, the existing tokenizer may be inefficient: a character that is missing from the vocabulary is split into several byte-level or subword tokens. Vocabulary expansion adds new token pieces to the SentencePiece model, so the same text encodes to fewer tokens and sequences become shorter. Because the vocabulary grows, the model's input embedding and output projection layers must be expanded to the new vocabulary size.
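The layer expansion mentioned above can be illustrated with a small, framework-free sketch. It is not the ColossalAI implementation: the function name `expand_embedding_rows` is hypothetical, and initializing each new row to the mean of the existing rows is an assumed strategy chosen only for illustration. In practice the same idea is applied to tensors (for example, Hugging Face Transformers exposes `resize_token_embeddings` for this purpose).

```python
from statistics import fmean


def expand_embedding_rows(rows, new_vocab_size):
    """Return a copy of `rows` grown to `new_vocab_size` rows.

    `rows` is a list of equal-length float lists, one per existing token.
    New rows are set to the column-wise mean of the existing rows
    (an assumed initialization, not necessarily what ColossalAI does).
    """
    old_vocab_size = len(rows)
    if new_vocab_size < old_vocab_size:
        raise ValueError("new vocabulary must not be smaller than the old one")

    # Column-wise mean over all existing token embeddings.
    mean_row = [fmean(column) for column in zip(*rows)]

    # Copy the old rows unchanged, then append one fresh copy of the mean per new token.
    expanded = [list(row) for row in rows]
    expanded += [list(mean_row) for _ in range(new_vocab_size - old_vocab_size)]
    return expanded


if __name__ == "__main__":
    old = [[1.0, 2.0], [3.0, 4.0]]           # 2 tokens, hidden size 2
    new = expand_embedding_rows(old, 4)       # vocabulary grows to 4 tokens
    print(len(new), new[2])
```

The output projection layer has one row per vocabulary entry as well, so the same resizing would apply to it.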

Usage

Use this principle before continual pretraining when a large share of the target data is poorly covered by the existing vocabulary (e.g., Chinese characters for an English-centric LLaMA tokenizer).
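One rough way to judge whether the target data is poorly covered is to estimate how many tokens a sample would cost if every uncovered character fell back to its UTF-8 bytes, which is how LLaMA's SentencePiece tokenizer handles unknown characters. The sketch below is only an approximation: the helper name `byte_fallback_token_count` and the `known` character set are hypothetical, and multi-character merges are ignored.

```python
def byte_fallback_token_count(text, known=frozenset()):
    """Estimate the token count when characters outside `known` cost one token per UTF-8 byte.

    Characters in `known` are counted as a single token each. This ignores
    multi-character pieces, so it is a rough estimate, not a real tokenizer.
    """
    return sum(1 if ch in known else len(ch.encode("utf-8")) for ch in text)


if __name__ == "__main__":
    sample = "你好世界"  # four CJK characters, three UTF-8 bytes each
    print(byte_fallback_token_count(sample))                     # 12
    print(byte_fallback_token_count(sample, frozenset("你好")))  # 8
```

A high ratio of estimated tokens to characters suggests that adding dedicated pieces for the target language would shorten sequences noticeably.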

Theoretical Basis

The expansion process:

1. Load the existing SentencePiece model protobuf.
2. Add the new token pieces, each with a default score.
3. Save the expanded SentencePiece model.
4. Regenerate the HuggingFace tokenizer files from the new model.
5. Resize the model's embedding layers to match the new vocabulary size.
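The tokenizer-side steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ColossalAI implementation: the function names, the file paths in the usage comment, and the default score of `0.0` are all hypothetical. The sketch relies on the `sentencepiece_model_pb2` module shipped with the `sentencepiece` package and on `LlamaTokenizer` from `transformers`; both are imported lazily so the piece-adding logic can be read and exercised on its own.

```python
def load_model_proto(spm_path):
    """Parse a SentencePiece .model file into its protobuf representation."""
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2  # lazy import
    proto = sp_pb2.ModelProto()
    with open(spm_path, "rb") as f:
        proto.ParseFromString(f.read())
    return proto


def add_pieces(model_proto, new_tokens, score=0.0):
    """Append tokens that are not already in the vocabulary; return how many were added.

    `score=0.0` is an assumed default for illustration only.
    """
    existing = {p.piece for p in model_proto.pieces}
    added = 0
    for token in new_tokens:
        if token in existing:
            continue  # skip duplicates and tokens the model already has
        piece = model_proto.pieces.add()  # repeated-field add() returns the new entry
        piece.piece = token
        piece.score = score
        existing.add(token)
        added += 1
    return added


def save_model_proto(model_proto, out_path):
    """Serialize the expanded protobuf back to a SentencePiece .model file."""
    with open(out_path, "wb") as f:
        f.write(model_proto.SerializeToString())


def export_hf_tokenizer(spm_path, out_dir):
    """Regenerate HuggingFace tokenizer files from a SentencePiece model file."""
    from transformers import LlamaTokenizer  # lazy import
    tokenizer = LlamaTokenizer(vocab_file=spm_path)
    tokenizer.save_pretrained(out_dir)
    return len(tokenizer)  # the size the model's embedding layers must match


# Hypothetical usage (paths are placeholders):
#   proto = load_model_proto("llama/tokenizer.model")
#   add_pieces(proto, ["中国", "北京"])
#   save_model_proto(proto, "expanded/tokenizer.model")
#   new_vocab_size = export_hf_tokenizer("expanded/tokenizer.model", "expanded/")
```

The value returned by `export_hf_tokenizer` is the vocabulary size that the final step, resizing the embedding layers, would need to target.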

Related Pages

Implemented By

Implementation:Hpcaitech_ColossalAI_Expand_Vocab_Tokenizer
