Principle:Hpcaitech ColossalAI Tokenizer Vocabulary Expansion
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A tokenizer augmentation technique that expands a pretrained LLaMA tokenizer's vocabulary with new tokens for improved multilingual or domain-specific tokenization efficiency.
Description
When continually pretraining a LLaMA model on a new language (e.g., Chinese) or domain, the existing tokenizer may be inefficient: it can split a single character into several subword or byte-level tokens. Vocabulary expansion adds new tokens to the SentencePiece model, so the same text is encoded with fewer tokens and sequences become shorter. Because the vocabulary grows, the model's embedding and output projection layers must be expanded to match.
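The effect on sequence length can be illustrated with a toy tokenizer. The sketch below is not SentencePiece's algorithm: it assumes a simplified greedy longest-match scheme that falls back to one token per UTF-8 byte for text the vocabulary does not cover. The function `count_tokens` and both vocabularies are hypothetical names made up for illustration.

```python
def count_tokens(text, vocab):
    """Count tokens under greedy longest-match with UTF-8 byte fallback.

    Simplified stand-in for a real subword tokenizer (an assumption for
    illustration, not the actual SentencePiece algorithm).
    """
    max_len = max((len(t) for t in vocab), default=1)
    i = 0
    count = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab:
                count += 1
                i += n
                break
        else:
            # No vocabulary entry matches: emit one token per UTF-8 byte.
            count += len(text[i].encode("utf-8"))
            i += 1
    return count


# Hypothetical vocabularies: an English-only base and an expanded version.
base_vocab = {"hello", " ", "world"}
expanded_vocab = base_vocab | {"你", "好", "你好"}

print(count_tokens("你好", base_vocab))      # 6 (three bytes per character)
print(count_tokens("你好", expanded_vocab))  # 1
```

In this toy setting, the two-character string drops from six tokens to one after the vocabulary is expanded.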
Usage
Use this principle before continual pretraining when a significant share of the target data is poorly covered by the original vocabulary (e.g., Chinese characters for an English LLaMA model).
Theoretical Basis
The expansion process consists of the following steps:
- Load the existing SentencePiece model protobuf
- Add the new token pieces, each with a default score
- Save the expanded SentencePiece model
- Regenerate the HuggingFace tokenizer files from the new model
- Resize the model's embedding layers to match the new vocabulary size