Principle:Hpcaitech ColossalAI Tokenizer Vocabulary Expansion

From Leeroopedia


Knowledge Sources
Domains NLP, Data_Engineering
Last Updated 2026-02-09 00:00 GMT

Overview

A tokenizer augmentation technique that expands a pretrained LLaMA tokenizer's vocabulary with new tokens to improve tokenization efficiency on multilingual or domain-specific text.

Description

When continually pretraining a LLaMA model on a new language (e.g., Chinese) or domain, the existing tokenizer may be inefficient: characters missing from its vocabulary are split into several tokens each, so the same text produces much longer sequences. Vocabulary expansion adds new tokens to the SentencePiece model, which improves tokenization efficiency and reduces sequence lengths. Because the vocabulary grows, the model's embedding and output projection layers must be expanded to match.
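The effect on sequence length can be illustrated with a toy tokenizer. This is a minimal sketch, not the real SentencePiece algorithm: `toy_tokenize`, the greedy longest-match strategy, and both vocabularies are hypothetical, and the UTF-8 byte fallback for unknown characters is an assumption made for illustration.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match over `vocab`; unmatched characters fall back to UTF-8 bytes."""
    max_len = max((len(p) for p in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No piece matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: the expanded one adds three Chinese pieces.
base_vocab = {"hello", " ", "world"}
expanded_vocab = base_vocab | {"你", "好", "世界"}

text = "hello 你好世界"
before = toy_tokenize(text, base_vocab)
after = toy_tokenize(text, expanded_vocab)
print(len(before), len(after))  # prints: 14 5
```

In this toy setup each Chinese character costs three byte tokens before expansion, while the expanded vocabulary covers the same four characters with three tokens.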

Usage

Use this principle before continual pretraining when the target data contains significant out-of-vocabulary content (e.g., Chinese characters for an English LLaMA model).
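One rough way to judge whether the target data contains significant out-of-vocabulary content is to measure how many of its characters appear in no vocabulary piece. The sketch below is an illustrative heuristic only: `uncovered_char_rate` and the sample vocabulary are hypothetical, and no particular threshold is implied by the source.

```python
def uncovered_char_rate(sample, vocab_pieces):
    """Fraction of non-whitespace characters in `sample` that appear in no vocabulary piece."""
    covered = set("".join(vocab_pieces))
    chars = [c for c in sample if not c.isspace()]
    if not chars:
        return 0.0
    missing = sum(1 for c in chars if c not in covered)
    return missing / len(chars)


# Hypothetical English-only pieces ("▁" marks a word boundary, as in SentencePiece).
vocab_pieces = ["▁the", "▁model", "ing", "s"]

print(uncovered_char_rate("the models", vocab_pieces))   # prints: 0.0
print(uncovered_char_rate("模型 models", vocab_pieces))  # prints: 0.25
```

A real check would run over a representative sample of the pretraining corpus and use the pieces of the actual tokenizer.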

Theoretical Basis

The expansion process:

  1. Load the existing SentencePiece model protobuf.
  2. Add the new token pieces, each with a default score.
  3. Save the expanded SentencePiece model.
  4. Regenerate the HuggingFace tokenizer files from the new model.
  5. Resize the model's embedding and output projection layers to match the new vocabulary size.
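The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the ColossalAI implementation: the function names are hypothetical, a score of `0.0` is assumed as the default, and initializing new embedding rows to the mean of the existing rows is a common heuristic that the source does not specify. Steps 1 to 3 rely on the protobuf module shipped with the `sentencepiece` package, which is imported lazily and also requires a compatible `protobuf` installation.

```python
def select_new_pieces(existing_pieces, candidates):
    """Return candidates not already in the vocabulary, in order, without duplicates."""
    seen = set(existing_pieces)
    fresh = []
    for token in candidates:
        if token and token not in seen:
            seen.add(token)
            fresh.append(token)
    return fresh


def expand_sentencepiece_model(src_path, dst_path, candidates, score=0.0):
    """Steps 1-3: load the model protobuf, append new pieces, save the result."""
    # Imported lazily so the pure-Python helpers work without sentencepiece installed.
    from sentencepiece import sentencepiece_model_pb2 as sp_pb2

    proto = sp_pb2.ModelProto()
    with open(src_path, "rb") as f:
        proto.ParseFromString(f.read())

    existing = [p.piece for p in proto.pieces]
    for token in select_new_pieces(existing, candidates):
        piece = proto.pieces.add()  # appends an empty SentencePiece entry
        piece.piece = token
        piece.score = score  # assumed default score

    with open(dst_path, "wb") as f:
        f.write(proto.SerializeToString())
    return len(proto.pieces)  # new vocabulary size


def resize_embedding_rows(rows, new_vocab_size):
    """Step 5 (toy): extend a list-of-lists embedding table with mean-initialized rows."""
    if new_vocab_size < len(rows):
        raise ValueError("shrinking is not handled in this sketch")
    dim = len(rows[0])
    # Assumption: new rows start at the column-wise mean of the existing rows.
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    extra = [list(mean) for _ in range(new_vocab_size - len(rows))]
    return [list(r) for r in rows] + extra


print(select_new_pieces(["a", "b"], ["b", "c", "c", "d"]))  # prints: ['c', 'd']
```

In practice, steps 4 and 5 are usually handled with Hugging Face `transformers` rather than by hand, for example by constructing a `LlamaTokenizer` from the saved model file, calling `save_pretrained`, and then calling `resize_token_embeddings(len(tokenizer))` on the model. The toy `resize_embedding_rows` only shows the shape change that such a resize implies.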

Related Pages

Implemented By
