Implementation:Hpcaitech ColossalAI Expand Vocab Tokenizer

Knowledge Sources	ColossalAI
Domains	NLP, Data_Engineering
Last Updated	2026-02-09 00:00 GMT

Overview

CLI tool for expanding LLaMA tokenizer vocabulary with new tokens, provided by Colossal-LLaMA.

Description

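expand_vocab_tokenizer() loads the SentencePiece model from the source tokenizer directory, adds the tokens passed in new_tokens to its vocabulary, and saves the expanded tokenizer to the target directory. The CLI wrapper in init_tokenizer.py reads the new tokens from a JSONL file and passes them to this function.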
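The idea behind vocabulary expansion can be illustrated with plain Python lists. This is only a conceptual sketch, not the ColossalAI implementation: the helper name expand_vocab is hypothetical, and skipping tokens that already exist is an assumption about sensible behavior rather than something this page documents. New pieces are appended at the end, so the ids of the original pieces do not change.

```python
from typing import List


def expand_vocab(existing_pieces: List[str], new_tokens: List[str]) -> List[str]:
    """Return a new vocabulary with unseen tokens appended at the end.

    Hypothetical helper for illustration only; duplicate skipping is an
    assumption, not documented behavior of expand_vocab_tokenizer().
    """
    seen = set(existing_pieces)
    expanded = list(existing_pieces)  # original pieces keep their positions (ids)
    for token in new_tokens:
        if token in seen:
            continue  # already in the vocabulary (or repeated in new_tokens)
        expanded.append(token)
        seen.add(token)
    return expanded


base_vocab = ["<unk>", "<s>", "</s>", "▁hello"]
expanded_vocab = expand_vocab(base_vocab, ["你好", "▁hello", "世界", "你好"])
print(len(base_vocab), "->", len(expanded_vocab))
```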

Usage

Run before continual pretraining to create an expanded tokenizer for multilingual training.

Code Reference

Source Location

Repository: ColossalAI
File: applications/Colossal-LLaMA/colossal_llama/tokenizer/init_tokenizer.py
Lines: 23-98

Signature

def expand_vocab_tokenizer(
    source_tokenizer_dir: Union[str, os.PathLike],
    target_tokenizer_dir: Union[str, os.PathLike],
    new_tokens: List[str],
) -> None:
    """
    Expand LLaMA tokenizer vocabulary with new tokens.

    Args:
        source_tokenizer_dir: Source LLaMA tokenizer directory
        target_tokenizer_dir: Output directory for expanded tokenizer
        new_tokens: List of new tokens to add
    """

Import

from colossal_llama.tokenizer.init_tokenizer import expand_vocab_tokenizer

I/O Contract

Inputs

Name	Type	Required	Description
source_tokenizer_dir	str or os.PathLike	Yes	Path to the original LLaMA tokenizer directory
target_tokenizer_dir	str or os.PathLike	Yes	Output directory for the expanded tokenizer
new_tokens	List[str]	Yes	New tokens to add (the CLI reads them from a JSONL file with one {"piece": "token"} object per line)
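When calling the function from Python rather than through the CLI, the new_tokens list has to be built by the caller. A minimal sketch of reading it from a JSONL file in the {"piece": "token"} format noted above follows. The helper name load_new_tokens and the demo file are hypothetical, and skipping blank lines is an assumption; the CLI's own parsing may differ.

```python
import json
import os
import tempfile
from typing import List, Union


def load_new_tokens(path: Union[str, os.PathLike]) -> List[str]:
    """Read one JSON object per line and collect its "piece" field.

    Hypothetical helper; not part of colossal_llama.
    """
    tokens: List[str] = []
    with open(path, "r", encoding="utf-8") as reader:
        for line in reader:
            line = line.strip()
            if not line:
                continue  # assumption: tolerate blank lines
            tokens.append(json.loads(line)["piece"])
    return tokens


# Demo on a throwaway file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "tokens.jsonl")
    with open(demo_path, "w", encoding="utf-8") as writer:
        writer.write('{"piece": "你好"}\n\n{"piece": "世界"}\n')
    demo_tokens = load_new_tokens(demo_path)

print(demo_tokens)
```

The resulting list can then be passed as the new_tokens argument of expand_vocab_tokenizer().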

Outputs

Name	Type	Description
Expanded tokenizer	Directory	Directory containing the SentencePiece model with the expanded vocabulary and the tokenizer config files

Usage Examples

python applications/Colossal-LLaMA/colossal_llama/tokenizer/init_tokenizer.py \
    --source_tokenizer_dir /models/llama-7b/tokenizer \
    --target_tokenizer_dir /models/llama-7b-chinese/tokenizer \
    --expand_tokens_file /data/chinese_tokens.jsonl
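The file passed to --expand_tokens_file has to exist before the command is run. A minimal sketch of producing it from a Python list, assuming the one-object-per-line {"piece": "token"} layout described in the I/O contract, is shown here. The helper name write_tokens_file and the sample tokens are hypothetical.

```python
import json
import os
import tempfile
from typing import Iterable, Union


def write_tokens_file(tokens: Iterable[str], path: Union[str, os.PathLike]) -> int:
    """Write each token as a {"piece": token} JSON line; return the line count.

    Hypothetical helper; not part of colossal_llama.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as writer:
        for token in tokens:
            # ensure_ascii=False keeps non-ASCII tokens human-readable in the file
            writer.write(json.dumps({"piece": token}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Demo: write to a temporary file and read the lines back.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "chinese_tokens.jsonl")
    written_count = write_tokens_file(["你好", "世界"], out_path)
    with open(out_path, "r", encoding="utf-8") as reader:
        written_lines = reader.read().splitlines()

print(written_count, written_lines)
```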

Related Pages

Implements Principle

Principle:Hpcaitech_ColossalAI_Tokenizer_Vocabulary_Expansion
