Implementation:Hpcaitech ColossalAI Expand Vocab Tokenizer

From Leeroopedia


Knowledge Sources
Domains: NLP, Data_Engineering
Last Updated: 2026-02-09 00:00 GMT

Overview

CLI tool for expanding LLaMA tokenizer vocabulary with new tokens, provided by Colossal-LLaMA.

Description

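expand_vocab_tokenizer() loads the SentencePiece model from the source LLaMA tokenizer directory, adds the tokens passed in new_tokens, and saves the expanded tokenizer to the target directory. Run as a script, init_tokenizer.py provides command-line access to the function and reads the new tokens from a JSONL file.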
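The "add new tokens" step can be pictured with plain Python lists. The sketch below is only an illustration, not the Colossal-LLaMA implementation: merge_new_tokens is a hypothetical helper, and it assumes (this page does not confirm it) that tokens already in the vocabulary are skipped and that new pieces are appended at the end so existing token ids stay unchanged.

```python
from typing import Iterable, List


def merge_new_tokens(existing_pieces: List[str], new_tokens: Iterable[str]) -> List[str]:
    """Return the vocabulary with unseen tokens appended at the end.

    Hypothetical helper. Assumption: tokens already present are skipped and
    existing pieces keep their positions, so their ids stay stable.
    """
    seen = set(existing_pieces)
    merged = list(existing_pieces)
    for token in new_tokens:
        if not isinstance(token, str):
            raise TypeError(f"token must be str, got {type(token).__name__}")
        if token in seen:
            continue  # already in the vocabulary, or repeated in the input
        seen.add(token)
        merged.append(token)
    return merged


# Toy vocabulary standing in for the pieces of a SentencePiece model.
vocab = ["<unk>", "<s>", "</s>", "▁the"]
expanded = merge_new_tokens(vocab, ["你好", "▁the", "世界", "你好"])
print(expanded)
```

Appending rather than inserting is the design choice that matters here: any model weights indexed by the original token ids would remain aligned with the original part of the vocabulary.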

Usage

Run this tool before continual pretraining to build the expanded tokenizer used for multilingual training.

Code Reference

Source Location

  • Repository: ColossalAI
  • File: applications/Colossal-LLaMA/colossal_llama/tokenizer/init_tokenizer.py
  • Lines: 23-98

Signature

def expand_vocab_tokenizer(
    source_tokenizer_dir: Union[str, os.PathLike],
    target_tokenizer_dir: Union[str, os.PathLike],
    new_tokens: List[str],
) -> None:
    """
    Expand LLaMA tokenizer vocabulary with new tokens.

    Args:
        source_tokenizer_dir: Source LLaMA tokenizer directory
        target_tokenizer_dir: Output directory for expanded tokenizer
        new_tokens: List of new tokens to add
    """

Import

from colossal_llama.tokenizer.init_tokenizer import expand_vocab_tokenizer

I/O Contract

Inputs

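Name | Type | Required | Description
source_tokenizer_dir | str or os.PathLike | Yes | Path to the original LLaMA tokenizer directory
target_tokenizer_dir | str or os.PathLike | Yes | Output directory for the expanded tokenizer
new_tokens | List[str] | Yes | New tokens to add; the CLI reads them from a JSONL file with lines of the form {"piece": "token"}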
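To pass new_tokens when calling the function directly, the JSONL file has to be turned into a list of strings first. A minimal sketch follows, assuming one JSON object per line with a "piece" key as described above; load_new_tokens is a hypothetical helper written for this page, not a function from the repository.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def load_new_tokens(jsonl_path: str) -> List[str]:
    """Read one JSON object per line and collect its "piece" value (hypothetical helper)."""
    tokens: List[str] = []
    with open(jsonl_path, encoding="utf-8") as fp:
        for line_no, line in enumerate(fp, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            piece = record.get("piece") if isinstance(record, dict) else None
            if not isinstance(piece, str):
                raise ValueError(f"line {line_no}: missing or non-string 'piece'")
            tokens.append(piece)
    return tokens


# Demo with a temporary file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "tokens.jsonl"
    demo_file.write_text('{"piece": "你好"}\n\n{"piece": "世界"}\n', encoding="utf-8")
    loaded = load_new_tokens(str(demo_file))

print(loaded)
```

The resulting list can then be passed as the new_tokens argument of expand_vocab_tokenizer().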

Outputs

Name | Type | Description
Expanded tokenizer | Directory | Directory containing the expanded-vocabulary SentencePiece model and the tokenizer config files

Usage Examples

python applications/Colossal-LLaMA/colossal_llama/tokenizer/init_tokenizer.py \
    --source_tokenizer_dir /models/llama-7b/tokenizer \
    --target_tokenizer_dir /models/llama-7b-chinese/tokenizer \
    --expand_tokens_file /data/chinese_tokens.jsonl
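The command above expects the token file to exist already. As a sketch under stated assumptions, the following writes a list of tokens in the {"piece": "token"} format and assembles the same command line without running it. write_tokens_jsonl and build_command are hypothetical helpers, the token list is made up for illustration, and the paths are the ones from the example above.

```python
import json
import shlex
import tempfile
from pathlib import Path
from typing import List, Sequence


def write_tokens_jsonl(tokens: Sequence[str], path: Path) -> None:
    """Write one {"piece": token} object per line (hypothetical helper)."""
    with path.open("w", encoding="utf-8") as fp:
        for token in tokens:
            # ensure_ascii=False keeps non-ASCII tokens human-readable in the file.
            fp.write(json.dumps({"piece": token}, ensure_ascii=False) + "\n")


def build_command(source_dir: str, target_dir: str, tokens_file: str) -> List[str]:
    """Assemble the CLI invocation shown above as an argument list (not executed here)."""
    return [
        "python",
        "applications/Colossal-LLaMA/colossal_llama/tokenizer/init_tokenizer.py",
        "--source_tokenizer_dir", source_dir,
        "--target_tokenizer_dir", target_dir,
        "--expand_tokens_file", tokens_file,
    ]


with tempfile.TemporaryDirectory() as tmp:
    tokens_path = Path(tmp) / "chinese_tokens.jsonl"
    write_tokens_jsonl(["你好", "世界"], tokens_path)
    written_lines = tokens_path.read_text(encoding="utf-8").splitlines()
    command = build_command(
        "/models/llama-7b/tokenizer",
        "/models/llama-7b-chinese/tokenizer",
        str(tokens_path),
    )

print(written_lines)
print(shlex.join(command))
```

From the repository root, the argument list could be handed to subprocess.run(command, check=True) while the token file still exists.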
