Implementation:Hpcaitech ColossalAI Expand Vocab Tokenizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
CLI tool for expanding LLaMA tokenizer vocabulary with new tokens, provided by Colossal-LLaMA.
Description
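expand_vocab_tokenizer() loads the SentencePiece model from source_tokenizer_dir, adds the tokens in new_tokens to its vocabulary, and saves the expanded tokenizer to target_tokenizer_dir. When init_tokenizer.py is run from the command line, it reads the new tokens from the JSONL file given by --expand_tokens_file and passes them to this function.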
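The vocabulary-expansion step can be pictured as a de-duplicating append. The sketch below is illustrative only and works on plain string lists rather than a real SentencePiece model. The helper name merge_pieces is hypothetical, and skipping tokens that already exist is an assumption, not something this page confirms about the actual function.

```python
from typing import List


def merge_pieces(existing_pieces: List[str], new_tokens: List[str]) -> List[str]:
    """Append tokens that are not yet in the vocabulary, preserving order.

    Hypothetical helper: assumes tokens already present are skipped, so the
    positions (IDs) of the original pieces do not change.
    """
    seen = set(existing_pieces)
    merged = list(existing_pieces)
    for token in new_tokens:
        if not isinstance(token, str):
            raise TypeError(f"Invalid token {token!r} of type {type(token)}")
        if token in seen:
            continue  # already in the vocabulary (or a duplicate in new_tokens)
        seen.add(token)
        merged.append(token)
    return merged


# Toy vocabulary; a real LLaMA tokenizer has tens of thousands of pieces.
expanded = merge_pieces(["<unk>", "<s>", "</s>", "▁the"], ["你好", "▁the", "世界"])
print(len(expanded))  # 6
```

Because new pieces are only appended in this sketch, every original token keeps its index. That property matters if the model's embedding matrix is later resized to match the new vocabulary.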
Usage
Run before continual pretraining to create an expanded tokenizer for multilingual training.
Code Reference
Source Location
- Repository: ColossalAI
- File: applications/Colossal-LLaMA/colossal_llama/tokenizer/init_tokenizer.py
- Lines: 23-98
Signature
def expand_vocab_tokenizer(
    source_tokenizer_dir: Union[str, os.PathLike],
    target_tokenizer_dir: Union[str, os.PathLike],
    new_tokens: List[str],
) -> None:
    """
    Expand LLaMA tokenizer vocabulary with new tokens.

    Args:
        source_tokenizer_dir: Source LLaMA tokenizer directory
        target_tokenizer_dir: Output directory for expanded tokenizer
        new_tokens: List of new tokens to add
    """
Import
from colossal_llama.tokenizer.init_tokenizer import expand_vocab_tokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| source_tokenizer_dir | Union[str, os.PathLike] | Yes | Path to the original LLaMA tokenizer directory |
| target_tokenizer_dir | Union[str, os.PathLike] | Yes | Output directory for the expanded tokenizer |
| new_tokens | List[str] | Yes | New tokens to add; the CLI reads them from a JSONL file with one {"piece": "token"} object per line |
Outputs
| Name | Type | Description |
|---|---|---|
| Expanded tokenizer | Directory | The function returns None; it writes the expanded SentencePiece model and tokenizer config files to target_tokenizer_dir |
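When calling the function from Python rather than through the CLI, the new_tokens list has to be built by the caller. The following is a minimal sketch of loading it from a JSONL file in the {"piece": "token"} format described above. The helper name load_new_tokens is hypothetical, and dropping duplicates and blank lines is an assumption made for robustness.

```python
import json
from pathlib import Path
from typing import List, Union


def load_new_tokens(jsonl_path: Union[str, Path]) -> List[str]:
    """Read one {"piece": "<token>"} object per line and return unique tokens.

    Hypothetical helper: first-seen order is kept and blank lines are skipped.
    """
    tokens: List[str] = []
    seen = set()
    with open(jsonl_path, "r", encoding="utf-8") as reader:
        for line in reader:
            line = line.strip()
            if not line:
                continue  # tolerate trailing or empty lines
            piece = json.loads(line)["piece"]
            if piece not in seen:
                seen.add(piece)
                tokens.append(piece)
    return tokens
```

The returned list can then be passed as the new_tokens argument of expand_vocab_tokenizer.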
Usage Examples
python applications/Colossal-LLaMA/colossal_llama/tokenizer/init_tokenizer.py \
--source_tokenizer_dir /models/llama-7b/tokenizer \
--target_tokenizer_dir /models/llama-7b-chinese/tokenizer \
--expand_tokens_file /data/chinese_tokens.jsonl
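The file passed via --expand_tokens_file must exist before the command is run. As a rough sketch, a token list could be written in the {"piece": "token"} JSONL format like this. The helper name write_tokens_jsonl is hypothetical and not part of Colossal-LLaMA.

```python
import json
from pathlib import Path
from typing import Iterable, Union


def write_tokens_jsonl(tokens: Iterable[str], output_path: Union[str, Path]) -> int:
    """Write one {"piece": "<token>"} JSON object per line; return the line count."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as writer:
        for token in tokens:
            # ensure_ascii=False keeps non-ASCII tokens human-readable in the file.
            writer.write(json.dumps({"piece": token}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, write_tokens_jsonl(["你好", "世界"], "/data/chinese_tokens.jsonl") would produce a two-line file suitable for the command above.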
Related Pages
Implements Principle