Implementation:Microsoft LoRA Run Chinese Ref
Overview
The run_chinese_ref.py script generates Chinese whole-word masking reference files for BERT-based Masked Language Modeling (MLM) by combining LTP word segmentation with BERT tokenization.
Description
This script addresses a key challenge in Chinese MLM pre-training: BERT's WordPiece tokenizer splits Chinese text character-by-character, losing word boundary information needed for whole-word masking (WWM). The script bridges this gap by:
- Using the LTP (Language Technology Platform) tokenizer to perform Chinese word segmentation on the training corpus.
- Using the BERT tokenizer to produce subword token IDs for the same text.
- Comparing the two tokenizations to identify which BERT subtokens are internal to a Chinese word (marked with a ## prefix).
- Outputting a reference file where each line is a JSON array of token positions that should be masked together during WWM.
Key functions:
- _is_chinese_char(cp): Checks if a Unicode code point falls within CJK Unicode blocks (U+4E00-U+9FFF, U+3400-U+4DBF, etc.).
- is_chinese(word): Returns 1 if every character in the word is Chinese, 0 otherwise.
- get_chinese_word(tokens): Filters a token list to return only multi-character Chinese words.
- add_sub_symbol(bert_tokens, chinese_word_set): Marks internal characters of Chinese words with a ## prefix in the BERT token list using a greedy longest-match approach.
- prepare_ref(lines, ltp_tokenizer, bert_tokenizer): Orchestrates the full pipeline, processing lines in batches of 100, and returns reference IDs (positions of ##-prefixed Chinese subtokens).
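The greedy longest-match marking can be illustrated with a small, dependency-free sketch. This is not the script's code: the names is_cjk and mark_subtokens are hypothetical, the CJK check covers only two of the Unicode blocks, and the word set is hand-written in place of real LTP output.

```python
from typing import List, Set


def is_cjk(ch: str) -> bool:
    """Simplified check (assumption): main CJK block and Extension A only."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def mark_subtokens(tokens: List[str], words: Set[str]) -> List[str]:
    """Prefix non-initial characters of known words with '##' (greedy longest match)."""
    if not words:
        return list(tokens)
    out = list(tokens)
    max_len = max(len(w) for w in words)
    start = 0
    while start < len(out):
        matched = False
        if all(is_cjk(c) for c in out[start]):
            # Try the longest candidate span first, down to two tokens.
            for size in range(min(len(out) - start, max_len), 1, -1):
                if "".join(out[start:start + size]) in words:
                    for j in range(start + 1, start + size):
                        out[j] = "##" + out[j]
                    start += size
                    matched = True
                    break
        if not matched:
            start += 1
    return out


# Character-level tokens, as a BERT tokenizer would produce for Chinese text.
tokens = ["[CLS]", "我", "喜", "欢", "北", "京", "[SEP]"]
# Hypothetical segmenter output, already filtered to multi-character words.
words = {"喜欢", "北京"}

marked = mark_subtokens(tokens, words)
ref_positions = [i for i, t in enumerate(marked) if t.startswith("##")]
print(marked)         # ['[CLS]', '我', '喜', '##欢', '北', '##京', '[SEP]']
print(ref_positions)  # [3, 5]
```

Trying longer spans first means that when both a long word and one of its prefixes are in the word set, the long word wins.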
⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.
Usage
Use this script when:
- Preparing whole-word masking reference files for Chinese BERT or RoBERTa pre-training (e.g., RoBERTa-wwm-ext).
- Fine-tuning Chinese language models that require aligned word boundary information.
- Reproducing the tokenization convention of a target model that was trained with LTP-based word segmentation.
Code Reference
Source Location
examples/NLU/examples/legacy/run_chinese_ref.py (148 lines)
Signature
def _is_chinese_char(cp: int) -> bool: ...
def is_chinese(word: str) -> int: ...
def get_chinese_word(tokens: List[str]) -> list: ...
def add_sub_symbol(bert_tokens: List[str], chinese_word_set: set) -> List[str]: ...
def prepare_ref(lines: List[str], ltp_tokenizer: LTP, bert_tokenizer: BertTokenizer) -> list: ...
def main(args) -> None: ...
Import / CLI Usage
python examples/legacy/run_chinese_ref.py \
--file_name ./resources/chinese-demo.txt \
--ltp ./resources/ltp \
--bert ./resources/robert \
--save_path ./resources/ref.txt
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| --file_name | str (file path) | Path to the Chinese text corpus (one sentence per line), same as MLM training data. Default: ./resources/chinese-demo.txt |
| --ltp | str (path) | Path to LTP model resources for Chinese word segmentation. Default: ./resources/ltp |
| --bert | str (path) | Path to BERT tokenizer resources (e.g., RoBERTa-wwm-ext vocabulary). Default: ./resources/robert |
| --save_path | str (file path) | Output path for the reference file. Default: ./resources/ref.txt |
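
The four options above map naturally onto an argparse parser. The following is a minimal sketch under that assumption; the build_parser helper and the help strings are illustrative, while the flag names and defaults are the ones listed in the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: a parser exposing the four documented options."""
    parser = argparse.ArgumentParser(description="Prepare a Chinese WWM reference file")
    parser.add_argument("--file_name", type=str, default="./resources/chinese-demo.txt",
                        help="Chinese text corpus, one sentence per line")
    parser.add_argument("--ltp", type=str, default="./resources/ltp",
                        help="LTP model resources for word segmentation")
    parser.add_argument("--bert", type=str, default="./resources/robert",
                        help="BERT tokenizer resources")
    parser.add_argument("--save_path", type=str, default="./resources/ref.txt",
                        help="where to write the reference file")
    return parser


# Parsing an empty argument list yields the documented defaults.
args = build_parser().parse_args([])
print(args.file_name, args.ltp, args.bert, args.save_path)
```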
Outputs
| Output | Type | Description |
|---|---|---|
| Reference file | Text file | Each line is a JSON array of integer positions indicating which BERT subtokens are non-initial parts of Chinese whole words (i.e., positions where a ## prefix was added). Example line: [3, 5, 8] |
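
Because each line is an independent JSON array, the file can be written and read back with the standard json module. The sketch below assumes line i of the reference file corresponds to line i of the corpus; the positions are made up for illustration and the temporary path is arbitrary.

```python
import json
import os
import tempfile

# Made-up reference positions for three corpus lines (the second has no multi-character words).
ref_ids = [[3, 5], [], [2, 3, 6]]

path = os.path.join(tempfile.mkdtemp(), "ref.txt")

# Write one JSON array per line.
with open(path, "w", encoding="utf-8") as f:
    f.writelines(json.dumps(ref) + "\n" for ref in ref_ids)

# Read it back; blank lines are skipped defensively.
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)  # [[3, 5], [], [2, 3, 6]]
```

An empty array is still written for lines without multi-character Chinese words, which keeps the reference file aligned line by line with the corpus.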
Usage Examples
# Generate reference file for Chinese whole-word masking
python examples/legacy/run_chinese_ref.py \
--file_name /data/chinese_corpus.txt \
--ltp /models/ltp \
--bert /models/chinese-roberta-wwm-ext \
--save_path /data/ref.txt
# The output file will contain one JSON array per line:
# [2, 5, 9]
# [3, 4, 7, 8]
# []
# Each array lists the positions of tokens that continue a Chinese word
# (its non-initial characters), so that each one can be masked together
# with the preceding tokens of the same word during whole-word masking.
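One way a downstream masking step might consume these positions is to merge every listed position into the group of the token before it, so that each group is masked as a unit. The sketch below is an assumption about such a consumer, not code from the script; group_whole_words and its special parameter are hypothetical names.

```python
from typing import List


def group_whole_words(num_tokens: int, ref: List[int], special: List[int]) -> List[List[int]]:
    """Group token indices so that each position in `ref` joins the group before it."""
    ref_set = set(ref)
    special_set = set(special)
    groups: List[List[int]] = []
    for i in range(num_tokens):
        if i in special_set:
            continue  # never mask [CLS] / [SEP]
        if i in ref_set and groups:
            groups[-1].append(i)  # continuation of the previous word
        else:
            groups.append([i])    # start of a new word
    return groups


tokens = ["[CLS]", "我", "喜", "欢", "北", "京", "[SEP]"]
groups = group_whole_words(len(tokens), ref=[3, 5], special=[0, 6])
print(groups)  # [[1], [2, 3], [4, 5]]
```

A masking routine would then sample whole groups rather than individual positions, so that 喜欢 and 北京 are each masked or kept in full.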