Implementation:Microsoft LoRA Run Chinese Ref

Overview

The run_chinese_ref.py script generates Chinese whole-word masking reference files for BERT-based Masked Language Modeling (MLM) by combining LTP word segmentation with BERT tokenization.

Description

This script addresses a key challenge in Chinese MLM pre-training: BERT's WordPiece tokenizer splits Chinese text character-by-character, losing word boundary information needed for whole-word masking (WWM). The script bridges this gap by:

Using the LTP (Language Technology Platform) tokenizer to perform Chinese word segmentation on the training corpus.
Using the BERT tokenizer to produce subword token IDs for the same text.
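Comparing the two tokenizations to find BERT tokens that are non-initial characters of an LTP-segmented Chinese word, and marking those tokens with a ## prefix.
Outputting a reference file where each line is a JSON array holding the positions of the ##-marked tokens, so that the masking step can mask each such token together with the preceding tokens of the same word.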

Key functions:

_is_chinese_char(cp): Checks if a Unicode code point falls within CJK Unicode blocks (U+4E00-U+9FFF, U+3400-U+4DBF, etc.).
is_chinese(word): Returns 1 if every character in the word is Chinese, 0 otherwise.
get_chinese_word(tokens): Filters a token list to return only multi-character Chinese words.
add_sub_symbol(bert_tokens, chinese_word_set): Adds a ## prefix to every non-initial character of a Chinese word in the BERT token list, matching words with a greedy longest-match scan.
prepare_ref(lines, ltp_tokenizer, bert_tokenizer): Orchestrates the full pipeline, processing lines in batches of 100, and returns reference IDs (positions of ##-prefixed Chinese subtokens).
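
The marking and position-extraction steps can be illustrated with a minimal sketch. It is not the script's code: is_cjk, mark_subwords, and ref_positions are hypothetical names, the CJK check covers only the two ranges named above, and the token list and word set are hand-written stand-ins for real BERT and LTP output.

```python
from typing import Iterable, List


def is_cjk(text: str) -> bool:
    """Simplified stand-in for is_chinese: checks only U+4E00-U+9FFF and U+3400-U+4DBF."""
    return len(text) > 0 and all(
        0x4E00 <= ord(ch) <= 0x9FFF or 0x3400 <= ord(ch) <= 0x4DBF for ch in text
    )


def mark_subwords(tokens: List[str], words: Iterable[str]) -> List[str]:
    """Prefix every non-initial character of a matched word with '##'."""
    word_set = set(words)
    if not word_set:
        return list(tokens)
    max_len = max(len(w) for w in word_set)
    marked = list(tokens)
    start = 0
    while start < len(marked):
        matched = False
        if is_cjk(marked[start]):
            # Try the longest candidate first, shrinking down to two tokens.
            longest = min(len(marked) - start, max_len)
            for size in range(longest, 1, -1):
                if "".join(marked[start:start + size]) in word_set:
                    for j in range(start + 1, start + size):
                        marked[j] = "##" + marked[j]
                    start += size
                    matched = True
                    break
        if not matched:
            start += 1
    return marked


def ref_positions(marked: List[str]) -> List[int]:
    """Positions of '##'-prefixed tokens whose remainder is a single CJK character."""
    return [
        i for i, tok in enumerate(marked)
        if tok.startswith("##") and len(tok) == 3 and is_cjk(tok[2:])
    ]


# Hand-written example: the BERT tokens are single characters, the words come from segmentation.
tokens = ["[CLS]", "我", "喜", "欢", "北", "京", "[SEP]"]
words = ["喜欢", "北京"]
marked = mark_subwords(tokens, words)
print(marked)                 # ['[CLS]', '我', '喜', '##欢', '北', '##京', '[SEP]']
print(ref_positions(marked))  # [3, 5]
```

Trying the longest span first means that when one segmented word is a prefix of another, the longer word wins at that position.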

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this script when:

Preparing whole-word masking reference files for Chinese BERT or RoBERTa pre-training (e.g., RoBERTa-wwm-ext).
Fine-tuning Chinese language models that require aligned word boundary information.
The target model was trained with LTP-based word segmentation and the same tokenization convention must be reproduced.

Code Reference

Source Location

examples/NLU/examples/legacy/run_chinese_ref.py (148 lines)

Signature

def _is_chinese_char(cp: int) -> bool: ...
def is_chinese(word: str) -> int: ...
def get_chinese_word(tokens: List[str]) -> list: ...
def add_sub_symbol(bert_tokens: List[str], chinese_word_set: set) -> List[str]: ...
def prepare_ref(lines: List[str], ltp_tokenizer: LTP, bert_tokenizer: BertTokenizer) -> list: ...
def main(args) -> None: ...

Import / CLI Usage

python examples/legacy/run_chinese_ref.py \
    --file_name ./resources/chinese-demo.txt \
    --ltp ./resources/ltp \
    --bert ./resources/robert \
    --save_path ./resources/ref.txt

I/O Contract

Inputs

Input	Type	Description
`--file_name`	str (file path)	Path to the Chinese text corpus (one sentence per line); this should be the same file used as MLM training data. Default: `./resources/chinese-demo.txt`
`--ltp`	str (path)	Path to LTP model resources for Chinese word segmentation. Default: `./resources/ltp`
`--bert`	str (path)	Path to BERT tokenizer resources (e.g., RoBERTa-wwm-ext vocabulary). Default: `./resources/robert`
`--save_path`	str (file path)	Output path for the reference file. Default: `./resources/ref.txt`

Outputs

Output	Type	Description
Reference file	Text file	Each line is a JSON array of integer positions indicating which BERT subtokens are internal parts of Chinese whole words (i.e., positions where `##` prefix was added). Example line: `[3, 5, 8]`
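
The line format can be sketched with the standard json module. The helper names dump_ref_lines and load_ref_lines are hypothetical, and the sketch assumes that line i of the reference file corresponds to sentence i of the corpus, so empty arrays are kept rather than skipped.

```python
import json
from typing import List


def dump_ref_lines(ref_ids: List[List[int]]) -> List[str]:
    """Serialize each list of positions as one JSON array per line."""
    return [json.dumps(ref) for ref in ref_ids]


def load_ref_lines(lines: List[str]) -> List[List[int]]:
    """Parse one JSON array per line, preserving order and empty arrays."""
    return [json.loads(line) for line in lines]


encoded = dump_ref_lines([[3, 5, 8], []])
print(encoded)  # ['[3, 5, 8]', '[]']
print(load_ref_lines(["[3, 5, 8]\n", "[]\n"]))  # [[3, 5, 8], []]
```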

Usage Examples

# Generate reference file for Chinese whole-word masking
python examples/legacy/run_chinese_ref.py \
    --file_name /data/chinese_corpus.txt \
    --ltp /models/ltp \
    --bert /models/chinese-roberta-wwm-ext \
    --save_path /data/ref.txt

# The output file will contain one JSON array per line:
# [2, 5, 9]
# [3, 4, 7, 8]
# []
# Each array lists token positions that are internal to Chinese words
# and should be masked together during whole-word masking.
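
How a downstream masking step might consume one such array can be sketched as follows. This is an assumption about the consumer, not part of the script: group_whole_words is a hypothetical helper, and the special-token positions are supplied by hand.

```python
from typing import Iterable, List


def group_whole_words(
    num_tokens: int, ref: Iterable[int], special: Iterable[int] = ()
) -> List[List[int]]:
    """Group token positions into whole-word spans.

    A position listed in `ref` is attached to the span ending at the previous
    position; every other non-special position starts a new span.
    """
    ref_set = set(ref)
    special_set = set(special)
    groups: List[List[int]] = []
    for i in range(num_tokens):
        if i in special_set:
            continue  # e.g. [CLS] and [SEP] are never masked
        if i in ref_set and groups and groups[-1][-1] == i - 1:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


# Seven tokens with [CLS] at position 0 and [SEP] at position 6, reference array [3, 5].
print(group_whole_words(7, [3, 5], special=[0, 6]))  # [[1], [2, 3], [4, 5]]
```

A masking routine could then sample whole spans instead of single positions, so that all characters of a word are masked or kept together.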
