Implementation:Microsoft LoRA Run Chinese Ref

From Leeroopedia



Overview

The run_chinese_ref.py script generates Chinese whole-word masking reference files for BERT-based Masked Language Modeling (MLM) by combining LTP word segmentation with BERT tokenization.

Description

This script addresses a key challenge in Chinese MLM pre-training: BERT's WordPiece tokenizer splits Chinese text character-by-character, losing word boundary information needed for whole-word masking (WWM). The script bridges this gap by:

  1. Using the LTP (Language Technology Platform) tokenizer to perform Chinese word segmentation on the training corpus.
  2. Using the BERT tokenizer to produce subword token IDs for the same text.
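  3. Comparing the two tokenizations to identify which BERT tokens continue a Chinese word, that is, every character after the first one in the word. These tokens are marked with a ## prefix.
  4. Outputting a reference file where each line is a JSON array holding the positions of those ##-marked tokens for one input sentence, so that whole-word masking can mask each of them together with the token that starts the word.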
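
The four steps can be traced on a toy sentence without loading either model. The following is a minimal sketch, assuming the LTP segmentation and the BERT tokens are already available as plain Python lists (both are hard-coded, hypothetical inputs here). The names `mark_subtokens` and `ref_positions` are illustrative and do not appear in the script, and the CJK check is reduced to the U+4E00-U+9FFF block for brevity.

```python
from typing import List, Set


def mark_subtokens(bert_tokens: List[str], words: Set[str]) -> List[str]:
    """Prefix every non-initial character of a known word with '##'."""
    if not words:
        return list(bert_tokens)
    tokens = list(bert_tokens)
    max_len = max(len(w) for w in words)
    start = 0
    while start < len(tokens):
        matched = False
        # Greedy longest match: try the widest span first, down to length 2.
        for size in range(min(len(tokens) - start, max_len), 1, -1):
            if "".join(tokens[start:start + size]) in words:
                for j in range(start + 1, start + size):
                    tokens[j] = "##" + tokens[j]
                start += size
                matched = True
                break
        if not matched:
            start += 1
    return tokens


def ref_positions(tokens: List[str]) -> List[int]:
    """Positions of '##' + single CJK character tokens (basic block only)."""
    return [
        i
        for i, tok in enumerate(tokens)
        if tok.startswith("##") and len(tok) == 3 and "\u4e00" <= tok[2] <= "\u9fff"
    ]


# Hypothetical inputs for the sentence 我喜欢北京.
ltp_words = {"喜欢", "北京"}  # multi-character words from step 1
bert_tokens = ["[CLS]", "我", "喜", "欢", "北", "京", "[SEP]"]  # step 2

marked = mark_subtokens(bert_tokens, ltp_words)  # step 3
refs = ref_positions(marked)  # step 4
print(marked)
print(refs)
```

For this input the reference line would be `[3, 5]`: positions 3 and 5 hold the second characters of 喜欢 and 北京, counted with `[CLS]` at position 0.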

Key functions:

  • _is_chinese_char(cp): Checks if a Unicode code point falls within CJK Unicode blocks (U+4E00-U+9FFF, U+3400-U+4DBF, etc.).
  • is_chinese(word): Returns 1 if every character in the word is Chinese, 0 otherwise.
  • get_chinese_word(tokens): Filters a token list to return only multi-character Chinese words.
  • add_sub_symbol(bert_tokens, chinese_word_set): Marks every non-initial character of a matched Chinese word with a ## prefix in the BERT token list, using a greedy longest-match scan. Returns the token list unchanged when the word set is empty.
  • prepare_ref(lines, ltp_tokenizer, bert_tokenizer): Orchestrates the full pipeline, processing lines in batches of 100, and returns reference IDs (positions of ##-prefixed Chinese subtokens).
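
The character and word filters described above can be sketched as follows. This is an illustration rather than the script's code: it checks only the two ranges named in the list (the script covers further CJK blocks), the helper name `is_cjk_codepoint` is hypothetical, and the result of `get_chinese_word` is sorted here purely to make the output deterministic.

```python
from typing import List


def is_cjk_codepoint(cp: int) -> bool:
    """True if cp is in U+4E00-U+9FFF or U+3400-U+4DBF (subset of the real check)."""
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def is_chinese(word: str) -> int:
    """Return 1 if every character of word is a CJK character, else 0."""
    return int(all(is_cjk_codepoint(ord(ch)) for ch in word))


def get_chinese_word(tokens: List[str]) -> List[str]:
    """Keep only distinct tokens that are longer than one character and fully Chinese."""
    return sorted({tok for tok in tokens if len(tok) > 1 and is_chinese(tok)})


# Hypothetical segmenter output: numbers and single characters are dropped.
words = get_chinese_word(["身高", "180", "神", "身高", "北京"])
print(words)
```

Single characters are excluded because a one-character word has no continuation tokens to mark, so it contributes nothing to the reference file.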

⚠️ DEPRECATED: This file resides in the legacy/ directory and is not actively maintained. Prefer modern equivalents where available.

Usage

Use this script when:

  • Preparing whole-word masking reference files for Chinese BERT or RoBERTa pre-training (e.g., RoBERTa-wwm-ext).
  • Fine-tuning Chinese language models that require aligned word boundary information.
  • The target model was trained with LTP-based word segmentation and the same tokenization convention must be reproduced.

Code Reference

Source Location

examples/NLU/examples/legacy/run_chinese_ref.py (148 lines)

Signature

def _is_chinese_char(cp: int) -> bool: ...
def is_chinese(word: str) -> int: ...
def get_chinese_word(tokens: List[str]) -> list: ...
def add_sub_symbol(bert_tokens: List[str], chinese_word_set: set) -> List[str]: ...
def prepare_ref(lines: List[str], ltp_tokenizer: LTP, bert_tokenizer: BertTokenizer) -> list: ...
def main(args) -> None: ...

Import / CLI Usage

python examples/legacy/run_chinese_ref.py \
    --file_name ./resources/chinese-demo.txt \
    --ltp ./resources/ltp \
    --bert ./resources/robert \
    --save_path ./resources/ref.txt

I/O Contract

Inputs

Input | Type | Description
--file_name | str (file path) | Path to the Chinese text corpus (one sentence per line), same as MLM training data. Default: ./resources/chinese-demo.txt
--ltp | str (path) | Path to LTP model resources for Chinese word segmentation. Default: ./resources/ltp
--bert | str (path) | Path to BERT tokenizer resources (e.g., RoBERTa-wwm-ext vocabulary). Default: ./resources/robert
--save_path | str (file path) | Output path for the reference file. Default: ./resources/ref.txt
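
The flags and defaults listed above map onto a standard `argparse` setup. The sketch below assumes plain string options; the help texts and the `build_parser` name are illustrative, and the parser in the actual script may differ in wording or ordering.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the four documented flags and their defaults."""
    parser = argparse.ArgumentParser(description="Prepare Chinese WWM reference file")
    parser.add_argument("--file_name", type=str, default="./resources/chinese-demo.txt",
                        help="Chinese text corpus, one sentence per line")
    parser.add_argument("--ltp", type=str, default="./resources/ltp",
                        help="LTP resources for word segmentation")
    parser.add_argument("--bert", type=str, default="./resources/robert",
                        help="BERT tokenizer resources")
    parser.add_argument("--save_path", type=str, default="./resources/ref.txt",
                        help="where to write the reference file")
    return parser


# Parsing an empty argument list yields the documented defaults.
args = build_parser().parse_args([])
print(args.file_name, args.ltp, args.bert, args.save_path)
```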

Outputs

Output | Type | Description
Reference file | Text file | One line per input sentence. Each line is a JSON array of integer positions in the BERT token sequence (special tokens included) that hold non-initial characters of Chinese whole words, i.e., positions where the ## prefix was added. Example line: [3, 5, 8]
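
A consumer of the reference file can turn each line back into whole-word groups. The sketch below is illustrative only: it assumes that every listed position continues the word started by the nearest preceding unlisted token, and `group_whole_words` is a hypothetical helper, not part of the script or of any library.

```python
import json
from typing import List


def group_whole_words(num_tokens: int, ref_line: str) -> List[List[int]]:
    """Group token positions so that each listed position joins the previous group."""
    ref = set(json.loads(ref_line))
    groups: List[List[int]] = []
    for pos in range(num_tokens):
        if pos in ref and groups:
            groups[-1].append(pos)  # continuation of the current word
        else:
            groups.append([pos])  # start of a new word or standalone token
    return groups


# Seven tokens ([CLS] ... [SEP]) with reference line "[3, 5]".
groups = group_whole_words(7, "[3, 5]")
print(groups)
```

Each inner list is a set of positions that a whole-word masking step would mask or keep as a unit. Special tokens such as `[CLS]` and `[SEP]` show up as singleton groups here and would normally be excluded from masking.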

Usage Examples

# Generate reference file for Chinese whole-word masking
python examples/legacy/run_chinese_ref.py \
    --file_name /data/chinese_corpus.txt \
    --ltp /models/ltp \
    --bert /models/chinese-roberta-wwm-ext \
    --save_path /data/ref.txt

# The output file will contain one JSON array per line:
# [2, 5, 9]
# [3, 4, 7, 8]
# []
# Each array lists token positions that are internal to Chinese words
# and should be masked together during whole-word masking.
