Implementation:Microsoft LoRA Check Tokenizers

Overview

The check_tokenizers.py script validates that fast (Rust-based) tokenizer implementations produce equivalent outputs to their slow (Python-based) counterparts across all convertible tokenizer classes using the multilingual XNLI dataset.

Description

HuggingFace Transformers provides two tokenizer implementations for most models: a slow Python-based tokenizer and a fast Rust-based tokenizer (from the tokenizers library). This script systematically validates that both implementations produce identical token IDs for the same input text across a diverse multilingual corpus.

Test Setup:

Dynamically discovers all tokenizer pairs from SLOW_TO_FAST_CONVERTERS in transformers.convert_slow_tokenizer.
Loads the XNLI dataset (test + validation splits), which covers 15 languages, providing comprehensive multilingual coverage.
For each tokenizer class, iterates over all registered checkpoint names.

Validation Logic:

test_string(slow, fast, text): Encodes the same text with both tokenizers and compares the resulting token IDs. Tracks three categories: perfect matches, imperfect matches (acceptable divergences), and wrong matches (real errors).
check_diff(spm_diff, tok_diff, slow, fast): Identifies acceptable tokenization differences:
- Reversed order: AAA -> AA+A vs A+AA (both valid segmentations).
- Second-order equivalence: Different subword splits that decode to the same string (e.g., Barr+ich vs Bar+rich).
- Type 3 errors: Cases where re-encoding the slow tokenizer's output through the slow tokenizer yields a different result that matches the fast tokenizer's output.
check_LTR_mark(line, idx, fast): Handles Unicode Right-to-Left mark (\u200f) differences.
check_details(line, spm_ids, tok_ids, slow, fast): Performs detailed diff analysis, finding the first and last divergence points, then attempting subdivision for complex multi-error cases.

The script reports running totals of perfect/imperfect/wrong matches and a final accuracy percentage per checkpoint.

Usage

Use this script when:

Validating that a new fast tokenizer implementation matches its slow counterpart.
Running tokenizer regression tests across the full XNLI multilingual corpus.
Debugging tokenization discrepancies between slow and fast tokenizer versions.

Code Reference

Source Location

examples/NLU/scripts/check_tokenizers.py (169 lines)

Signature

TOKENIZER_CLASSES = {
    name: (slow_class, fast_class)
    for name in SLOW_TO_FAST_CONVERTERS
}

def check_diff(spm_diff: list, tok_diff: list, slow, fast) -> bool: ...
def check_LTR_mark(line: str, idx: int, fast) -> bool: ...
def check_details(line: str, spm_ids: list, tok_ids: list, slow, fast) -> bool: ...
def test_string(slow, fast, text: str) -> None: ...
def test_tokenizer(slow, fast) -> None: ...

Import / CLI Usage

# Run from repository root
python scripts/check_tokenizers.py

I/O Contract

Inputs

Input	Type	Description
XNLI dataset	HuggingFace Dataset	Automatically downloaded; `test + validation` splits covering 15 languages
`SLOW_TO_FAST_CONVERTERS`	dict	Discovered from `transformers.convert_slow_tokenizer` to enumerate all tokenizer pairs
Pretrained tokenizer checkpoints	Remote/Cached	Downloaded via `from_pretrained(checkpoint, force_download=True)`

Outputs

Output	Type	Description
Console output	stdout	Per-checkpoint progress (perfect/imperfect/wrong counts every 10000 examples) and final accuracy
AssertionError	Exception	Raised when a true tokenization mismatch (non-acceptable) is detected

Usage Examples

# Run the full tokenizer equivalence check
python scripts/check_tokenizers.py

# Example output:
# ========================== Checking BertTokenizer: bert-base-uncased ==========================
# (10000 / 0 / 0 ----- 10000)
# (20000 / 3 / 0 ----- 20003)
# ...
# Accuracy 99.98

# ========================== Checking XLNetTokenizer: xlnet-base-cased ==========================
# (10000 / 12 / 0 ----- 10012)
# ...
# Accuracy 99.91

Related Pages

Environment:Microsoft_LoRA_NLU_Conda_Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment