Implementation:Microsoft LoRA Check Tokenizers
Overview
The check_tokenizers.py script validates that fast (Rust-based) tokenizer implementations produce the same outputs as their slow (Python-based) counterparts, across all convertible tokenizer classes, using the multilingual XNLI dataset.
Description
HuggingFace Transformers provides two tokenizer implementations for most models: a slow Python-based tokenizer and a fast Rust-based tokenizer (from the tokenizers library). This script systematically validates that both implementations produce identical token IDs for the same input text across a diverse multilingual corpus.
Test Setup:
- Dynamically discovers all tokenizer pairs from SLOW_TO_FAST_CONVERTERS in transformers.convert_slow_tokenizer.
- Loads the XNLI dataset (test + validation splits), which covers 15 languages, providing comprehensive multilingual coverage.
- For each tokenizer class, iterates over all registered checkpoint names.
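The discovery step can be sketched as follows. This is a minimal illustration, not the script's code: it uses a stand-in namespace and placeholder classes instead of the real transformers module so that it runs without any downloads, and it assumes that each fast class is named by appending "Fast" to the slow class name. The helper discover_pairs is hypothetical.

```python
from types import SimpleNamespace


# Placeholder classes standing in for real tokenizer classes (hypothetical).
class BertTokenizer: ...
class BertTokenizerFast: ...
class XLNetTokenizer: ...
class XLNetTokenizerFast: ...


# Stand-in for the `transformers` module namespace.
fake_transformers = SimpleNamespace(
    BertTokenizer=BertTokenizer,
    BertTokenizerFast=BertTokenizerFast,
    XLNetTokenizer=XLNetTokenizer,
    XLNetTokenizerFast=XLNetTokenizerFast,
)

# Stand-in for SLOW_TO_FAST_CONVERTERS: keys are slow tokenizer class names.
SLOW_TO_FAST_CONVERTERS = {"BertTokenizer": object, "XLNetTokenizer": object}


def discover_pairs(module, converters):
    """Map each slow class name to a (slow_class, fast_class) tuple.

    Assumption: the fast class name is the slow class name plus "Fast".
    Names missing from the module are skipped.
    """
    pairs = {}
    for name in converters:
        slow_cls = getattr(module, name, None)
        fast_cls = getattr(module, name + "Fast", None)
        if slow_cls is not None and fast_cls is not None:
            pairs[name] = (slow_cls, fast_cls)
    return pairs


TOKENIZER_CLASSES = discover_pairs(fake_transformers, SLOW_TO_FAST_CONVERTERS)
```

With the real library, the same pattern would be applied to the transformers module itself, and each class would then be instantiated per checkpoint via from_pretrained.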
Validation Logic:
- test_string(slow, fast, text): Encodes the same text with both tokenizers and compares the resulting token IDs. Tracks three categories: perfect matches, imperfect matches (acceptable divergences), and wrong matches (real errors).
- check_diff(spm_diff, tok_diff, slow, fast): Identifies acceptable tokenization differences:
  - Reversed order: AAA -> AA+A vs A+AA (both valid segmentations).
  - Second-order equivalence: Different subword splits that decode to the same string (e.g., Barr+ich vs Bar+rich).
  - Type 3 errors: Cases where re-encoding the slow tokenizer's output through the slow tokenizer yields a different result that matches the fast tokenizer's output.
- check_LTR_mark(line, idx, fast): Handles Unicode Right-to-Left mark (\u200f) differences.
- check_details(line, spm_ids, tok_ids, slow, fast): Performs detailed diff analysis, finding the first and last divergence points, then attempting subdivision for complex multi-error cases.
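The divergence-window and acceptable-difference ideas can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not the script's implementation: ToyTokenizer, divergence_window, and is_acceptable are hypothetical names, the vocabulary is invented, and only the reversed-order and second-order checks are shown.

```python
class ToyTokenizer:
    """Hypothetical tokenizer over a fixed piece list (token ID = list index)."""

    def __init__(self, pieces):
        self.pieces = list(pieces)

    def decode(self, ids):
        return "".join(self.pieces[i] for i in ids)


def divergence_window(a, b):
    """Trim the common prefix and suffix; return the differing middle slices."""
    shortest = min(len(a), len(b))
    first = 0
    while first < shortest and a[first] == b[first]:
        first += 1
    last = 0
    # Never let the suffix overlap the prefix that was already matched.
    while last < shortest - first and a[-1 - last] == b[-1 - last]:
        last += 1
    return a[first:len(a) - last], b[first:len(b) - last]


def is_acceptable(slow_diff, fast_diff, tok):
    """Return True for the two simplest acceptable divergences."""
    # Reversed order: the same IDs in opposite order (AA+A vs A+AA).
    if slow_diff == list(reversed(fast_diff)):
        return True
    # Second-order equivalence: different splits, same decoded string.
    return tok.decode(slow_diff) == tok.decode(fast_diff)


# Invented vocabulary: A=0, AA=1, Bar=2, Barr=3, rich=4, ich=5, x=6
tok = ToyTokenizer(["A", "AA", "Bar", "Barr", "rich", "ich", "x"])

slow_ids = [6, 3, 5, 6]  # x + Barr + ich + x
fast_ids = [6, 2, 4, 6]  # x + Bar + rich + x
slow_diff, fast_diff = divergence_window(slow_ids, fast_ids)

second_order_ok = is_acceptable(slow_diff, fast_diff, tok)  # Barr+ich vs Bar+rich
reversed_ok = is_acceptable([1, 0], [0, 1], tok)            # AA+A vs A+AA
real_mismatch = is_acceptable([3], [2], tok)                # Barr vs Bar
```

Narrowing the comparison to the differing window keeps the checks local: only the tokens that actually disagree are decoded and compared.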
The script reports running totals of perfect/imperfect/wrong matches and a final accuracy percentage per checkpoint.
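The bookkeeping behind these totals can be sketched as below. MatchTally is a hypothetical helper, and the accuracy definition (perfect matches as a percentage of all examples) is an assumption chosen to be consistent with the example output format, not a confirmed detail of the script.

```python
from dataclasses import dataclass


@dataclass
class MatchTally:
    """Running counts of perfect, imperfect, and wrong matches (hypothetical)."""

    perfect: int = 0
    imperfect: int = 0
    wrong: int = 0

    @property
    def total(self):
        return self.perfect + self.imperfect + self.wrong

    def record(self, slow_ids, fast_ids, acceptable):
        """Classify one example; `acceptable` is the result of the diff checks."""
        if slow_ids == fast_ids:
            self.perfect += 1
        elif acceptable:
            self.imperfect += 1
        else:
            self.wrong += 1

    def progress_line(self):
        return f"({self.perfect} / {self.imperfect} / {self.wrong} ----- {self.total})"

    def accuracy(self):
        """Assumed definition: percentage of perfect matches over all examples."""
        return self.perfect * 100 / self.total if self.total else 0.0


tally = MatchTally()
tally.record([1, 2], [1, 2], acceptable=False)  # identical IDs: perfect
tally.record([1, 2], [2, 1], acceptable=True)   # acceptable divergence: imperfect
tally.record([1], [2], acceptable=False)        # real error: wrong
print(tally.progress_line())
print(f"Accuracy {tally.accuracy():.2f}")
```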
Usage
Use this script when:
- Validating that a new fast tokenizer implementation matches its slow counterpart.
- Running tokenizer regression tests across the full XNLI multilingual corpus.
- Debugging tokenization discrepancies between slow and fast tokenizer versions.
Code Reference
Source Location
examples/NLU/scripts/check_tokenizers.py (169 lines)
Signature
TOKENIZER_CLASSES = {
name: (slow_class, fast_class)
for name in SLOW_TO_FAST_CONVERTERS
}
def check_diff(spm_diff: list, tok_diff: list, slow, fast) -> bool: ...
def check_LTR_mark(line: str, idx: int, fast) -> bool: ...
def check_details(line: str, spm_ids: list, tok_ids: list, slow, fast) -> bool: ...
def test_string(slow, fast, text: str) -> None: ...
def test_tokenizer(slow, fast) -> None: ...
Import / CLI Usage
# Run from repository root
python scripts/check_tokenizers.py
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| XNLI dataset | HuggingFace Dataset | Automatically downloaded; test + validation splits covering 15 languages |
| SLOW_TO_FAST_CONVERTERS | dict | Discovered from transformers.convert_slow_tokenizer to enumerate all tokenizer pairs |
| Pretrained tokenizer checkpoints | Remote/Cached | Downloaded via from_pretrained(checkpoint, force_download=True) |
Outputs
| Output | Type | Description |
|---|---|---|
| Console output | stdout | Per-checkpoint progress (perfect/imperfect/wrong counts every 10000 examples) and final accuracy |
| AssertionError | Exception | Raised when a true tokenization mismatch (non-acceptable) is detected |
Usage Examples
# Run the full tokenizer equivalence check
python scripts/check_tokenizers.py

# Example output:
# ========================== Checking BertTokenizer: bert-base-uncased ==========================
# (10000 / 0 / 0 ----- 10000)
# (20000 / 3 / 0 ----- 20003)
# ...
# Accuracy 99.98
# ========================== Checking XLNetTokenizer: xlnet-base-cased ==========================
# (10000 / 12 / 0 ----- 10012)
# ...
# Accuracy 99.91