Implementation:Microsoft LoRA Check Tokenizers


Overview

The check_tokenizers.py script checks that each fast (Rust-based) tokenizer produces output equivalent to that of its slow (Python-based) counterpart, for every convertible tokenizer class, using the multilingual XNLI dataset as test input.

Description

Hugging Face Transformers provides two tokenizer implementations for most models: a slow Python-based tokenizer and a fast Rust-based tokenizer (from the tokenizers library). This script systematically checks, over a diverse multilingual corpus, that both implementations produce the same token IDs for the same input text, or differ only in ways the script classifies as acceptable.

Test Setup:

  • Dynamically discovers all tokenizer pairs from SLOW_TO_FAST_CONVERTERS in transformers.convert_slow_tokenizer.
  • Loads the XNLI dataset (test and validation splits), whose 15 languages give broad multilingual coverage.
  • For each tokenizer class, iterates over all registered checkpoint names.
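
The setup steps above can be pictured with a small sketch of how one XNLI record fans out into many test strings. The record layout used here (a `premise` dict keyed by language code and a `hypothesis` dict holding a `translation` list) is an assumption about how the dataset is exposed, and `iter_texts` is a hypothetical helper, not a function from the script.

```python
from typing import Dict, Iterator, List

# Stand-in for one XNLI record. The real dataset is assumed to carry one
# premise and one hypothesis per language in a similar shape (assumption).
EXAMPLE_RECORD = {
    "premise": {
        "en": "The cat sleeps.",
        "fr": "Le chat dort.",
        "de": "Die Katze schläft.",
    },
    "hypothesis": {
        "language": ["en", "fr", "de"],
        "translation": ["An animal rests.", "Un animal se repose.", "Ein Tier ruht."],
    },
}


def iter_texts(record: Dict) -> Iterator[str]:
    """Yield every premise and hypothesis string in a record, across all languages."""
    # Premises: one string per language code.
    yield from record["premise"].values()
    # Hypotheses: a parallel list of translations.
    yield from record["hypothesis"]["translation"]


texts: List[str] = list(iter_texts(EXAMPLE_RECORD))
print(len(texts))  # 3 premises + 3 hypotheses
```

Each yielded string would then be encoded once by the slow tokenizer and once by the fast tokenizer, so a single record exercises every language it contains.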

Validation Logic:

  • test_string(slow, fast, text): Encodes the same text with both tokenizers and compares the resulting token IDs. Tracks three categories: perfect matches, imperfect matches (acceptable divergences), and wrong matches (real errors).
  • check_diff(spm_diff, tok_diff, slow, fast): Identifies acceptable tokenization differences:
    • Reversed order: AAA -> AA+A vs A+AA (both valid segmentations).
    • Second-order equivalence: Different subword splits that decode to the same string (e.g., Barr+ich vs Bar+rich).
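    • Type 3 errors: Cases where decoding the slow tokenizer's IDs and re-encoding them with the slow tokenizer does not reproduce those IDs, and the re-encoded result matches the fast tokenizer's output.
  • check_LTR_mark(line, idx, fast): Accepts differences caused by the Unicode Right-to-Left Mark (U+200F, \u200f). Despite the function's name, the character it checks is the right-to-left mark, not the left-to-right mark (U+200E).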
  • check_details(line, spm_ids, tok_ids, slow, fast): Performs detailed diff analysis, finding the first and last divergence points, then attempting subdivision for complex multi-error cases.
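
The checks above can be sketched with a toy vocabulary. This is a simplified illustration, not the script's code: `VOCAB`, `decode`, `diff_window`, `is_acceptable_diff`, and `classify` are hypothetical names, only the first two acceptance rules are modelled, and the Type 3 and right-to-left-mark checks are left out because they need real tokenizers.

```python
from typing import List, Tuple

# Toy vocabulary (assumption): a token's ID is its index in this list.
VOCAB = ["A", "AA", "Barr", "ich", "Bar", "rich", "x", "y"]


def decode(ids: List[int]) -> str:
    """Concatenate token strings (stand-in for a tokenizer's decode)."""
    return "".join(VOCAB[i] for i in ids)


def diff_window(slow_ids: List[int], fast_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Strip the shared prefix and suffix, keeping only the differing spans."""
    limit = min(len(slow_ids), len(fast_ids))
    first = 0
    while first < limit and slow_ids[first] == fast_ids[first]:
        first += 1
    suffix = 0
    # The suffix scan may not reach back into the already-matched prefix.
    while suffix < limit - first and slow_ids[-1 - suffix] == fast_ids[-1 - suffix]:
        suffix += 1
    return slow_ids[first:len(slow_ids) - suffix], fast_ids[first:len(fast_ids) - suffix]


def is_acceptable_diff(slow_diff: List[int], fast_diff: List[int]) -> bool:
    """Apply the two simplest acceptance rules to the differing spans."""
    # Rule 1, reversed order: AA+A vs A+AA.
    if slow_diff == list(reversed(fast_diff)):
        return True
    # Rule 2, different splits that decode to the same string: Barr+ich vs Bar+rich.
    return decode(slow_diff) == decode(fast_diff)


def classify(slow_ids: List[int], fast_ids: List[int]) -> str:
    """Return 'perfect', 'imperfect' (acceptable divergence), or 'wrong'."""
    if slow_ids == fast_ids:
        return "perfect"
    slow_diff, fast_diff = diff_window(slow_ids, fast_ids)
    return "imperfect" if is_acceptable_diff(slow_diff, fast_diff) else "wrong"


print(classify([6, 2, 3, 7], [6, 4, 5, 7]))  # x+Barr+ich+y vs x+Bar+rich+y
```

Narrowing the comparison to the differing window first keeps the acceptance rules local: tokens that already agree on both sides never influence whether a divergence is tolerated.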

The script reports running totals of perfect/imperfect/wrong matches and a final accuracy percentage per checkpoint.
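
The reporting described above amounts to a running tally. The sketch below is an illustration under stated assumptions: `Tally` is a hypothetical class, the report interval is a parameter, and accuracy is assumed to be the share of perfect matches among all strings seen.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Running counts of comparison outcomes for one checkpoint (hypothetical helper)."""

    perfect: int = 0
    imperfect: int = 0
    wrong: int = 0
    report_every: int = 10000

    @property
    def total(self) -> int:
        return self.perfect + self.imperfect + self.wrong

    def record(self, outcome: str) -> None:
        """Count one outcome and print progress at every reporting interval."""
        if outcome not in ("perfect", "imperfect", "wrong"):
            raise ValueError(f"unknown outcome: {outcome!r}")
        setattr(self, outcome, getattr(self, outcome) + 1)
        if self.total % self.report_every == 0:
            print(f"({self.perfect} / {self.imperfect} / {self.wrong} ----- {self.total})")

    def accuracy(self) -> float:
        """Percentage of perfect matches (assumed definition of accuracy)."""
        return self.perfect * 100 / self.total if self.total else 0.0


# A tiny run with a short reporting interval so the progress line is visible.
tally = Tally(report_every=2)
for outcome in ["perfect", "perfect", "imperfect", "perfect"]:
    tally.record(outcome)
print(f"Accuracy {tally.accuracy():.2f}")
```

Under this definition an imperfect match lowers the reported accuracy even though it does not stop the run.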

Usage

Use this script when:

  • Validating that a new fast tokenizer implementation matches its slow counterpart.
  • Running tokenizer regression tests across the full XNLI multilingual corpus.
  • Debugging tokenization discrepancies between slow and fast tokenizer versions.

Code Reference

Source Location

examples/NLU/scripts/check_tokenizers.py (169 lines)

Signature

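import transformers
from transformers.convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS

# Map each convertible tokenizer name to its (slow, fast) class pair.
# Assumes the fast class is named by appending "Fast" to the slow class name.
TOKENIZER_CLASSES = {
    name: (getattr(transformers, name), getattr(transformers, name + "Fast"))
    for name in SLOW_TO_FAST_CONVERTERS
}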

def check_diff(spm_diff: list, tok_diff: list, slow, fast) -> bool: ...
def check_LTR_mark(line: str, idx: int, fast) -> bool: ...
def check_details(line: str, spm_ids: list, tok_ids: list, slow, fast) -> bool: ...
def test_string(slow, fast, text: str) -> None: ...
def test_tokenizer(slow, fast) -> None: ...
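
As a rough illustration of what check_LTR_mark looks for, the sketch below tests whether the token at a divergence index, or the one just before it, covers a lone right-to-left mark. It assumes the fast tokenizer exposes per-token character offsets as `(start, end)` pairs; `touches_rlm` and the hand-written `offsets` list are hypothetical stand-ins, not the script's code.

```python
from typing import List, Optional, Tuple

RLM = "\u200f"  # Unicode RIGHT-TO-LEFT MARK

# A token's character span in the input line, or None for tokens
# (such as special tokens) that map to no input text.
Span = Optional[Tuple[int, int]]


def touches_rlm(line: str, offsets: List[Span], idx: int) -> bool:
    """Return True if token idx, or the token before it, is exactly the RLM character."""
    candidates = [offsets[idx]]
    if idx > 0:
        candidates.append(offsets[idx - 1])
    return any(
        span is not None and line[span[0]:span[1]] == RLM
        for span in candidates
    )


# Hand-written offsets standing in for a fast tokenizer's output (assumption).
line = "ab" + RLM + "cd"
offsets = [(0, 2), (2, 3), (3, 5)]
print(touches_rlm(line, offsets, 2))
```

Looking at the preceding token as well covers the case where the two tokenizers first disagree on the token immediately after the mark.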

Import / CLI Usage

# Run from the repository root (path as given under Source Location)
python examples/NLU/scripts/check_tokenizers.py

I/O Contract

Inputs

| Input | Type | Description |
| XNLI dataset | HuggingFace Dataset | Automatically downloaded; test + validation splits covering 15 languages |
| SLOW_TO_FAST_CONVERTERS | dict | Discovered from transformers.convert_slow_tokenizer to enumerate all tokenizer pairs |
| Pretrained tokenizer checkpoints | Remote/Cached | Downloaded via from_pretrained(checkpoint, force_download=True) |

Outputs

| Output | Type | Description |
| Console output | stdout | Per-checkpoint progress (perfect/imperfect/wrong counts every 10000 examples) and final accuracy |
| AssertionError | Exception | Raised when a true tokenization mismatch (non-acceptable) is detected |

Usage Examples

# Run the full tokenizer equivalence check (from the repository root)
python examples/NLU/scripts/check_tokenizers.py

# Example output (illustrative; progress is printed every 10000 examples,
# so the last number on each line is a multiple of 10000):
# ========================== Checking BertTokenizer: bert-base-uncased ==========================
# (10000 / 0 / 0 ----- 10000)
# (19997 / 3 / 0 ----- 20000)
# ...
# Accuracy 99.98

# ========================== Checking XLNetTokenizer: xlnet-base-cased ==========================
# (9988 / 12 / 0 ----- 10000)
# ...
# Accuracy 99.91
