Implementation:OpenGVLab InternVL InternLM2TokenizerFast
| Knowledge Sources | |
|---|---|
| Domains | Tokenization, Language Model, InternLM2 |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Provides a fast (Rust-backed) tokenizer variant for InternLM2 built on the HuggingFace Tokenizers library. It is intended to tokenize faster than the slow Python tokenizer that wraps SentencePiece, particularly on batched input.
Description
This module contains two classes:
InternLM2Converter extends SpmConverter and converts the SentencePiece BPE model into HuggingFace's Rust-based tokenizer format:
- vocab method extracts the vocabulary with special tokens (unk, bos, eos) at the beginning.
- tokenizer method builds a BPE tokenizer with byte fallback and fused unknown tokens, extracting merges via SentencePieceExtractor.
- normalizer method prepends the SentencePiece word boundary marker (▁, U+2581) to the input and replaces every space with it.
- decoder method sets up a sequence of Replace, ByteFallback, Fuse, and Strip decoders.
- Registered in SLOW_TO_FAST_CONVERTERS for automatic conversion from the slow tokenizer.
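The normalizer and decoder steps listed above can be illustrated in plain Python. This is a minimal sketch, not the converter's code: the function names `normalize` and `decode` are hypothetical, the Rust-backed normalizers and decoders are not used, and the Strip step is assumed to remove a single leading space.

```python
import re

SPM_SPACE = "\u2581"  # the SentencePiece word boundary marker
_BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")  # byte-fallback token shape


def normalize(text: str) -> str:
    """Prepend the marker, then replace every space with it."""
    return (SPM_SPACE + text).replace(" ", SPM_SPACE)


def decode(tokens: list) -> str:
    """Mimic a Replace -> ByteFallback -> Fuse -> Strip decoder chain."""
    pieces = []
    pending = bytearray()  # consecutive byte-fallback tokens collected here
    for tok in tokens:
        match = _BYTE_TOKEN.fullmatch(tok)
        if match:
            # ByteFallback: gather raw bytes until a regular token appears
            pending.append(int(match.group(1), 16))
            continue
        if pending:
            pieces.append(pending.decode("utf-8", errors="replace"))
            pending = bytearray()
        # Replace: turn the marker back into a space
        pieces.append(tok.replace(SPM_SPACE, " "))
    if pending:
        pieces.append(pending.decode("utf-8", errors="replace"))
    # Fuse: join all pieces into a single string
    text = "".join(pieces)
    # Strip (assumed here): drop the one space introduced by the prepended marker
    return text[1:] if text.startswith(" ") else text


normalized = normalize("Hello world")
decoded = decode([SPM_SPACE + "Hello", SPM_SPACE + "world"])
decoded_bytes = decode([SPM_SPACE, "<0xE4>", "<0xBD>", "<0xA0>"])
```

The last call shows why byte fallback matters: a character missing from the vocabulary can still round-trip as a run of byte tokens.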
InternLM2TokenizerFast extends PreTrainedTokenizerFast with:
- Left-side padding (padding_side = 'left').
- BOS/EOS post-processing via update_post_processor which configures template-based processing for single and pair sequences.
- Properties for add_bos_token and add_eos_token with setters that automatically update the post-processor.
- Vocabulary persistence via save_vocabulary that copies the underlying SentencePiece model file.
- Registration with the AutoTokenizer auto class (_auto_class = 'AutoTokenizer') for HuggingFace auto-class integration.
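The interaction between the add_bos_token / add_eos_token properties and the post-processor can be sketched without the Tokenizers library. The helper `build_templates` and the class `TemplateState` below are hypothetical names for illustration; the template strings follow the `$A:0` / `$B:1` syntax accepted by template-based post-processing, and the exact strings produced by the source file may differ.

```python
from typing import Optional, Tuple


def build_templates(bos: Optional[str], eos: Optional[str],
                    add_bos_token: bool, add_eos_token: bool) -> Tuple[str, str]:
    """Return (single, pair) template strings for the given settings."""
    if add_bos_token and bos is None:
        raise ValueError("add_bos_token = True but bos_token = None")
    if add_eos_token and eos is None:
        raise ValueError("add_eos_token = True but eos_token = None")

    # Single sequence: [BOS] $A [EOS], all with type id 0
    single = []
    if add_bos_token:
        single.append(f"{bos}:0")
    single.append("$A:0")
    if add_eos_token:
        single.append(f"{eos}:0")

    # Pair: the single template followed by [BOS] $B [EOS] with type id 1
    pair = list(single)
    if add_bos_token:
        pair.append(f"{bos}:1")
    pair.append("$B:1")
    if add_eos_token:
        pair.append(f"{eos}:1")
    return " ".join(single), " ".join(pair)


class TemplateState:
    """Toy stand-in showing a setter that refreshes the templates."""

    def __init__(self, bos="<s>", eos="</s>", add_bos_token=True, add_eos_token=False):
        self.bos, self.eos = bos, eos
        self._add_bos_token = add_bos_token
        self._add_eos_token = add_eos_token
        self.update_post_processor()

    def update_post_processor(self):
        self.single, self.pair = build_templates(
            self.bos, self.eos, self._add_bos_token, self._add_eos_token)

    @property
    def add_eos_token(self):
        return self._add_eos_token

    @add_eos_token.setter
    def add_eos_token(self, value):
        self._add_eos_token = value
        self.update_post_processor()  # keep templates in sync with the flag


state = TemplateState()
before = state.single
state.add_eos_token = True
after = state.single
```

Routing the flag through a setter means callers cannot change add_eos_token without the templates being rebuilt.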
Usage
Use this fast tokenizer for high-throughput inference scenarios with InternVL. It is automatically selected when loading tokenizers with AutoTokenizer.from_pretrained() if a fast tokenizer is available.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat/internvl/model/internlm2/tokenization_internlm2_fast.py
- Lines: 1-211
Signature
class InternLM2Converter(SpmConverter):
    handle_byte_fallback = True

    def vocab(self, proto) -> list: ...
    def tokenizer(self, proto) -> Tokenizer: ...
    def normalizer(self, proto) -> normalizers.Sequence: ...
    def decoder(self, replacement, add_prefix_space) -> decoders.Sequence: ...

class InternLM2TokenizerFast(PreTrainedTokenizerFast):
    slow_tokenizer_class = InternLM2Tokenizer
    padding_side = 'left'
    _auto_class = 'AutoTokenizer'

    def __init__(self, vocab_file, ...): ...
    def update_post_processor(self): ...
    def save_vocabulary(self, save_directory, filename_prefix=None) -> Tuple[str]: ...
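As a rough illustration of the save_vocabulary contract (copy the SentencePiece model file into a directory and return the written path as a one-element tuple), here is a standalone sketch. The function name `save_vocabulary_sketch`, the constant `VOCAB_FILE_NAME`, and the `prefix-` naming scheme are assumptions for illustration, and error handling in the actual method may differ.

```python
import os
import shutil
from typing import Optional, Tuple

VOCAB_FILE_NAME = "tokenizer.model"  # assumed output file name


def save_vocabulary_sketch(vocab_file: str, save_directory: str,
                           filename_prefix: Optional[str] = None) -> Tuple[str]:
    """Copy the SentencePiece model into save_directory and return its path."""
    if not os.path.isdir(save_directory):
        raise NotADirectoryError(f"{save_directory} should be a directory")

    prefix = f"{filename_prefix}-" if filename_prefix else ""
    out_vocab_file = os.path.join(save_directory, prefix + VOCAB_FILE_NAME)

    # Skip the copy when source and destination are the same file
    if os.path.abspath(vocab_file) != os.path.abspath(out_vocab_file):
        shutil.copyfile(vocab_file, out_vocab_file)
    return (out_vocab_file,)
```

Copying the original model file, rather than re-serializing the fast tokenizer, keeps the saved directory loadable by the slow SentencePiece-based tokenizer as well.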
Import
from internvl.model.internlm2.tokenization_internlm2_fast import InternLM2TokenizerFast
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vocab_file | str | Yes | Path to the SentencePiece model file (tokenizer.model) |
| add_bos_token | bool | No | Whether to prepend BOS token (default: True) |
| add_eos_token | bool | No | Whether to append EOS token (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| tokenizer | InternLM2TokenizerFast | A fast Rust-backed tokenizer instance for encoding/decoding text |
Usage Examples
Basic Usage
from internvl.model.internlm2.tokenization_internlm2_fast import InternLM2TokenizerFast

# Build the fast tokenizer from a SentencePiece model file
tokenizer = InternLM2TokenizerFast(vocab_file='path/to/tokenizer.model')

# Fast batch tokenization; padding=True pads each sequence to the longest one in the batch
tokens = tokenizer(["Hello, world!", "How are you?"], padding=True)
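Because padding_side is 'left', a padded batch places pad tokens before the real tokens. The plain-Python sketch below shows the resulting shape of input_ids and attention_mask; the helper `pad_left`, the token ids, and the pad id are all made-up values for illustration, not output of the real tokenizer.

```python
from typing import Dict, List


def pad_left(batch: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Left-pad every sequence to the length of the longest one."""
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + ids)             # pads go first
        attention_mask.append([0] * n_pad + [1] * len(ids))  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical ids: 1 stands in for BOS, 2 for the pad token
padded = pad_left([[1, 10, 11, 12], [1, 20]], pad_id=2)
```

Left padding keeps the last real token of every sequence in the final column, which is the layout autoregressive generation typically expects.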