Implementation:OpenGVLab InternVL InternLM2TokenizerFast

Knowledge Sources	OpenGVLab_InternVL
Domains	Tokenization, Language Model, InternLM2
Last Updated	2026-02-07 14:00 GMT

Overview

Provides a fast (Rust-backed) tokenizer variant for InternLM2 built on the HuggingFace Tokenizers library. It is intended to tokenize faster than the slow InternLM2Tokenizer, which wraps SentencePiece from Python, particularly on batched inputs.

Description

This module contains two classes:

InternLM2Converter extends SpmConverter and converts the SentencePiece BPE model into HuggingFace's Rust-based tokenizer format:

vocab method extracts the vocabulary with special tokens (unk, bos, eos) at the beginning.
tokenizer method builds a BPE tokenizer with byte fallback and fused unknown tokens, extracting merges via SentencePieceExtractor.
normalizer method prepends the SentencePiece word-boundary marker (▁, U+2581) to the input and replaces every space with it.
decoder method sets up a sequence of Replace, ByteFallback, Fuse, and Strip decoders.
Registered in SLOW_TO_FAST_CONVERTERS for automatic conversion from the slow tokenizer.
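
The normalizer and decoder steps listed above can be illustrated in plain Python. This is a minimal sketch of the described behavior, not the Rust implementation: the function names normalize and decode are hypothetical, and it assumes that byte-fallback tokens are spelled <0xNN> and that the Strip step removes a single leading space.

```python
import re
from typing import List

SPM_MARKER = "\u2581"  # the SentencePiece word-boundary marker
_BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def normalize(text: str) -> str:
    """Prepend the marker, then replace every space with it."""
    if not text:
        return text
    return SPM_MARKER + text.replace(" ", SPM_MARKER)


def decode(tokens: List[str]) -> str:
    """Mimic the Replace -> ByteFallback -> Fuse -> Strip sequence."""
    pieces: List[str] = []
    pending = bytearray()  # consecutive byte-fallback tokens are merged

    def flush() -> None:
        # Decode the collected raw bytes as UTF-8 (assumed behavior).
        if pending:
            pieces.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for token in tokens:
        match = _BYTE_TOKEN.fullmatch(token)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            pieces.append(token.replace(SPM_MARKER, " "))  # Replace
    flush()

    text = "".join(pieces)  # Fuse
    return text[1:] if text.startswith(" ") else text  # Strip one leading space


print(normalize("Hello world"))
print(decode([SPM_MARKER + "Hello", ",", SPM_MARKER + "world"]))
```

The round trip shows why the decoder strips one leading space: the normalizer always adds a marker at the start, so decoding would otherwise return text with an extra space in front.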

InternLM2TokenizerFast extends PreTrainedTokenizerFast with:

Left-side padding (padding_side = 'left').
BOS/EOS post-processing via update_post_processor which configures template-based processing for single and pair sequences.
Properties for add_bos_token and add_eos_token with setters that automatically update the post-processor.
Vocabulary persistence via save_vocabulary that copies the underlying SentencePiece model file.
Registered under the AutoTokenizer auto class (_auto_class = 'AutoTokenizer') for HuggingFace auto-class integration.
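
To make the BOS/EOS post-processing concrete, the sketch below builds the kind of template strings that the Tokenizers library's TemplateProcessing post-processor accepts for single and pair sequences. It is an illustration under assumptions rather than the module's code: the helper name build_templates is hypothetical, and the exact template layout is assumed to follow the usual `$A:0` / `$B:1` convention.

```python
from typing import Optional, Tuple


def build_templates(
    bos: Optional[str],
    eos: Optional[str],
    add_bos_token: bool,
    add_eos_token: bool,
) -> Tuple[str, str]:
    """Return (single, pair) template strings for the given flags."""
    if add_bos_token and bos is None:
        raise ValueError("add_bos_token is True but no BOS token is set")
    if add_eos_token and eos is None:
        raise ValueError("add_eos_token is True but no EOS token is set")

    # ":0" and ":1" are the segment (type) ids of the first and second sequence.
    bos_0 = f"{bos}:0 " if add_bos_token else ""
    eos_0 = f" {eos}:0" if add_eos_token else ""
    single = f"{bos_0}$A:0{eos_0}"

    bos_1 = f" {bos}:1" if add_bos_token else ""
    eos_1 = f" {eos}:1" if add_eos_token else ""
    pair = f"{single}{bos_1} $B:1{eos_1}"
    return single, pair


single, pair = build_templates("<s>", "</s>", add_bos_token=True, add_eos_token=False)
print(single)
print(pair)
```

Because the templates depend only on the two flags and the token strings, rebuilding them whenever add_bos_token or add_eos_token changes keeps the post-processor in sync with the properties.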

Usage

Use this fast tokenizer for high-throughput inference scenarios with InternVL. AutoTokenizer.from_pretrained() prefers a fast tokenizer by default (use_fast=True), so this class is selected automatically when the checkpoint makes a fast tokenizer available; otherwise the slow tokenizer is loaded.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat/internvl/model/internlm2/tokenization_internlm2_fast.py
Lines: 1-211

Signature

class InternLM2Converter(SpmConverter):
    handle_byte_fallback = True
    def vocab(self, proto) -> list: ...
    def tokenizer(self, proto) -> Tokenizer: ...
    def normalizer(self, proto) -> normalizers.Sequence: ...
    def decoder(self, replacement, add_prefix_space) -> decoders.Sequence: ...

class InternLM2TokenizerFast(PreTrainedTokenizerFast):
    slow_tokenizer_class = InternLM2Tokenizer
    padding_side = 'left'
    _auto_class = 'AutoTokenizer'

    def __init__(self, vocab_file, ...): ...
    def update_post_processor(self): ...
    def save_vocabulary(self, save_directory, filename_prefix=None) -> Tuple[str]: ...
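
The save_vocabulary signature above returns a tuple of file paths. A minimal standalone sketch of the copy-the-model-file behavior follows. It is written as a free function with the hypothetical name save_vocabulary_sketch, and the output file name tokenizer.model and the prefix-with-hyphen convention are assumptions, not confirmed details of the module.

```python
import os
import shutil
import tempfile
from typing import Optional, Tuple


def save_vocabulary_sketch(
    vocab_file: str,
    save_directory: str,
    filename_prefix: Optional[str] = None,
    target_name: str = "tokenizer.model",  # assumed output file name
) -> Tuple[str]:
    """Copy the SentencePiece model file into save_directory."""
    if not os.path.isdir(save_directory):
        raise NotADirectoryError(f"{save_directory} is not a directory")

    prefix = f"{filename_prefix}-" if filename_prefix else ""
    out_path = os.path.join(save_directory, prefix + target_name)

    # Skip the copy when source and destination are the same file.
    if os.path.abspath(vocab_file) != os.path.abspath(out_path):
        shutil.copyfile(vocab_file, out_path)
    return (out_path,)


# Demonstration with a dummy file standing in for a real model.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.model")
    with open(src, "wb") as fh:
        fh.write(b"dummy")
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)

    saved = save_vocabulary_sketch(src, out_dir, filename_prefix="demo")
    saved_name = os.path.basename(saved[0])
    with open(saved[0], "rb") as fh:
        copied_ok = fh.read() == b"dummy"
```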

Import

from internvl.model.internlm2.tokenization_internlm2_fast import InternLM2TokenizerFast

I/O Contract

Inputs

Name	Type	Required	Description
vocab_file	str	Yes	Path to the SentencePiece model file (tokenizer.model)
add_bos_token	bool	No	Whether to prepend BOS token (default: True)
add_eos_token	bool	No	Whether to append EOS token (default: False)

Outputs

Name	Type	Description
tokenizer	InternLM2TokenizerFast	A fast Rust-backed tokenizer instance for encoding/decoding text

Usage Examples

Basic Usage

from internvl.model.internlm2.tokenization_internlm2_fast import InternLM2TokenizerFast

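# Build the fast tokenizer from the SentencePiece model file.
tokenizer = InternLM2TokenizerFast(vocab_file='path/to/tokenizer.model')

# Batch tokenization; shorter sequences are padded on the left (padding_side = 'left').
tokens = tokenizer(["Hello, world!", "How are you?"], padding=True)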
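
The effect of left-side padding on a batch can be sketched without the tokenizer itself. The helper pad_left below is hypothetical and the token ids are made up for illustration; it only shows the shape of the input_ids and attention_mask that left padding produces.

```python
from typing import Dict, List


def pad_left(batch: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Pad every sequence on the left to the length of the longest one."""
    width = max((len(ids) for ids in batch), default=0)
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for ids in batch:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + list(ids))
        # 0 marks padding positions, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up ids: 1 stands in for BOS, 2 for the padding token.
padded = pad_left([[1, 10, 11, 12], [1, 20]], pad_id=2)
print(padded["input_ids"])
print(padded["attention_mask"])
```

With left padding, the last position of every row holds a real token, which is convenient for decoder-only generation where new tokens are appended at the end of each sequence.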

Related Pages

Principle:OpenGVLab_InternVL_SentencePiece_BPE_Tokenization
