Implementation:OpenGVLab InternVL InternLM2TokenizerFast


Knowledge Sources
Domains: Tokenization, Language Model, InternLM2
Last Updated: 2026-02-07 14:00 GMT

Overview

Provides a fast (Rust-backed) tokenizer variant for InternLM2, built on the HuggingFace Tokenizers library. Because encoding and decoding run in Rust, it is typically faster than the slow SentencePiece-based tokenizer (InternLM2Tokenizer), particularly on batched input.

Description

This module contains two classes:

InternLM2Converter extends SpmConverter and converts the SentencePiece BPE model into HuggingFace's Rust-based tokenizer format:

  • vocab method extracts the vocabulary with special tokens (unk, bos, eos) at the beginning.
  • tokenizer method builds a BPE tokenizer with byte fallback and fused unknown tokens, extracting merges via SentencePieceExtractor.
  • normalizer method prepends the SentencePiece word boundary marker (▁, U+2581) to the input and replaces every space with it.
  • decoder method sets up a sequence of Replace, ByteFallback, Fuse, and Strip decoders.
  • Registered in SLOW_TO_FAST_CONVERTERS for automatic conversion from the slow tokenizer.
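
The normalizer and decoder steps listed above can be illustrated with a small pure-Python emulation. This is a sketch of the behavior only, not the Tokenizers API: the helper names normalize and decode are hypothetical, and invalid byte sequences are handled more loosely here than in the real ByteFallback decoder.

```python
import re
from typing import List

MARKER = "\u2581"  # SentencePiece word-boundary marker

# Byte-fallback tokens look like "<0xC3>" (one token per raw byte).
_BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def normalize(text: str) -> str:
    """Emulate the normalizer: prepend the marker, then replace spaces with it."""
    return MARKER + text.replace(" ", MARKER)


def decode(tokens: List[str]) -> str:
    """Emulate the decoder chain: Replace, ByteFallback, Fuse, Strip."""
    pieces: List[str] = []
    pending = bytearray()  # consecutive byte-fallback tokens awaiting UTF-8 decoding

    def flush() -> None:
        if pending:
            pieces.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for token in tokens:
        match = _BYTE_TOKEN.match(token)
        if match:
            # ByteFallback: collect the raw byte encoded in the token text.
            pending.append(int(match.group(1), 16))
        else:
            flush()
            # Replace: turn the marker back into a plain space.
            pieces.append(token.replace(MARKER, " "))
    flush()

    text = "".join(pieces)  # Fuse: join all pieces into one string
    # Strip: drop the single leading space introduced by the prepended marker.
    return text[1:] if text.startswith(" ") else text


print(normalize("Hello world"))
print(decode([MARKER + "Hello", MARKER + "world"]))
```

The round trip shows why the decoder needs the final Strip step: the marker prepended during normalization would otherwise surface as a leading space in the decoded text.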

InternLM2TokenizerFast extends PreTrainedTokenizerFast with:

  • Left-side padding (padding_side = 'left').
  • BOS/EOS post-processing via update_post_processor which configures template-based processing for single and pair sequences.
  • Properties for add_bos_token and add_eos_token with setters that automatically update the post-processor.
  • Vocabulary persistence via save_vocabulary that copies the underlying SentencePiece model file.
  • Registered under the AutoTokenizer auto class (_auto_class = 'AutoTokenizer') for HuggingFace auto-class integration.
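
To make the BOS/EOS post-processing more concrete, the sketch below builds template strings of the kind accepted by a template-based post-processor, where $A and $B stand for the first and second sequence and the :0 / :1 suffixes are type IDs. The helper build_templates is hypothetical, and its layout is modeled on the pattern used by Llama-style fast tokenizers; whether update_post_processor produces exactly these strings is an assumption.

```python
from typing import Tuple


def build_templates(bos: str, eos: str, add_bos: bool, add_eos: bool) -> Tuple[str, str]:
    """Return (single, pair) template strings for the given BOS/EOS settings."""
    # Single sequence: optional BOS, the sequence itself, optional EOS.
    single = f"{(bos + ':0 ') if add_bos else ''}$A:0{(' ' + eos + ':0') if add_eos else ''}"
    # Pair: the single template followed by the second sequence with type ID 1.
    pair = f"{single}{(' ' + bos + ':1') if add_bos else ''} $B:1{(' ' + eos + ':1') if add_eos else ''}"
    return single, pair


# Documented defaults: add_bos_token=True, add_eos_token=False.
single, pair = build_templates("<s>", "</s>", add_bos=True, add_eos=False)
print(single)  # <s>:0 $A:0
print(pair)    # <s>:0 $A:0 <s>:1 $B:1
```

Because the templates depend on both flags, changing add_bos_token or add_eos_token has to rebuild the post-processor, which is what the property setters are described as doing.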

Usage

Use this fast tokenizer for high-throughput inference scenarios with InternVL. AutoTokenizer.from_pretrained() selects it automatically when the checkpoint provides a fast tokenizer; pass use_fast=False to request the slow tokenizer instead.
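
A minimal sketch of loading through the auto class follows. The function name load_fast_tokenizer and the checkpoint argument are hypothetical, and the sketch assumes the checkpoint ships the tokenizer code remotely (hence trust_remote_code=True). Whether a given InternVL checkpoint actually exposes the fast class is not guaranteed, so the result is checked with is_fast.

```python
def load_fast_tokenizer(checkpoint: str):
    """Load a tokenizer via AutoTokenizer and report whether it is the fast variant."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        checkpoint,
        trust_remote_code=True,  # assumption: tokenizer classes live in the checkpoint repo
        use_fast=True,           # prefer the Rust-backed tokenizer when one is available
    )
    if not tokenizer.is_fast:
        print("Checkpoint resolved to the slow tokenizer")
    return tokenizer
```

Calling load_fast_tokenizer with a local checkpoint directory or a hub ID returns whichever tokenizer the checkpoint's configuration maps to.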

Code Reference

Source Location

Signature

class InternLM2Converter(SpmConverter):
    handle_byte_fallback = True
    def vocab(self, proto) -> list: ...
    def tokenizer(self, proto) -> Tokenizer: ...
    def normalizer(self, proto) -> normalizers.Sequence: ...
    def decoder(self, replacement, add_prefix_space) -> decoders.Sequence: ...

class InternLM2TokenizerFast(PreTrainedTokenizerFast):
    slow_tokenizer_class = InternLM2Tokenizer
    padding_side = 'left'
    _auto_class = 'AutoTokenizer'

    def __init__(self, vocab_file, ...): ...
    def update_post_processor(self): ...
    def save_vocabulary(self, save_directory, filename_prefix=None) -> Tuple[str]: ...

Import

from internvl.model.internlm2.tokenization_internlm2_fast import InternLM2TokenizerFast

I/O Contract

Inputs

Name | Type | Required | Description
vocab_file | str | Yes | Path to the SentencePiece model file (tokenizer.model)
add_bos_token | bool | No | Whether to prepend BOS token (default: True)
add_eos_token | bool | No | Whether to append EOS token (default: False)

Outputs

Name | Type | Description
tokenizer | InternLM2TokenizerFast | A fast Rust-backed tokenizer instance for encoding/decoding text

Usage Examples

Basic Usage

from internvl.model.internlm2.tokenization_internlm2_fast import InternLM2TokenizerFast

tokenizer = InternLM2TokenizerFast(vocab_file='path/to/tokenizer.model')

# Fast batch tokenization
tokens = tokenizer(["Hello, world!", "How are you?"], padding=True)
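
Since padding_side is 'left', a padded batch places the pad tokens before each shorter sequence. The pure-Python sketch below shows the resulting shape of input_ids and attention_mask; the helper pad_left and the pad ID value are hypothetical and stand in for what the tokenizer does internally.

```python
from typing import Dict, List


def pad_left(batch: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Left-pad every sequence in a non-empty batch to the longest length."""
    width = max(len(ids) for ids in batch)
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for ids in batch:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + ids)             # padding goes first
        attention_mask.append([0] * n_pad + [1] * len(ids))  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs here are made up for illustration; pad_id=2 is an assumed value.
padded = pad_left([[1, 5, 6], [1, 7]], pad_id=2)
print(padded["input_ids"])       # [[1, 5, 6], [2, 1, 7]]
print(padded["attention_mask"])  # [[1, 1, 1], [0, 1, 1]]
```

Left padding keeps the real tokens at the end of each row, which is the layout decoder-only models generally expect when generating from a batch.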
