Implementation:OpenGVLab InternVL InternLM2Tokenizer

Knowledge Sources	OpenGVLab_InternVL
Domains	Tokenization, Language Model, InternLM2
Last Updated	2026-02-07 14:00 GMT

Overview

Implements the SentencePiece-based tokenizer for the InternLM2 model, handling text-to-token and token-to-text conversions with BOS/EOS special token management.

Description

InternLM2Tokenizer extends HuggingFace's PreTrainedTokenizer using a SentencePiece BPE model. Key features include:

Vocabulary management -- Loads a SentencePiece model file (tokenizer.model) and exposes vocab size, BOS/EOS token IDs, and full vocabulary via get_vocab().
Tokenization -- The _tokenize method calls SentencePiece's encode with out_type=str, so it returns subword pieces as strings rather than IDs. _convert_token_to_id and _convert_id_to_token map between those pieces and their integer IDs in both directions.
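Special token handling -- build_inputs_with_special_tokens prepends the BOS token when add_bos_token is set and appends the EOS token when add_eos_token is set. get_special_tokens_mask returns a binary mask (1 for special tokens, 0 for sequence tokens) marking the special token positions.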
Prefix space handling -- The no_prefix_space_tokens property identifies tokens that do not start with the SentencePiece word boundary marker (▁, U+2581, which is not an ASCII underscore). _maybe_add_prefix_space uses it during detokenization.
Detokenization -- convert_tokens_to_string handles special tokens separately from regular tokens, using SentencePiece decode for regular sub-tokens and string concatenation for special tokens.
Vocabulary persistence -- save_vocabulary copies or serializes the SentencePiece model to the target directory.
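
The prefix space handling described above can be illustrated with a minimal standalone sketch. It does not use the repository's code: the toy vocabulary, the free functions, and the choice to store token strings in the set are assumptions made for illustration only.

```python
# Simplified, standalone sketch; not the repository's implementation.
WORD_BOUNDARY = "\u2581"  # SentencePiece word boundary marker (U+2581)


def no_prefix_space_tokens(vocab):
    """Collect the tokens that do not begin with the word boundary marker."""
    return {tok for tok in vocab if not tok.startswith(WORD_BOUNDARY)}


def maybe_add_prefix_space(tokens, decoded, no_prefix):
    """Prepend a space unless the first token is known to lack the marker."""
    if tokens and tokens[0] not in no_prefix:
        return " " + decoded
    return decoded


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = ["<s>", "</s>", "\u2581Hello", "\u2581world", "lo", "!"]
no_prefix = no_prefix_space_tokens(toy_vocab)

# First token carries the marker, so a leading space is restored.
with_marker = maybe_add_prefix_space(["\u2581Hello", "!"], "Hello!", no_prefix)
# First token has no marker, so the decoded text is returned unchanged.
without_marker = maybe_add_prefix_space(["lo", "!"], "lo!", no_prefix)
```

The point of the check is that a decoder which drops the leading marker loses the information that the text began at a word boundary; the helper puts that space back only when the first token actually carried the marker.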

The class sets _auto_class = 'AutoTokenizer', which registers it for HuggingFace auto-class integration.
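
Returning to the detokenization item in the feature list, the idea of decoding regular sub-tokens in runs while concatenating special tokens verbatim can be sketched as follows. This is a simplified illustration, not the repository's convert_tokens_to_string: toy_decode is a hypothetical stand-in for SentencePiece decoding, and the spacing and cleanup rules of the real method are omitted.

```python
# Simplified, standalone sketch; not the repository's implementation.
WORD_BOUNDARY = "\u2581"          # SentencePiece word boundary marker
SPECIAL_TOKENS = {"<s>", "</s>"}  # assumed special tokens for this example


def toy_decode(pieces):
    """Stand-in for SentencePiece decode: join pieces, markers become spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").lstrip(" ")


def tokens_to_string(tokens):
    """Decode runs of regular pieces; append special tokens as plain strings."""
    out = ""
    pending = []  # regular pieces waiting to be decoded together
    for token in tokens:
        if token in SPECIAL_TOKENS:
            # Flush the pending run, then add the special token untouched.
            out += toy_decode(pending) + token
            pending = []
        else:
            pending.append(token)
    return out + toy_decode(pending)


decoded = tokens_to_string(["<s>", "\u2581Hello", "\u2581world", "!", "</s>"])
plain = tokens_to_string(["\u2581Hi"])
```

Keeping special tokens out of the subword decoder avoids having strings such as `<s>` split or altered by the SentencePiece model.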

Usage

Use this tokenizer when working with InternLM2-based InternVL models to convert text inputs into token IDs for the language model and to decode model outputs back to human-readable text.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat/internvl/model/internlm2/tokenization_internlm2.py
Lines: 1-235

Signature

class InternLM2Tokenizer(PreTrainedTokenizer):
    _auto_class = 'AutoTokenizer'

    def __init__(self, vocab_file, unk_token='<unk>', bos_token='<s>',
                 eos_token='</s>', pad_token='</s>',
                 sp_model_kwargs=None, add_bos_token=True,
                 add_eos_token=False, ...): ...

    @property
    def vocab_size(self) -> int: ...
    def get_vocab(self) -> dict: ...
    def _tokenize(self, text) -> list: ...
    def convert_tokens_to_string(self, tokens) -> str: ...
    def save_vocabulary(self, save_directory, filename_prefix=None) -> Tuple[str]: ...
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> list: ...

Import

from internvl.model.internlm2.tokenization_internlm2 import InternLM2Tokenizer

I/O Contract

Inputs

Name	Type	Required	Description
vocab_file	str	Yes	Path to the SentencePiece model file (tokenizer.model)
add_bos_token	bool	No	Whether to prepend BOS token (default: True)
add_eos_token	bool	No	Whether to append EOS token (default: False)
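
The effect of the two flags in the table above can be sketched with a small standalone function. This is an illustration under stated assumptions, not the class's own method: the BOS and EOS IDs are hypothetical, only the single-sequence case is covered, and the mask is derived from ID values rather than from positions as a real tokenizer would do.

```python
# Simplified, standalone sketch; not the repository's implementation.
BOS_ID = 1  # hypothetical ID for '<s>'
EOS_ID = 2  # hypothetical ID for '</s>'


def build_inputs(token_ids, add_bos_token=True, add_eos_token=False):
    """Wrap a single sequence of IDs with BOS/EOS according to the flags."""
    output = [BOS_ID] if add_bos_token else []
    output += list(token_ids)
    if add_eos_token:
        output.append(EOS_ID)
    return output


def special_tokens_mask(ids):
    """Return 1 at positions holding BOS or EOS, 0 elsewhere."""
    return [1 if i in (BOS_ID, EOS_ID) else 0 for i in ids]


default_ids = build_inputs([10, 11])                       # defaults: BOS only
both_ids = build_inputs([10, 11], add_eos_token=True)      # BOS and EOS
bare_ids = build_inputs([10, 11], add_bos_token=False)     # no special tokens
mask = special_tokens_mask(both_ids)
```

With the documented defaults (add_bos_token=True, add_eos_token=False), only the BOS token is added, which suits generation where the model is expected to continue the sequence.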

Outputs

Name	Type	Description
tokenizer	InternLM2Tokenizer	A tokenizer instance for encoding/decoding text

Usage Examples

Basic Usage

from internvl.model.internlm2.tokenization_internlm2 import InternLM2Tokenizer

tokenizer = InternLM2Tokenizer(vocab_file='path/to/tokenizer.model')

# Encode text
token_ids = tokenizer.encode("Hello, world!")

# Decode back to text
text = tokenizer.decode(token_ids)

Related Pages

Principle:OpenGVLab_InternVL_SentencePiece_BPE_Tokenization
