Implementation:OpenGVLab InternVL InternLM2Tokenizer
| Knowledge Sources | |
|---|---|
| Domains | Tokenization, Language Model, InternLM2 |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Implements the SentencePiece-based tokenizer for the InternLM2 model, handling text-to-token and token-to-text conversions with BOS/EOS special token management.
Description
InternLM2Tokenizer extends HuggingFace's PreTrainedTokenizer using a SentencePiece BPE model. Key features include:
- Vocabulary management -- Loads a SentencePiece model file (tokenizer.model) and exposes vocab size, BOS/EOS token IDs, and full vocabulary via get_vocab().
- Tokenization -- The _tokenize method calls the SentencePiece model's encode with out_type=str, so it returns string pieces rather than IDs. _convert_token_to_id and _convert_id_to_token map between pieces and IDs in both directions.
- Special token handling -- build_inputs_with_special_tokens prepends BOS when add_bos_token is set and appends EOS when add_eos_token is set. get_special_tokens_mask returns a binary mask indicating special token positions.
- Prefix space handling -- The no_prefix_space_tokens property identifies tokens that do not start with the SentencePiece word boundary marker (▁, U+2581), used by _maybe_add_prefix_space during detokenization.
- Detokenization -- convert_tokens_to_string handles special tokens separately from regular tokens, using SentencePiece decode for regular sub-tokens and string concatenation for special tokens.
- Vocabulary persistence -- save_vocabulary copies or serializes the SentencePiece model to the target directory.
The class sets _auto_class = 'AutoTokenizer', which registers it for HuggingFace auto-class integration.
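The detokenization idea described above (regular pieces decoded together, special tokens concatenated as-is) can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual code: decode_pieces is a hypothetical stand-in for SentencePiece decoding, tokens_to_string is an assumed name, and the real method's cleanup and prefix-space adjustments are left out.

```python
WORD_BOUNDARY = "\u2581"  # SentencePiece word-boundary marker


def decode_pieces(pieces):
    """Hypothetical stand-in for SentencePiece decoding of regular pieces."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # Drop the single leading space produced by the first word-boundary marker.
    return text[1:] if text.startswith(" ") else text


def tokens_to_string(tokens, special_tokens):
    """Decode runs of regular pieces; append special tokens verbatim."""
    out = []
    buffer = []
    for token in tokens:
        if token in special_tokens:
            # Flush the regular pieces collected so far, then add the special token.
            out.append(decode_pieces(buffer))
            out.append(token)
            buffer = []
        else:
            buffer.append(token)
    out.append(decode_pieces(buffer))
    return "".join(out)


tokens = ["<s>", WORD_BOUNDARY + "Hello", ",", WORD_BOUNDARY + "world", "!", "</s>"]
result = tokens_to_string(tokens, {"<s>", "</s>"})
print(result)
```

Keeping special tokens out of the SentencePiece decode step means strings such as <s> are never reinterpreted as ordinary pieces.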
Usage
Use this tokenizer when working with InternLM2-based InternVL models to convert text inputs into token IDs for the language model and to decode model outputs back to human-readable text.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat/internvl/model/internlm2/tokenization_internlm2.py
- Lines: 1-235
Signature
class InternLM2Tokenizer(PreTrainedTokenizer):
    _auto_class = 'AutoTokenizer'

    def __init__(self, vocab_file, unk_token='<unk>', bos_token='<s>',
                 eos_token='</s>', pad_token='</s>',
                 sp_model_kwargs=None, add_bos_token=True,
                 add_eos_token=False, ...): ...

    @property
    def vocab_size(self) -> int: ...
    def get_vocab(self) -> dict: ...
    def _tokenize(self, text) -> list: ...
    def convert_tokens_to_string(self, tokens) -> str: ...
    def save_vocabulary(self, save_directory, filename_prefix=None) -> Tuple[str]: ...
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> list: ...
Import
from internvl.model.internlm2.tokenization_internlm2 import InternLM2Tokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vocab_file | str | Yes | Path to the SentencePiece model file (tokenizer.model) |
| add_bos_token | bool | No | Whether to prepend BOS token (default: True) |
| add_eos_token | bool | No | Whether to append EOS token (default: False) |
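The effect of the two flags in the table can be illustrated without loading a model. The sketch below is a standalone approximation under stated assumptions: build_inputs and special_tokens_mask are hypothetical helper names, and the BOS/EOS IDs (1 and 2) and the token IDs are made-up values, not read from any tokenizer.model. The mask here is derived from the assembled sequence; the tokenizer's own get_special_tokens_mask may compute it differently.

```python
def build_inputs(token_ids, bos_id, eos_id, add_bos_token=True, add_eos_token=False):
    """Prepend BOS and/or append EOS according to the two flags."""
    prefix = [bos_id] if add_bos_token else []
    suffix = [eos_id] if add_eos_token else []
    return prefix + list(token_ids) + suffix


def special_tokens_mask(input_ids, special_ids):
    """Return 1 at positions holding a special token ID, 0 elsewhere."""
    return [1 if token_id in special_ids else 0 for token_id in input_ids]


BOS_ID, EOS_ID = 1, 2      # assumed IDs, for illustration only
ids = [101, 102, 103]      # made-up token IDs

default_inputs = build_inputs(ids, BOS_ID, EOS_ID)                    # defaults: BOS only
both_inputs = build_inputs(ids, BOS_ID, EOS_ID, add_eos_token=True)   # BOS and EOS
both_mask = special_tokens_mask(both_inputs, {BOS_ID, EOS_ID})

print(default_inputs)
print(both_inputs)
print(both_mask)
```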
Outputs
| Name | Type | Description |
|---|---|---|
| tokenizer | InternLM2Tokenizer | A tokenizer instance for encoding/decoding text |
Usage Examples
Basic Usage
from internvl.model.internlm2.tokenization_internlm2 import InternLM2Tokenizer
tokenizer = InternLM2Tokenizer(vocab_file='path/to/tokenizer.model')
# Encode text
token_ids = tokenizer.encode("Hello, world!")
# Decode back to text
text = tokenizer.decode(token_ids)
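When the decoded text should not contain BOS/EOS strings, the standard HuggingFace encode/decode keyword arguments can be used. The helper below is a small sketch; encode_decode is a hypothetical name, and it assumes the object passed in follows the PreTrainedTokenizer interface (add_special_tokens on encode, skip_special_tokens on decode). The decoded text is not guaranteed to match the input exactly, since SentencePiece may normalize whitespace or characters.

```python
def encode_decode(tokenizer, text):
    """Encode without BOS/EOS, then decode while dropping special tokens."""
    # add_special_tokens=False skips build_inputs_with_special_tokens.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    # skip_special_tokens=True removes any special tokens from the output string.
    decoded = tokenizer.decode(token_ids, skip_special_tokens=True)
    return token_ids, decoded
```

With a loaded tokenizer this would be called as encode_decode(tokenizer, "Hello, world!").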