

Implementation:OpenGVLab InternVL InternLM2Tokenizer

From Leeroopedia
Revision as of 16:14, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/OpenGVLab_InternVL_InternLM2Tokenizer.md)


Knowledge Sources
Domains Tokenization, Language Model, InternLM2
Last Updated 2026-02-07 14:00 GMT

Overview

Implements the SentencePiece-based tokenizer for the InternLM2 model, handling text-to-token and token-to-text conversions with BOS/EOS special token management.

Description

InternLM2Tokenizer extends HuggingFace's PreTrainedTokenizer using a SentencePiece BPE model. Key features include:

  • Vocabulary management -- Loads a SentencePiece model file (tokenizer.model) and exposes vocab size, BOS/EOS token IDs, and full vocabulary via get_vocab().
  • Tokenization -- _tokenize encodes text with SentencePiece's encode method using out_type=str, so it returns sub-token strings rather than IDs. _convert_token_to_id and _convert_id_to_token map between tokens and IDs in each direction.
  • Special token handling -- build_inputs_with_special_tokens optionally prepends BOS and appends EOS tokens. get_special_tokens_mask returns a binary mask indicating special token positions.
  • Prefix space handling -- The no_prefix_space_tokens property identifies tokens that do not start with the SentencePiece word-boundary marker (▁, U+2581). _maybe_add_prefix_space uses it during detokenization to decide whether to prepend a space to the decoded string.
  • Detokenization -- convert_tokens_to_string handles special tokens separately from regular tokens, using SentencePiece decode for regular sub-tokens and string concatenation for special tokens.
  • Vocabulary persistence -- save_vocabulary copies or serializes the SentencePiece model to the target directory.
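
The detokenization and prefix-space bullets can be illustrated without a real SentencePiece model. The following is a simplified sketch, not the repository's code: `toy_decode` is a hypothetical stand-in for SentencePiece decoding that only maps the ▁ marker back to spaces, `SPECIAL_TOKENS` is an assumed set based on the default arguments in the signature, and the helper names are invented for illustration.

```python
# Illustrative sketch only; toy_decode stands in for SentencePiece decoding.
WORD_BOUNDARY = "\u2581"  # the SentencePiece word-boundary marker
SPECIAL_TOKENS = {"<s>", "</s>", "<unk>"}  # assumed from the default arguments


def toy_decode(pieces):
    """Join sub-tokens and turn the boundary marker back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ")


def tokens_to_string(tokens):
    """Decode runs of regular pieces; concatenate special tokens verbatim."""
    out, pending = "", []
    for token in tokens:
        if token in SPECIAL_TOKENS:
            # Flush the regular pieces collected so far, then append the
            # special token as a plain string.
            out += toy_decode(pending) + token
            pending = []
        else:
            pending.append(token)
    out += toy_decode(pending)
    return out


def lacks_prefix_space(token):
    """True for pieces that do not begin with the word-boundary marker."""
    return not token.startswith(WORD_BOUNDARY)
```

The point of the split is that special tokens such as `<s>` never pass through the sub-token decoder, so they cannot be altered or merged with neighbouring pieces.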

The class sets _auto_class = 'AutoTokenizer', which registers it for Hugging Face auto-class integration.

Usage

Use this tokenizer when working with InternLM2-based InternVL models to convert text inputs into token IDs for the language model and to decode model outputs back to human-readable text.

Code Reference

Source Location

Signature

class InternLM2Tokenizer(PreTrainedTokenizer):
    _auto_class = 'AutoTokenizer'

    def __init__(self, vocab_file, unk_token='<unk>', bos_token='<s>',
                 eos_token='</s>', pad_token='</s>',
                 sp_model_kwargs=None, add_bos_token=True,
                 add_eos_token=False, ...): ...

    @property
    def vocab_size(self) -> int: ...
    def get_vocab(self) -> dict: ...
    def _tokenize(self, text) -> list: ...
    def convert_tokens_to_string(self, tokens) -> str: ...
    def save_vocabulary(self, save_directory, filename_prefix=None) -> Tuple[str]: ...
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None) -> list: ...

Import

from internvl.model.internlm2.tokenization_internlm2 import InternLM2Tokenizer

I/O Contract

Inputs

Name          | Type | Required | Description
vocab_file    | str  | Yes      | Path to the SentencePiece model file (tokenizer.model)
add_bos_token | bool | No       | Whether to prepend the BOS token (default: True)
add_eos_token | bool | No       | Whether to append the EOS token (default: False)
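
To make the effect of add_bos_token and add_eos_token concrete, here is a minimal standalone sketch of the prepend/append behavior described for build_inputs_with_special_tokens. It is an illustration under assumptions, not the class's implementation: `BOS_ID` and `EOS_ID` are hypothetical values (the real IDs come from the loaded SentencePiece model), and `special_tokens_mask` is a simplified mask that may differ from what get_special_tokens_mask computes in the repository.

```python
from typing import List, Optional

# Hypothetical IDs for illustration; the real values come from the
# loaded SentencePiece model.
BOS_ID = 1
EOS_ID = 2


def build_inputs(token_ids_0: List[int],
                 token_ids_1: Optional[List[int]] = None,
                 add_bos_token: bool = True,
                 add_eos_token: bool = False) -> List[int]:
    """Prepend BOS and append EOS according to the two flags."""
    output = ([BOS_ID] if add_bos_token else []) + list(token_ids_0)
    if token_ids_1 is not None:
        output += list(token_ids_1)
    if add_eos_token:
        output.append(EOS_ID)
    return output


def special_tokens_mask(ids: List[int]) -> List[int]:
    """Return 1 where the ID is BOS or EOS, 0 elsewhere."""
    return [1 if i in (BOS_ID, EOS_ID) else 0 for i in ids]
```

With the documented defaults, only BOS is added, so a two-token input grows by exactly one ID.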

Outputs

Name      | Type               | Description
tokenizer | InternLM2Tokenizer | A tokenizer instance for encoding/decoding text

Usage Examples

Basic Usage

from internvl.model.internlm2.tokenization_internlm2 import InternLM2Tokenizer

tokenizer = InternLM2Tokenizer(vocab_file='path/to/tokenizer.model')

# Encode text
token_ids = tokenizer.encode("Hello, world!")

# Decode back to text
text = tokenizer.decode(token_ids)
