Principle:OpenGVLab InternVL SentencePiece BPE Tokenization

From Leeroopedia


Knowledge Sources
Domains Tokenization, NLP, Language Model
Last Updated 2026-02-07 14:00 GMT

Overview

Tokenization converts raw text into a sequence of token IDs that the language model can process, and decoding performs the inverse operation. The tokenizer uses SentencePiece Byte-Pair Encoding (BPE) and is available in two implementations: a slow Python tokenizer and a fast Rust-backed one.

Description

The InternLM2 language model uses SentencePiece BPE tokenization with the following key characteristics:

  • Byte-Pair Encoding -- Iteratively merges the most frequent byte/character pairs to build a subword vocabulary that sits between character-level and word-level representations.
  • Special tokens -- BOS (beginning-of-sequence), EOS (end-of-sequence), UNK (unknown), and PAD tokens are managed separately from the SentencePiece model.
  • Prefix space handling -- The SentencePiece word boundary marker (▁, U+2581, which resembles an underscore) distinguishes word-initial tokens from word-internal tokens, with special handling during detokenization to restore spaces.
  • Byte fallback -- Unknown characters are encoded as byte sequences, ensuring that any input text can be tokenized.
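
The word boundary marker and byte fallback can be illustrated with a small self-contained sketch. This is not the InternLM2 tokenizer code: `TOY_VOCAB`, `to_pieces`, and `from_pieces` are hypothetical names, and greedy longest-match is swapped in for the real BPE merge procedure to keep the example short. Only the `▁` marker and the `<0xNN>` byte-piece format follow SentencePiece conventions.

```python
WORD_BOUNDARY = "\u2581"  # SentencePiece's "▁" marker

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"\u2581hello", "\u2581world", "\u2581", "!"}


def to_pieces(text, vocab):
    """Split text into pieces using greedy longest-match plus byte fallback.

    Greedy matching stands in for BPE merges here; the point is the
    marker handling and the fallback, not the segmentation algorithm.
    """
    # Prepend the marker and turn every space into the marker.
    normalized = WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)
    pieces = []
    i = 0
    while i < len(normalized):
        for j in range(len(normalized), i, -1):
            if normalized[i:j] in vocab:
                pieces.append(normalized[i:j])
                i = j
                break
        else:
            # Byte fallback: emit one <0xNN> piece per UTF-8 byte.
            for byte in normalized[i].encode("utf-8"):
                pieces.append(f"<0x{byte:02X}>")
            i += 1
    return pieces


def from_pieces(pieces):
    """Invert to_pieces: reassemble bytes, then turn markers back into spaces."""
    buf = bytearray()
    for piece in pieces:
        if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
            buf.append(int(piece[3:5], 16))
        else:
            buf.extend(piece.encode("utf-8"))
    text = buf.decode("utf-8", errors="replace").replace(WORD_BOUNDARY, " ")
    # Drop the space introduced by the leading marker.
    return text[1:] if text.startswith(" ") else text
```

With this toy vocabulary, `to_pieces("hello é", TOY_VOCAB)` yields a word-initial piece for "hello", a bare marker, and two byte pieces for the UTF-8 encoding of "é", and `from_pieces` restores the original string.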

Two implementations are provided:

  • Slow tokenizer (InternLM2Tokenizer) -- Implemented in Python on top of the SentencePiece library, providing full compatibility and serialization support.
  • Fast tokenizer (InternLM2TokenizerFast) -- Rust-backed using the HuggingFace Tokenizers library, with the BPE vocabulary and merges extracted from the SentencePiece model, providing significantly faster batch tokenization for inference.

Both are registered with AutoTokenizer for seamless HuggingFace integration, so either can be loaded via AutoTokenizer.from_pretrained().

Usage

Use the tokenizer to convert text inputs into token IDs for the InternLM2 language model and to decode model outputs back to human-readable text. The fast tokenizer is preferred for inference throughput; the slow tokenizer is used as a fallback and for vocabulary manipulation.
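
A minimal sketch of this workflow follows. `load_tokenizer` and `round_trip` are hypothetical helper names, and the model identifier is left as a parameter because the page does not name a checkpoint. The sketch assumes the checkpoint ships its tokenizer classes as custom code, which is why `trust_remote_code=True` is passed; `use_fast` selects between the fast and slow implementations.

```python
def load_tokenizer(model_id, use_fast=True):
    """Load the fast (default) or slow tokenizer for a checkpoint.

    model_id is a Hub ID or local path supplied by the caller.
    """
    # Imported lazily so this sketch can be imported without transformers installed.
    from transformers import AutoTokenizer

    # Assumption: tokenizer classes are shipped with the checkpoint,
    # so remote code must be trusted for AutoTokenizer to resolve them.
    return AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True, use_fast=use_fast
    )


def round_trip(tokenizer, text):
    """Encode text to token IDs, then decode the IDs back to a string."""
    ids = tokenizer(text)["input_ids"]
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, decoded
```

A caller would pass its own checkpoint path, for example `round_trip(load_tokenizer("path/to/checkpoint"), "Hello world")`, and switch to `use_fast=False` when the slow tokenizer is needed.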

Theoretical Basis

BPE (Sennrich et al., 2016) is a data-driven tokenization algorithm that starts with a character-level vocabulary and iteratively merges the most frequent adjacent pair of symbols, adding each merged symbol to the vocabulary. SentencePiece (Kudo & Richardson, 2018) wraps this in a language-independent, unsupervised framework that operates directly on raw text as a sequence of Unicode characters and treats whitespace as an ordinary symbol, eliminating the need for language-specific pre-tokenization. The Rust-backed fast tokenizer applies the same vocabulary and merges; its speed advantage comes from compiled code and parallelism across a batch rather than from a different algorithm.
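
The merge loop described above can be sketched as a toy trainer. This is an illustrative simplification, not SentencePiece's implementation: it trains on isolated words rather than whole sentences, ignores the `▁` marker, and breaks ties by symbol order, which is an arbitrary choice made here for determinism. `merge_pair`, `learn_bpe`, and `apply_bpe` are hypothetical names.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to num_merges merges from a list of words."""
    # Start from characters: each word is a tuple of single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair; ties broken by symbol order (arbitrary).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges


def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["ab"] * 3 + ["abc"] * 2 + ["bc"], num_merges=2)
print(merges)                    # [('a', 'b'), ('ab', 'c')]
print(apply_bpe("cab", merges))  # ['c', 'ab']
```

In the toy corpus the pair (a, b) occurs five times and is merged first; the merged symbol "ab" then pairs with "c" twice, which makes (ab, c) the second merge.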
