Principle: Turboderp org Exllamav2 Tokenizer Initialization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization, Text_Processing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Tokenization converts raw text into numerical token IDs that a language model can process, and decodes token IDs back into human-readable text.
Description
Language models operate on discrete token sequences, not raw text. A tokenizer defines the vocabulary (the set of all valid tokens) and the rules for splitting text into tokens. Key aspects include:
- Encoding: Converting a string into a sequence of integer token IDs. This involves splitting the text according to the tokenizer's algorithm (BPE, SentencePiece, etc.) and mapping each piece to its vocabulary index.
- Decoding: Converting a sequence of token IDs back into a string. This must handle partial tokens, byte-level encodings, and special token markers correctly.
- Special tokens: Models rely on special tokens with semantic meaning:
  - BOS (Beginning of Sequence): Marks the start of input. Some model families (e.g., Llama) require it; others do not.
  - EOS (End of Sequence): Signals generation completion. Used as a stop condition.
  - Padding token: Used to pad sequences to equal length in batched inference.
  - Added vocabulary tokens: Tokens added after initial training (e.g., chat-control tokens like <|im_start|> and <|im_end|>).
- Model-specific conventions: Different model families have different tokenizer configurations. Some use SentencePiece models (tokenizer.model), others use JSON-based tokenizers (tokenizer.json), and they may differ in how they handle whitespace, capitalization, and special characters.
Correct tokenization is critical because the model was trained with a specific tokenizer, and using a different tokenization scheme would produce meaningless input to the model.
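The encode/decode round trip and special-token handling described above can be illustrated with a toy word-level tokenizer. The vocabulary, token IDs, and helper names below are invented for this sketch; real tokenizers use subword algorithms and preserve whitespace inside tokens, which this simplification does not.

```python
import re

# Invented toy vocabulary; real vocabularies hold tens of thousands of subwords.
VOCAB = {"<s>": 1, "</s>": 2, "Hello": 10, ",": 11, "world": 12, "!": 13}
INV_VOCAB = {i: t for t, i in VOCAB.items()}
BOS_ID, EOS_ID = VOCAB["<s>"], VOCAB["</s>"]

def encode(text, add_bos=True):
    # Naive word/punctuation split stands in for real pre-tokenization rules.
    pieces = re.findall(r"\w+|[^\w\s]", text)
    ids = [VOCAB[p] for p in pieces]
    # Prepend BOS only for model families that expect it.
    return [BOS_ID] + ids if add_bos else ids

def decode(ids):
    # Skip special tokens, as most decoders do by default. Joining with
    # spaces loses the original spacing; real tokenizers avoid this by
    # encoding whitespace as part of the tokens themselves.
    specials = {BOS_ID, EOS_ID}
    return " ".join(INV_VOCAB[i] for i in ids if i not in specials)
```

Using a different vocabulary than the one shown here would yield different IDs for the same text, which is exactly why the model's own tokenizer must be used.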
Usage
Tokenizer initialization is required for any text-based interaction with the model:
- Encoding user prompts before inference
- Decoding generated token IDs into text
- Determining stop condition token IDs
- Measuring prompt length in tokens for context management
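The last two usage patterns can be sketched with plain lists of token IDs. The context size, EOS ID, and function names below are assumptions for illustration, not part of any specific API.

```python
# Hypothetical values for the sketch; real values come from the model config
# and tokenizer.
MAX_SEQ_LEN = 2048
EOS_ID = 2
STOP_IDS = frozenset({EOS_ID})

def fits_context(prompt_ids, max_new_tokens):
    # Context management: the prompt plus the tokens we intend to generate
    # must fit within the model's maximum sequence length.
    return len(prompt_ids) + max_new_tokens <= MAX_SEQ_LEN

def is_stop(token_id, stop_ids=STOP_IDS):
    # Stop condition: generation halts when the model emits any stop token.
    return token_id in stop_ids
```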
Theoretical Basis
```python
# Tokenization process:
text = "Hello, world!"

# Step 1: Pre-tokenization (split into words/pieces based on rules)
pieces = pre_tokenize(text)  # ["Hello", ",", " world", "!"]

# Step 2: Subword encoding (BPE merges or SentencePiece)
token_ids = []
for piece in pieces:
    ids = encode_subword(piece, vocabulary, merge_rules)
    token_ids.extend(ids)

# Step 3: Add special tokens if required
if model_requires_bos:
    token_ids = [bos_token_id] + token_ids

# Result: [1, 15043, 29892, 3186, 29991] (example for Llama)
```
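The steps above can be made concrete with greedy longest-match subword encoding, a simplification of real BPE (which applies learned merge ranks) and closer in spirit to WordPiece. The vocabulary and IDs here are invented for the sketch.

```python
# Invented toy subword vocabulary for illustration only.
VOCAB = {"<s>": 1, "Hel": 100, "lo": 101, "Hello": 102, "!": 103}

def encode_subword(piece, vocab):
    # Greedily take the longest vocabulary entry matching at each position.
    ids, start = [], 0
    while start < len(piece):
        for end in range(len(piece), start, -1):
            sub = piece[start:end]
            if sub in vocab:
                ids.append(vocab[sub])
                start = end
                break
        else:
            # Real tokenizers fall back to byte tokens here instead of failing.
            raise ValueError(f"no token covers {piece[start]!r}")
    return ids

def encode(text, add_bos=True):
    pieces = text.split()  # stand-in for real pre-tokenization
    ids = [tid for p in pieces for tid in encode_subword(p, VOCAB)]
    return ([VOCAB["<s>"]] + ids) if add_bos else ids
```

Note how "Hellolo" splits into the two longest covering entries, "Hello" then "lo", rather than "Hel" + "lo" + "lo"; real BPE resolves such choices by merge rank instead of match length.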
Vocabulary Structure
```python
# A typical LLM vocabulary:
# - Base vocabulary: 32,000 - 128,000 tokens
# - Special tokens: BOS, EOS, PAD, UNK
# - Added tokens: model-specific control tokens
#
# Token ID mapping (Llama-style SentencePiece layout):
#   0     -> <unk>  (unknown)
#   1     -> <s>    (BOS)
#   2     -> </s>   (EOS)
#   3-258 -> byte tokens (256 byte-fallback entries, one per raw byte value)
#   259+  -> learned subword tokens
```