Implementation:Turboderp org Exllamav2 ExLlamaV2TokenizerBase
| Knowledge Sources | |
|---|---|
| Domains | Tokenization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
ExLlamaV2TokenizerBase is an abstract base class defining the interface that all ExLlamaV2 tokenizer implementations must provide, including encoding, decoding, special token IDs, and vocabulary enumeration.
Description
This class establishes the contract for tokenizer backends in ExLlamaV2. It defines abstract methods for core tokenization operations (encode, decode, id_to_piece, piece_to_id, vocab_size, enumerate_tokens), character-mapping accessors (space_char, newline_char), and special token accessors (unk_id, pad_id, bos_id, eos_id, and their string counterparts). All abstract methods raise NotImplementedError by default, so subclasses must provide implementations.
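The override contract can be illustrated with a minimal, self-contained sketch. `SketchBase` and `SketchTokenizer` are hypothetical stand-ins, not the library's classes; they only show how a method that raises NotImplementedError behaves when a subclass leaves it unimplemented.

```python
class SketchBase:
    """Hypothetical stand-in for an abstract tokenizer base (not the real class)."""

    def vocab_size(self) -> int:
        raise NotImplementedError()

    def bos_id(self):
        raise NotImplementedError()


class SketchTokenizer(SketchBase):
    """Overrides only vocab_size; bos_id is deliberately left unimplemented."""

    def vocab_size(self) -> int:
        return 4


tok = SketchTokenizer()
size = tok.vocab_size()  # overridden method returns normally

try:
    tok.bos_id()  # inherited method still raises
    bos_missing = False
except NotImplementedError:
    bos_missing = True
```

Because the signature shown on this page is a plain class rather than an `abc.ABC`, a missing override in a design like this would presumably surface only when the method is called, not when the subclass is instantiated.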
The base class also provides concrete helper methods:
- clean_special_chars(p) - Replaces tokenizer-internal space and newline representations with their actual characters. Uses the abstract space_char() and newline_char() methods to get the replacement mappings.
- piece_to_ord(p) - Converts a token piece string to its ordinal value. Handles both hex-encoded byte tokens (e.g., <0x0A> for newline), via regex matching, and single-character pieces. Returns -1 if the piece does not represent a single byte.
- id_to_ord(idx) - Convenience method combining id_to_piece() and piece_to_ord() to get the ordinal value of a token ID directly.
- deduce_char_map(input_char) - Determines the internal representation of a character by encoding it, retrieving the piece, and resolving hex-encoded forms. Used by subclasses to auto-detect how the tokenizer represents spaces and newlines internally.
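The encode, look up, resolve sequence described for deduce_char_map can be sketched as follows. This is not the library's code: `ToyTokenizer`, `ORD_EXP`, and `deduce_char_map_sketch` are hypothetical names, and the choices to use the last token ID and the last character of the piece are assumptions made for illustration.

```python
import re

# Same pattern shape as the hex-encoded byte token form, e.g. <0x0A>.
ORD_EXP = re.compile(r"^<0x([0-9A-Fa-f]+)>$")


class ToyTokenizer:
    """Hypothetical four-token vocabulary, for illustration only."""

    pieces = ["<unk>", "\u2581", "<0x0A>", "a"]
    char_to_id = {" ": 1, "\n": 2, "a": 3}

    def encode(self, text: str) -> list:
        return [self.char_to_id.get(ch, 0) for ch in text]

    def id_to_piece(self, idx: int) -> str:
        return self.pieces[idx]


def deduce_char_map_sketch(tok, input_char: str) -> str:
    # Assumption: the last ID of the encoding corresponds to the character.
    char_id = tok.encode(input_char)[-1]
    piece = tok.id_to_piece(char_id)
    match = ORD_EXP.match(piece)
    if match:
        # Hex-encoded byte token: resolve it to the actual character.
        return chr(int(match.group(1), 16))
    # Assumption: otherwise the last character of the piece is the marker.
    return piece[-1]


space_repr = deduce_char_map_sketch(ToyTokenizer(), " ")
newline_repr = deduce_char_map_sketch(ToyTokenizer(), "\n")
```

With this toy vocabulary, the space maps to the SentencePiece-style marker U+2581, while the newline, stored as a hex byte token, resolves back to a literal newline.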
The class uses a compiled regex pattern ord_exp (^<0x([0-9A-Fa-f]+)>$) for parsing hex-encoded byte tokens.
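As a rough illustration of how such a pattern can drive piece_to_ord, the standalone sketch below parses hex-encoded pieces and falls back to single characters. The function name `piece_to_ord_sketch` is hypothetical, the 255 upper bound on single characters is an assumption, and the sketch skips any special-character cleaning the real method may perform.

```python
import re

ORD_EXP = re.compile(r"^<0x([0-9A-Fa-f]+)>$")


def piece_to_ord_sketch(piece: str) -> int:
    """Return the byte ordinal for a piece, or -1 if it is not a single byte."""
    match = ORD_EXP.match(piece)
    if match:
        # Hex-encoded byte token such as <0x0A>: parse the hex digits.
        return int(match.group(1), 16)
    # Assumption: a single character counts only if it fits in one byte.
    if len(piece) == 1 and ord(piece) <= 255:
        return ord(piece)
    return -1


newline_ord = piece_to_ord_sketch("<0x0A>")
letter_ord = piece_to_ord_sketch("A")
word_ord = piece_to_ord_sketch("hello")
lower_hex_ord = piece_to_ord_sketch("<0xff>")
```

The anchors `^` and `$` matter here: a piece that merely contains a hex token inside longer text does not match and falls through to the -1 sentinel.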
Usage
Do not instantiate ExLlamaV2TokenizerBase directly. Use a concrete subclass such as ExLlamaV2TokenizerHF for HuggingFace tokenizers or ExLlamaV2TokenizerSPM for SentencePiece models. Refer to the base class only in code that must remain tokenizer-backend agnostic.
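A backend-agnostic helper might look like the sketch below. To stay self-contained it does not import exllamav2: `TokenizerLike`, `CharTokenizer`, and `truncate_to_tokens` are hypothetical names, and the protocol covers only the two methods the helper calls. In real code the parameter would more likely be annotated with ExLlamaV2TokenizerBase.

```python
from typing import List, Protocol


class TokenizerLike(Protocol):
    """Hypothetical structural type covering the two methods used below."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, ids: List[int]) -> str: ...


def truncate_to_tokens(tok: TokenizerLike, text: str, max_tokens: int) -> str:
    """Keep at most max_tokens tokens of text, using only encode and decode."""
    ids = tok.encode(text)
    return tok.decode(ids[:max_tokens])


class CharTokenizer:
    """Toy backend with one token per character (illustration only)."""

    def encode(self, text: str) -> List[int]:
        return [ord(c) for c in text]

    def decode(self, ids: List[int]) -> str:
        return "".join(chr(i) for i in ids)


short = truncate_to_tokens(CharTokenizer(), "hello world", 5)
```

The helper never checks which backend it received, so any tokenizer that honors the encode and decode contract can be swapped in.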
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/tokenizer/base.py
- Lines: 1-62
Signature
class ExLlamaV2TokenizerBase:
    ord_exp = re.compile(r"^<0x([0-9A-Fa-f]+)>$")
    def __init__(self): ...
    # Abstract methods (raise NotImplementedError)
    def unk_id(self) -> int or None: ...
    def pad_id(self) -> int or None: ...
    def bos_id(self) -> int or None: ...
    def eos_id(self) -> int or None: ...
    def unk_token(self) -> str or None: ...
    def pad_token(self) -> str or None: ...
    def bos_token(self) -> str or None: ...
    def eos_token(self) -> str or None: ...
    def space_char(self) -> str: ...
    def newline_char(self) -> str: ...
    def enumerate_tokens(self): ...
    def vocab_size(self) -> int: ...
    def id_to_piece(self, idx: int) -> str: ...
    def piece_to_id(self, text: str) -> int: ...
    def decode(self, ids: list) -> str: ...
    def encode(self, text: list or str) -> list: ...
    # Concrete helper methods
    def clean_special_chars(self, p) -> str: ...
    def piece_to_ord(self, p) -> int: ...
    def id_to_ord(self, idx: int) -> int: ...
    def deduce_char_map(self, input_char) -> str: ...
Import
from exllamav2.tokenizer.base import ExLlamaV2TokenizerBase
I/O Contract
Abstract Methods
| Method | Return Type | Description |
|---|---|---|
| unk_id() | int or None | Unknown token ID, or None if not defined |
| pad_id() | int or None | Padding token ID, or None if not defined |
| bos_id() | int or None | Beginning-of-sequence token ID, or None if not defined |
| eos_id() | int or None | End-of-sequence token ID, or None if not defined |
| vocab_size() | int | Total number of tokens in the vocabulary |
| encode(text) | list | Encode a text string (or list) into a list of token IDs |
| decode(ids) | str | Decode a list of token IDs into a text string |
| id_to_piece(idx) | str | Convert a token ID to its string piece representation |
| piece_to_id(text) | int | Convert a string piece to its token ID |
| enumerate_tokens() | iterator | Yield (index, piece) pairs for the full vocabulary |
Helper Methods
| Method | Parameter | Return | Description |
|---|---|---|---|
| clean_special_chars | p: str | str | Replace internal space/newline chars with standard characters |
| piece_to_ord | p: str | int | Get byte ordinal of a token piece (-1 if not a single byte) |
| id_to_ord | idx: int | int | Get byte ordinal of a token ID (-1 if not a single byte) |
| deduce_char_map | input_char: str | str | Determine internal representation of a character |
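One way the -1 sentinel from id_to_ord can be used is to collect every single-byte token in a vocabulary. The sketch below assumes a toy class, `ToyByteTokenizer`, that mimics the helper behavior described in the table; it is a hypothetical stand-in, not the library's implementation, and `single_byte_tokens` is an illustrative helper name.

```python
import re


class ToyByteTokenizer:
    """Hypothetical stand-in that mimics the helpers described above."""

    ord_exp = re.compile(r"^<0x([0-9A-Fa-f]+)>$")
    pieces = ["<unk>", "<0x0A>", "a", "hello", "<0x41>"]

    def vocab_size(self) -> int:
        return len(self.pieces)

    def id_to_piece(self, idx: int) -> str:
        return self.pieces[idx]

    def piece_to_ord(self, p: str) -> int:
        match = self.ord_exp.match(p)
        if match:
            return int(match.group(1), 16)
        if len(p) == 1 and ord(p) <= 255:  # assumption: one-byte range only
            return ord(p)
        return -1

    def id_to_ord(self, idx: int) -> int:
        # Composition of the two lookups, as described for id_to_ord.
        return self.piece_to_ord(self.id_to_piece(idx))


def single_byte_tokens(tok) -> dict:
    """Map token ID to byte ordinal, skipping IDs that return the -1 sentinel."""
    result = {}
    for idx in range(tok.vocab_size()):
        ordinal = tok.id_to_ord(idx)
        if ordinal != -1:
            result[idx] = ordinal
    return result


byte_map = single_byte_tokens(ToyByteTokenizer())
```

In this toy vocabulary only the two hex-encoded pieces and the single letter survive the filter; multi-character pieces such as `<unk>` and `hello` are dropped.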
Usage Examples
from exllamav2.tokenizer.base import ExLlamaV2TokenizerBase
# ExLlamaV2TokenizerBase is abstract; use a concrete subclass
# This example shows the interface contract
class MyTokenizer(ExLlamaV2TokenizerBase):
    def unk_id(self): return 0
    def pad_id(self): return None
    def bos_id(self): return 1
    def eos_id(self): return 2
    def vocab_size(self): return 32000
    def space_char(self): return "\u2581"
    def newline_char(self): return "\n"
    def encode(self, text): ...
    def decode(self, ids): ...
    def id_to_piece(self, idx): ...
    def piece_to_id(self, text): ...
    def enumerate_tokens(self): ...
    def unk_token(self): return "<unk>"
    def pad_token(self): return None
    def bos_token(self): return "<s>"
    def eos_token(self): return "</s>"
# Using helper methods from the base class
tok = MyTokenizer()
cleaned = tok.clean_special_chars("\u2581hello") # " hello"
ordinal = tok.piece_to_ord("<0x0A>") # 10 (newline)
ordinal_none = tok.piece_to_ord("hello") # -1 (not a single byte)
Related Pages
- Turboderp_org_Exllamav2_ExLlamaV2TokenizerHF - HuggingFace tokenizer implementation of this base class