Implementation:Turboderp org Exllamav2 ExLlamaV2TokenizerBase

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Tokenization
Last Updated	2026-02-15 00:00 GMT

Overview

ExLlamaV2TokenizerBase is an abstract base class defining the interface that all ExLlamaV2 tokenizer implementations must provide, including encoding, decoding, special token IDs, and vocabulary enumeration.

Description

This class establishes the contract for tokenizer backends in ExLlamaV2. It defines abstract methods for core tokenization operations (encode, decode, id_to_piece, piece_to_id, vocab_size, enumerate_tokens), special token accessors (unk_id, pad_id, bos_id, eos_id, and their string counterparts), and the internal character mappings (space_char, newline_char). The methods are abstract by convention: each one raises NotImplementedError by default, so subclasses must override them.

The base class also provides concrete helper methods:

clean_special_chars(p) - Replaces tokenizer-internal space and newline representations with their actual characters. Uses the abstract space_char() and newline_char() methods to get the replacement mappings.

piece_to_ord(p) - Converts a token piece string to its ordinal value. Handles both hex-encoded byte tokens (e.g., <0x0A> for newline) via regex matching and single-character pieces. Returns -1 if the piece does not represent a single byte.

id_to_ord(idx) - Convenience method combining id_to_piece() and piece_to_ord() to get the ordinal value of a token ID directly.

deduce_char_map(input_char) - Determines the internal representation of a character by encoding it, retrieving the piece, and resolving hex-encoded forms. Used by subclasses to auto-detect how the tokenizer represents spaces and newlines internally.

The class uses a compiled regex pattern ord_exp (^<0x([0-9A-Fa-f]+)>$) for parsing hex-encoded byte tokens.
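
To make the byte-token handling concrete, here is a minimal standalone sketch of the logic described for piece_to_ord. It is an illustration, not the library's code: the function name piece_to_ord_sketch, the default space_char and newline_char arguments, and the upper bound of 255 for single-character pieces are assumptions made for this example. Only the regex pattern is taken from this page.

```python
import re

# Pattern given on this page for hex-encoded byte tokens such as "<0x0A>".
ORD_EXP = re.compile(r"^<0x([0-9A-Fa-f]+)>$")


def piece_to_ord_sketch(piece: str,
                        space_char: str = "\u2581",
                        newline_char: str = "\n") -> int:
    """Hypothetical stand-in for piece_to_ord (not the library implementation)."""
    match = ORD_EXP.match(piece)
    if match:
        # Hex-encoded byte token: parse the captured hex digits.
        return int(match.group(1), 16)
    if len(piece) == 1:
        # Map the tokenizer-internal space/newline to the real character first.
        cleaned = piece.replace(space_char, " ").replace(newline_char, "\n")
        value = ord(cleaned)
        # Assumption: only code points that fit in one byte count as a byte.
        if value <= 255:
            return value
    return -1
```

Under these assumptions, "<0x0A>" resolves to 10, the internal space character resolves to 32, and any multi-character piece such as "hello" yields -1.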

Usage

Do not instantiate ExLlamaV2TokenizerBase directly. Use a concrete subclass such as ExLlamaV2TokenizerHF for HuggingFace tokenizers or ExLlamaV2TokenizerSPM for SentencePiece models. Reference the base class itself only when writing tokenizer-backend-agnostic code, for example in type annotations.
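
As a rough illustration of backend-agnostic code, the sketch below writes a helper purely against the interface methods encode, bos_id, and eos_id. ToyTokenizer and encode_with_specials are hypothetical names invented for this example. The toy class does not inherit from ExLlamaV2TokenizerBase, so the snippet runs without the library installed; with the library available, any concrete subclass could be passed instead.

```python
class ToyTokenizer:
    """Hypothetical stand-in exposing part of the interface (not an ExLlamaV2 class)."""

    def __init__(self):
        self._pieces = ["<unk>", "<s>", "</s>", "<0x0A>", "\u2581hello", "\u2581world"]

    def vocab_size(self): return len(self._pieces)
    def id_to_piece(self, idx): return self._pieces[idx]
    def piece_to_id(self, text): return self._pieces.index(text)
    def bos_id(self): return 1
    def eos_id(self): return 2

    def encode(self, text):
        # Toy rule: one token per whitespace-separated word.
        return [self.piece_to_id("\u2581" + word) for word in text.split()]

    def decode(self, ids):
        text = "".join(self._pieces[i] for i in ids)
        return text.replace("\u2581", " ").lstrip()


def encode_with_specials(tokenizer, text):
    """Wrap encoded IDs in BOS/EOS, skipping any special token that is None."""
    ids = list(tokenizer.encode(text))
    bos, eos = tokenizer.bos_id(), tokenizer.eos_id()
    if bos is not None:
        ids.insert(0, bos)
    if eos is not None:
        ids.append(eos)
    return ids


ids = encode_with_specials(ToyTokenizer(), "hello world")
```

Because the special token accessors may return None, the helper checks each one before using it rather than assuming every backend defines BOS and EOS.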

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: exllamav2/tokenizer/base.py
Lines: 1-62

Signature

class ExLlamaV2TokenizerBase:

    ord_exp = re.compile(r"^<0x([0-9A-Fa-f]+)>$")

    def __init__(self): ...

    # Abstract methods (raise NotImplementedError)
    def unk_id(self) -> int or None: ...
    def pad_id(self) -> int or None: ...
    def bos_id(self) -> int or None: ...
    def eos_id(self) -> int or None: ...
    def unk_token(self) -> str or None: ...
    def pad_token(self) -> str or None: ...
    def bos_token(self) -> str or None: ...
    def eos_token(self) -> str or None: ...
    def space_char(self) -> str: ...
    def newline_char(self) -> str: ...
    def enumerate_tokens(self): ...
    def vocab_size(self) -> int: ...
    def id_to_piece(self, idx: int) -> str: ...
    def piece_to_id(self, text: str) -> int: ...
    def decode(self, ids: list) -> str: ...
    def encode(self, text: list or str) -> list: ...

    # Concrete helper methods
    def clean_special_chars(self, p) -> str: ...
    def piece_to_ord(self, p) -> int: ...
    def id_to_ord(self, idx: int) -> int: ...
    def deduce_char_map(self, input_char) -> str: ...

Import

from exllamav2.tokenizer.base import ExLlamaV2TokenizerBase

I/O Contract

Abstract Methods

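Method	Return Type	Description
unk_id()	`int or None`	Unknown token ID, or None if not defined
pad_id()	`int or None`	Padding token ID, or None if not defined
bos_id()	`int or None`	Beginning-of-sequence token ID, or None if not defined
eos_id()	`int or None`	End-of-sequence token ID, or None if not defined
unk_token()	`str or None`	Unknown token string, or None if not defined
pad_token()	`str or None`	Padding token string, or None if not defined
bos_token()	`str or None`	Beginning-of-sequence token string, or None if not defined
eos_token()	`str or None`	End-of-sequence token string, or None if not defined
space_char()	`str`	Tokenizer-internal representation of the space character
newline_char()	`str`	Tokenizer-internal representation of the newline character
vocab_size()	`int`	Total number of tokens in the vocabulary
encode(text)	`list`	Encode a text string (or list of strings) into a list of token IDs
decode(ids)	`str`	Decode a list of token IDs into a text string
id_to_piece(idx)	`str`	Convert a token ID to its string piece representation
piece_to_id(text)	`int`	Convert a string piece to its token ID
enumerate_tokens()	`iterator`	Yield (index, piece) pairs for the full vocabulary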

Helper Methods

Method	Parameter	Return	Description
clean_special_chars	`p: str`	`str`	Replace internal space/newline chars with standard characters
piece_to_ord	`p: str`	`int`	Get byte ordinal of a token piece (-1 if not a single byte)
id_to_ord	`idx: int`	`int`	Get byte ordinal of a token ID (-1 if not a single byte)
deduce_char_map	`input_char: str`	`str`	Determine internal representation of a character
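
The deduce_char_map row can be illustrated with a small standalone sketch. This is an assumption-laden approximation of the behavior described on this page (encode the character, look up the piece, resolve a hex-encoded form), not the library's code. The names deduce_char_map_sketch, TOY_VOCAB, toy_encode, and toy_id_to_piece are hypothetical, and taking the last token and the last character of the piece is a guess at a reasonable rule.

```python
import re

# Pattern given on this page for hex-encoded byte tokens.
ORD_EXP = re.compile(r"^<0x([0-9A-Fa-f]+)>$")

# Hypothetical three-entry vocabulary: newline is stored as a byte token,
# space as the SentencePiece-style marker U+2581.
TOY_VOCAB = ["<0x0A>", "\u2581", "a"]
CHAR_TO_ID = {"\n": 0, " ": 1, "a": 2}


def toy_encode(text):
    return [CHAR_TO_ID[ch] for ch in text]


def toy_id_to_piece(idx):
    return TOY_VOCAB[idx]


def deduce_char_map_sketch(encode, id_to_piece, input_char):
    """Guess how a tokenizer represents input_char internally (illustrative only)."""
    char_id = encode(input_char)[-1]   # assumption: the last token carries the char
    piece = id_to_piece(char_id)
    match = ORD_EXP.match(piece)
    if match:
        # Hex-encoded byte token: convert back to the literal character.
        return chr(int(match.group(1), 16))
    # Assumption: otherwise the last character of the piece is the mapping.
    return piece[-1]


space_repr = deduce_char_map_sketch(toy_encode, toy_id_to_piece, " ")
newline_repr = deduce_char_map_sketch(toy_encode, toy_id_to_piece, "\n")
```

With this toy vocabulary, the space maps to "\u2581" and the newline maps to the literal "\n", which is the kind of result a subclass could return from space_char() and newline_char().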

Usage Examples

from exllamav2.tokenizer.base import ExLlamaV2TokenizerBase

# ExLlamaV2TokenizerBase is abstract; use a concrete subclass
# This example shows the interface contract

class MyTokenizer(ExLlamaV2TokenizerBase):
    def unk_id(self): return 0
    def pad_id(self): return None
    def bos_id(self): return 1
    def eos_id(self): return 2
    def vocab_size(self): return 32000
    def space_char(self): return "\u2581"
    def newline_char(self): return "\n"
    def encode(self, text): ...
    def decode(self, ids): ...
    def id_to_piece(self, idx): ...
    def piece_to_id(self, text): ...
    def enumerate_tokens(self): ...
    def unk_token(self): return "<unk>"
    def pad_token(self): return None
    def bos_token(self): return "<s>"
    def eos_token(self): return "</s>"

# Using helper methods from the base class
tok = MyTokenizer()
cleaned = tok.clean_special_chars("\u2581hello")  # " hello"
ordinal = tok.piece_to_ord("<0x0A>")              # 10 (newline)
not_a_byte = tok.piece_to_ord("hello")            # -1 (not a single byte)

Related Pages

Turboderp_org_Exllamav2_ExLlamaV2TokenizerHF - HuggingFace tokenizer implementation of this base class
