Implementation:Turboderp org Exllamav2 ExLlamaV2TokenizerBase

From Leeroopedia
Knowledge Sources
Domains: Tokenization
Last Updated: 2026-02-15 00:00 GMT

Overview

ExLlamaV2TokenizerBase is the abstract base class that defines the interface every ExLlamaV2 tokenizer implementation must provide: encoding, decoding, special-token accessors, and vocabulary enumeration.

Description

This class establishes the contract for tokenizer backends in ExLlamaV2. It declares abstract methods for the core tokenization operations (encode, decode, id_to_piece, piece_to_id, vocab_size, enumerate_tokens), for the special-token accessors (unk_id, pad_id, bos_id, eos_id, and their string counterparts unk_token, pad_token, bos_token, eos_token), and for the internal character mappings (space_char, newline_char). Each abstract method raises NotImplementedError by default, so subclasses must supply their own implementations.
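The "raise NotImplementedError by default" convention can be illustrated with a small standalone sketch. The class names SketchTokenizerBase and PartialTokenizer are hypothetical stand-ins and are not part of exllamav2; only the pattern is taken from the description above.

```python
class SketchTokenizerBase:
    """Hypothetical stand-in for the base class; not the library's code."""

    def vocab_size(self) -> int:
        raise NotImplementedError()

    def id_to_piece(self, idx: int) -> str:
        raise NotImplementedError()


class PartialTokenizer(SketchTokenizerBase):
    """Overrides only vocab_size, so id_to_piece still raises."""

    def vocab_size(self) -> int:
        return 3


tok = PartialTokenizer()
print(tok.vocab_size())

try:
    tok.id_to_piece(0)
except NotImplementedError:
    print("id_to_piece is not implemented by this subclass")
```

Unlike a class built on abc.ABC, a class written this way can be instantiated even when methods are missing; the error only appears when an unimplemented method is actually called.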

The base class also provides concrete helper methods:

  • clean_special_chars(p) - Replaces tokenizer-internal space and newline representations with their actual characters. Uses the abstract space_char() and newline_char() methods to get the replacement mappings.
  • piece_to_ord(p) - Converts a token piece string to its ordinal value. Handles both hex-encoded byte tokens (e.g., <0x0A> for newline) via regex matching and single-character pieces. Returns -1 if the piece does not represent a single byte.
  • id_to_ord(idx) - Convenience method combining id_to_piece() and piece_to_ord() to get the ordinal value of a token ID directly.
  • deduce_char_map(input_char) - Determines the internal representation of a character by encoding it, retrieving the piece, and resolving hex-encoded forms. Used by subclasses to auto-detect how the tokenizer represents spaces and newlines internally.
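
As a rough illustration of how deduce_char_map and clean_special_chars fit together, the sketch below reimplements the described steps as standalone functions over a toy vocabulary. The function names, the toy vocabulary, and the choices to inspect the last token and the last character of its piece are assumptions made for this example; they are not confirmed details of the library.

```python
import re

# Same pattern the page gives for hex-encoded byte tokens.
ORD_EXP = re.compile(r"^<0x([0-9A-Fa-f]+)>$")

# Hypothetical toy vocabulary: a SentencePiece-style space marker,
# a byte token for newline, and one ordinary piece.
TOY_PIECES = ["\u2581", "<0x0A>", "a"]
TOY_ENCODE = {" ": [0], "\n": [1], "a": [2]}


def toy_encode(text):
    return TOY_ENCODE[text]


def toy_id_to_piece(idx):
    return TOY_PIECES[idx]


def deduce_char_map_sketch(input_char, encode, id_to_piece):
    """Encode a character, fetch its piece, and resolve a hex-encoded form."""
    ids = encode(input_char)
    piece = id_to_piece(ids[-1])  # assumption: the last token carries the character
    match = ORD_EXP.match(piece)
    if match:
        return chr(int(match.group(1), 16))
    return piece[-1]  # assumption: the last character of the piece


def clean_special_chars_sketch(piece, space_char, newline_char):
    """Replace the internal space/newline markers with real characters."""
    return piece.replace(space_char, " ").replace(newline_char, "\n")


space_char = deduce_char_map_sketch(" ", toy_encode, toy_id_to_piece)
newline_char = deduce_char_map_sketch("\n", toy_encode, toy_id_to_piece)
cleaned = clean_special_chars_sketch("\u2581hello", space_char, newline_char)
print(repr(space_char), repr(newline_char), repr(cleaned))
```

In this toy setup the space marker is deduced as "\u2581" and the newline resolves from its byte token back to "\n", so cleaning "\u2581hello" yields " hello".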

The class uses a compiled regex pattern ord_exp (^<0x([0-9A-Fa-f]+)>$) for parsing hex-encoded byte tokens.
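
The following standalone sketch shows how a pattern like ord_exp can drive the piece-to-ordinal conversion described above. The function name piece_to_ord_sketch is hypothetical, and treating a single character as a byte only when its code point is at most 255 is an assumption of this example rather than a confirmed detail of the library.

```python
import re

# Pattern quoted on this page for hex-encoded byte tokens such as <0x0A>.
ORD_EXP = re.compile(r"^<0x([0-9A-Fa-f]+)>$")


def piece_to_ord_sketch(piece: str) -> int:
    """Return the byte ordinal a piece stands for, or -1 if it is not a single byte."""
    match = ORD_EXP.match(piece)
    if match:
        # Hex-encoded byte token: parse the captured hex digits.
        return int(match.group(1), 16)
    if len(piece) == 1 and ord(piece) <= 255:
        # Single-character piece (assumed byte range check).
        return ord(piece)
    return -1


for piece in ["<0x0A>", "<0xff>", "A", "hello", "<0x0A"]:
    print(repr(piece), piece_to_ord_sketch(piece))
```

The anchors ^ and $ ensure that only a piece consisting entirely of the <0x..> form is treated as a byte token, so a malformed piece such as "<0x0A" falls through to -1.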

Usage

Do not instantiate ExLlamaV2TokenizerBase directly. Use a concrete subclass such as ExLlamaV2TokenizerHF for HuggingFace tokenizers or ExLlamaV2TokenizerSPM for SentencePiece models. Refer to the base class itself, for example as a parameter type, when writing code that should work with any tokenizer backend.
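
A minimal sketch of backend-agnostic code follows. To stay runnable without exllamav2 installed, it uses a hypothetical TokenizerLike protocol and a toy CharTokenizer in place of the real classes; with the library available, the parameter could presumably be annotated as ExLlamaV2TokenizerBase instead.

```python
from typing import List, Protocol


class TokenizerLike(Protocol):
    """Hypothetical subset of the interface that this example relies on."""

    def encode(self, text: str) -> List[int]: ...
    def decode(self, ids: List[int]) -> str: ...
    def vocab_size(self) -> int: ...


class CharTokenizer:
    """Hypothetical toy backend: one token per character, ID = code point."""

    def encode(self, text: str) -> List[int]:
        return [ord(c) for c in text]

    def decode(self, ids: List[int]) -> str:
        return "".join(chr(i) for i in ids)

    def vocab_size(self) -> int:
        return 256


def round_trips(tokenizer: TokenizerLike, text: str) -> bool:
    """Check that text survives encode/decode and all IDs are inside the vocabulary."""
    ids = tokenizer.encode(text)
    in_range = all(0 <= i < tokenizer.vocab_size() for i in ids)
    return in_range and tokenizer.decode(ids) == text


print(round_trips(CharTokenizer(), "hello"))
```

Because round_trips touches only interface methods, any backend that implements them can be passed in without changes to the calling code.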

Code Reference

Source Location

Signature

class ExLlamaV2TokenizerBase:

    ord_exp = re.compile(r"^<0x([0-9A-Fa-f]+)>$")

    def __init__(self): ...

    # Abstract methods (raise NotImplementedError)
    def unk_id(self) -> int or None: ...
    def pad_id(self) -> int or None: ...
    def bos_id(self) -> int or None: ...
    def eos_id(self) -> int or None: ...
    def unk_token(self) -> str or None: ...
    def pad_token(self) -> str or None: ...
    def bos_token(self) -> str or None: ...
    def eos_token(self) -> str or None: ...
    def space_char(self) -> str: ...
    def newline_char(self) -> str: ...
    def enumerate_tokens(self): ...
    def vocab_size(self) -> int: ...
    def id_to_piece(self, idx: int) -> str: ...
    def piece_to_id(self, text: str) -> int: ...
    def decode(self, ids: list) -> str: ...
    def encode(self, text: list or str) -> list: ...

    # Concrete helper methods
    def clean_special_chars(self, p) -> str: ...
    def piece_to_ord(self, p) -> int: ...
    def id_to_ord(self, idx: int) -> int: ...
    def deduce_char_map(self, input_char) -> str: ...

Import

from exllamav2.tokenizer.base import ExLlamaV2TokenizerBase

I/O Contract

Abstract Methods

Method              Return Type   Description
unk_id()            int or None   Unknown token ID, or None if not defined
pad_id()            int or None   Padding token ID, or None if not defined
bos_id()            int or None   Beginning-of-sequence token ID, or None if not defined
eos_id()            int or None   End-of-sequence token ID, or None if not defined
vocab_size()        int           Total number of tokens in the vocabulary
encode(text)        list          Encode a text string (or list) into a list of token IDs
decode(ids)         str           Decode a list of token IDs into a text string
id_to_piece(idx)    str           Convert a token ID to its string piece representation
piece_to_id(text)   int           Convert a string piece to its token ID
enumerate_tokens()  iterator      Yield (index, piece) pairs for the full vocabulary

Helper Methods

Method               Parameter        Return  Description
clean_special_chars  p: str           str     Replace internal space/newline chars with standard characters
piece_to_ord         p: str           int     Get byte ordinal of a token piece (-1 if not a single byte)
id_to_ord            idx: int         int     Get byte ordinal of a token ID (-1 if not a single byte)
deduce_char_map      input_char: str  str     Determine internal representation of a character

Usage Examples

from exllamav2.tokenizer.base import ExLlamaV2TokenizerBase

# ExLlamaV2TokenizerBase is abstract; use a concrete subclass
# This example shows the interface contract

class MyTokenizer(ExLlamaV2TokenizerBase):
    def unk_id(self): return 0
    def pad_id(self): return None
    def bos_id(self): return 1
    def eos_id(self): return 2
    def vocab_size(self): return 32000
    def space_char(self): return "\u2581"
    def newline_char(self): return "\n"
    def encode(self, text): ...
    def decode(self, ids): ...
    def id_to_piece(self, idx): ...
    def piece_to_id(self, text): ...
    def enumerate_tokens(self): ...
    def unk_token(self): return "<unk>"
    def pad_token(self): return None
    def bos_token(self): return "<s>"
    def eos_token(self): return "</s>"

# Using helper methods from the base class
tok = MyTokenizer()
cleaned = tok.clean_special_chars("\u2581hello")  # " hello"
ordinal = tok.piece_to_ord("<0x0A>")              # 10 (newline)
ordinal_none = tok.piece_to_ord("hello")          # -1 (not a single byte)
