Overview
ExLlamaV2TokenizerHF is a concrete tokenizer implementation that wraps the HuggingFace Tokenizers library to provide ExLlamaV2-compatible tokenization, with automatic detection of BPE space and newline characters.
Description
This class extends ExLlamaV2TokenizerBase by wrapping a HuggingFace Tokenizer instance loaded from a tokenizer.json file. It implements all abstract methods defined in the base class.
During initialization, the class loads the tokenizer from the JSON file and inspects the underlying model type. If the model is a BPE tokenizer (e.g., GPT-style), it auto-detects the internal representations of space and newline characters using deduce_char_map() from the base class. For BPE models, spaces are typically represented as a special character like "Ġ" (U+0120) and newlines as "Ċ" (U+010A). Non-BPE models use standard space and newline characters.
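The two code points above are not arbitrary. They follow from the byte-to-unicode table used by GPT-2 style byte-level BPE, which moves non-printable bytes to code points starting at U+0100. The sketch below rebuilds that table on its own, without ExLlamaV2 or the Tokenizers library. The function name bytes_to_unicode is borrowed from the GPT-2 reference code and is an assumption here, not part of this class; deduce_char_map() may arrive at the characters by a different route.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Map each byte value to a printable character, as in GPT-2 byte-level BPE."""
    # Bytes that are already printable keep their own code point
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Remaining bytes (control characters, space, ...) are assigned U+0100 upward
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


byte_map = bytes_to_unicode()
space_char = byte_map[ord(" ")]      # byte 0x20
newline_char = byte_map[ord("\n")]   # byte 0x0A
print(f"U+{ord(space_char):04X}", f"U+{ord(newline_char):04X}")
```

Byte 0x20 is the 33rd non-printable byte and byte 0x0A the 11th, which is why they land on U+0120 and U+010A respectively.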
The enumerate_tokens() method handles a subtlety of HuggingFace tokenizers: some tokenizers cannot decode individual token IDs in isolation (they produce different results than when decoded as part of a sequence). The method detects this by encoding a test string (" t") and checking whether the decoded single-token result matches the expected output. If not, it uses a prefix-based decoding strategy where each token is decoded as a pair with a space token prefix, then the prefix portion is stripped. The decoded vocabulary is cached in self.vocab to avoid repeated computation.
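The detect-then-prefix idea can be illustrated with a toy tokenizer. This is a minimal sketch under stated assumptions, not the library's implementation: ToyTokenizer and build_vocab are hypothetical names, and the toy decoder simply drops one leading space from every decoded sequence, which is one way a decoder can make isolated decoding unreliable.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer whose decoder drops one leading space."""

    pieces = ["▁", "▁t", "t", "a"]  # "▁" marks a space inside the vocabulary

    def encode(self, text):
        # Greedy longest-match over the toy vocabulary
        text = text.replace(" ", "▁")
        ids = []
        while text:
            match = max((p for p in self.pieces if text.startswith(p)), key=len)
            ids.append(self.pieces.index(match))
            text = text[len(match):]
        return ids

    def decode(self, ids):
        text = "".join(self.pieces[i] for i in ids).replace("▁", " ")
        # The quirk under test: a single leading space is lost on decode
        return text[1:] if text.startswith(" ") else text


def build_vocab(tok):
    # Step 1: check whether " t" survives an encode/decode round trip as one token
    test_ids = tok.encode(" t")
    isolated_ok = len(test_ids) == 1 and tok.decode(test_ids) == " t"
    if isolated_ok:
        return [tok.decode([i]) for i in range(len(tok.pieces))]
    # Step 2: otherwise decode each token behind a space prefix, then cut the
    # decoded prefix off again so only the token's own text remains
    prefix_id = tok.encode(" ")[0]
    prefix_len = len(tok.decode([prefix_id]))
    return [tok.decode([prefix_id, i])[prefix_len:] for i in range(len(tok.pieces))]


tok = ToyTokenizer()
vocab = build_vocab(tok)
```

In this toy the prefix token absorbs the decoder's space-dropping, so the piece for "▁t" comes back with its leading space intact even though decoding that token alone loses it.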
Most special token accessors (unk_id, pad_id, bos_id, eos_id and their *_token counterparts) return None, since HuggingFace tokenizers handle special tokens at a higher level. The exception is unk_id(), which resolves through unk_token() when the underlying model exposes one.
Usage
Use ExLlamaV2TokenizerHF when loading models that ship with a HuggingFace-format tokenizer.json file. This is the default tokenizer backend for most modern LLMs. It is automatically selected by the ExLlamaV2Tokenizer wrapper class during model initialization.
Code Reference
Source Location
Signature
class ExLlamaV2TokenizerHF(ExLlamaV2TokenizerBase):
    space_char_: str
    newline_char_: str
    vocab: list[str] | None
    def __init__(self, tokenizer_json: str) -> None: ...
    # Special token accessors
    def unk_id(self) -> int or None: ...
    def pad_id(self) -> int or None: ...
    def bos_id(self) -> int or None: ...
    def eos_id(self) -> int or None: ...
    def unk_token(self) -> str or None: ...
    def pad_token(self) -> str or None: ...
    def bos_token(self) -> str or None: ...
    def eos_token(self) -> str or None: ...
    # Character mapping
    def space_char(self) -> str: ...
    def newline_char(self) -> str: ...
    # Core tokenization
    def enumerate_tokens(self): ...
    def vocab_size(self) -> int: ...
    def id_to_piece(self, idx: int) -> str: ...
    def piece_to_id(self, text: str) -> int: ...
    def decode(self, ids: List[int]) -> str: ...
    def encode(self, text: list or str) -> list: ...
Import
from exllamav2.tokenizer.hf import ExLlamaV2TokenizerHF
I/O Contract
__init__()
| Parameter | Type | Description |
| tokenizer_json | str | File path to the HuggingFace tokenizer.json file |
encode()
| Parameter | Type | Description |
| text | list or str | Text string (or list of strings) to encode |
| Return | Type | Description |
| ids | list[int] | List of token IDs (special tokens not added; uses add_special_tokens=False) |
decode()
| Parameter | Type | Description |
| ids | List[int] | List of token IDs to decode |
| Return | Type | Description |
| text | str | Decoded text string |
enumerate_tokens()
| Return | Type | Description |
| iterator | enumerate | Yields (index, decoded_piece) tuples for the entire vocabulary; result is cached after first call |
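The caching behavior noted above can be pictured as a lazily filled list with a fresh enumerate object handed out on each call. The following is a sketch of that pattern only: CachedVocab, _decode_all, and build_count are hypothetical names introduced for illustration, and the upper-casing step merely stands in for per-token decoding.

```python
class CachedVocab:
    """Hypothetical holder showing the lazy-cache pattern; not the library class."""

    def __init__(self, pieces):
        self._pieces = pieces
        self.vocab = None       # filled on the first enumerate_tokens() call
        self.build_count = 0    # counts how often the expensive path runs

    def _decode_all(self):
        self.build_count += 1
        return [p.upper() for p in self._pieces]  # stand-in for decoding each ID

    def enumerate_tokens(self):
        if self.vocab is None:
            self.vocab = self._decode_all()
        # A new enumerate object per call, backed by the same cached list
        return enumerate(self.vocab)


cache = CachedVocab(["a", "b", "c"])
first = list(cache.enumerate_tokens())
second = list(cache.enumerate_tokens())
```

Because the return value is an enumerate iterator, it can be consumed only once; a second pass over the vocabulary needs a second call, which is cheap once the list is cached.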
id_to_piece() / piece_to_id()
| Method | Parameter | Return | Description |
| id_to_piece | idx: int | str | Returns the raw token string for a given ID (None-safe, returns "" for None) |
| piece_to_id | text: str | int | Returns the token ID for a given piece string |
Usage Examples
from exllamav2.tokenizer.hf import ExLlamaV2TokenizerHF
# Load from a HuggingFace tokenizer.json file
tokenizer = ExLlamaV2TokenizerHF("/path/to/model/tokenizer.json")
# Encode text to token IDs
ids = tokenizer.encode("Hello, world!")
print(ids) # e.g., [15496, 11, 995, 0]
# Decode token IDs back to text
text = tokenizer.decode(ids)
print(text) # "Hello, world!"
# Get vocabulary size
print(tokenizer.vocab_size()) # e.g., 32000
# Convert between pieces and IDs
piece = tokenizer.id_to_piece(15496)
print(piece) # e.g., "Hello"
token_id = tokenizer.piece_to_id("Hello")
print(token_id) # e.g., 15496
# Enumerate the full vocabulary (cached after first call)
for idx, piece in tokenizer.enumerate_tokens():
    if idx < 5:
        print(f"Token {idx}: {repr(piece)}")
# Check internal character representations
print(repr(tokenizer.space_char())) # e.g., 'Ġ' for BPE models
print(repr(tokenizer.newline_char())) # e.g., 'Ċ' for BPE models