Overview
Python tokenizer interface and metadata configuration for encoding text to token IDs and decoding token IDs back to text in MLC-LLM.
Description
The tokenizers.py module provides the core tokenization layer for MLC-LLM. It contains two components: the TokenizerInfo dataclass for describing tokenizer post-processing behavior, and the Tokenizer runtime class for performing encode/decode operations.
TokenizerInfo is a pure-Python dataclass that captures metadata about how a tokenizer transforms raw token bytes back into readable strings. Different LLM families use different byte-encoding strategies in their tokenizers, and this class records which strategy is in effect:
token_postproc_method -- either "byte_fallback" (used by LLaMA-2, Mixtral, and similar models that encode special bytes as <0xHH> tokens and use the ▁ character for spaces) or "byte_level" (used by LLaMA-3, GPT-2, Phi-2, and similar models that use the GPT-2 bytes-to-unicode mapping where characters like Ġ represent spaces).
prepend_space_in_encode -- whether the tokenizer prepends a space character before encoding input text.
strip_space_in_decode -- whether the tokenizer strips the leading space character when decoding.
The class provides asjson() for JSON serialization and from_json() for deserialization, enabling this metadata to be passed across the Python/C++ boundary.
Tokenizer is a TVM-registered runtime object ("mlc.Tokenizer") that wraps the C++ tokenizer implementation backed by tokenizers-cpp, which itself binds the HuggingFace tokenizers library and SentencePiece. The Python class is a thin FFI wrapper exposing:
encode(text) -- encodes a single text string into a list of token IDs.
encode_batch(texts) -- encodes multiple text strings into a list of token ID lists.
decode(token_ids) -- decodes a list of token IDs back into a text string.
detect_tokenizer_info(tokenizer_path) -- a static method that auto-detects the TokenizerInfo configuration from a tokenizer directory.
Usage
Use Tokenizer whenever you need to convert between text and token IDs in MLC-LLM. It is a dependency for TextStreamer and StopStrHandler (both in streamer.py). The TokenizerInfo dataclass is used by downstream components that need to understand post-processing semantics, such as grammar-guided generation and structured decoding.
Code Reference
Source Location
- Repository: MLC-LLM
- File:
python/mlc_llm/tokenizers/tokenizers.py (Lines 1-130)
TokenizerInfo Dataclass
@dataclass
class TokenizerInfo:
token_postproc_method: Literal["byte_fallback", "byte_level"] = "byte_fallback"
prepend_space_in_encode: bool = False
strip_space_in_decode: bool = False
def asjson(self) -> str:
"""Return the config in string of JSON format."""
return json.dumps(asdict(self))
@staticmethod
def from_json(json_str: str) -> "TokenizerInfo":
"""Construct a config from JSON string."""
return TokenizerInfo(**json.loads(json_str))
Tokenizer Class
@tvm_ffi.register_object("mlc.Tokenizer")
class Tokenizer(Object):
"""The tokenizer class in MLC LLM."""
def __init__(self, tokenizer_path: str) -> None:
"""Create the tokenizer from tokenizer directory path."""
self.__init_handle_by_constructor__(
_ffi_api.Tokenizer,
tokenizer_path,
)
def encode(self, text: str) -> List[int]:
return list(_ffi_api.TokenizerEncode(self, text))
def encode_batch(self, texts: List[str]) -> List[List[int]]:
return list(_ffi_api.TokenizerEncodeBatch(self, texts))
def decode(self, token_ids: List[int]) -> str:
return _ffi_api.TokenizerDecode(self, tvm.runtime.ShapeTuple(token_ids))
@staticmethod
def detect_tokenizer_info(tokenizer_path: str) -> TokenizerInfo:
return TokenizerInfo.from_json(_ffi_api.DetectTokenizerInfo(tokenizer_path))
Import
from mlc_llm.tokenizers import Tokenizer
from mlc_llm.tokenizers.tokenizers import TokenizerInfo
I/O Contract
TokenizerInfo
Fields
| Name |
Type |
Default |
Description
|
| token_postproc_method |
Literal["byte_fallback", "byte_level"] |
"byte_fallback" |
The method for post-processing raw token bytes into readable strings. "byte_fallback" handles tokens like <0x1B> and replaces ▁ with space (LLaMA-2 style). "byte_level" inverts the GPT-2 bytes-to-unicode mapping (LLaMA-3/GPT-2 style).
|
| prepend_space_in_encode |
bool |
False |
Whether the tokenizer prepends a space character to the input text before encoding.
|
| strip_space_in_decode |
bool |
False |
Whether the tokenizer strips the leading space from the decoded output.
|
asjson() Method
| Returns |
Type |
Description
|
| json_string |
str |
JSON-serialized representation of all fields, suitable for passing across the FFI boundary.
|
from_json() Static Method
| Name |
Type |
Required |
Description
|
| json_str |
str |
Yes |
A JSON string containing the tokenizer info fields.
|
| Returns |
Type |
Description
|
| tokenizer_info |
TokenizerInfo |
A reconstructed TokenizerInfo instance.
|
Tokenizer
Constructor Inputs
| Name |
Type |
Required |
Description
|
| tokenizer_path |
str |
Yes |
Path to the tokenizer directory containing tokenizer model files (e.g., tokenizer.json, tokenizer.model).
|
encode() Method
| Name |
Type |
Required |
Description
|
| text |
str |
Yes |
The text string to encode into token IDs.
|
| Returns |
Type |
Description
|
| token_ids |
List[int] |
The list of integer token IDs corresponding to the input text.
|
encode_batch() Method
| Name |
Type |
Required |
Description
|
| texts |
List[str] |
Yes |
A list of text strings to encode in batch.
|
| Returns |
Type |
Description
|
| token_ids_batch |
List[List[int]] |
A list of token ID lists, one per input text string.
|
decode() Method
| Name |
Type |
Required |
Description
|
| token_ids |
List[int] |
Yes |
The token IDs to decode. Internally converted to ShapeTuple before passing to the C++ backend.
|
| Returns |
Type |
Description
|
| text |
str |
The decoded text string.
|
detect_tokenizer_info() Static Method
| Name |
Type |
Required |
Description
|
| tokenizer_path |
str |
Yes |
Path to the tokenizer directory to inspect.
|
| Returns |
Type |
Description
|
| tokenizer_info |
TokenizerInfo |
The auto-detected tokenizer metadata, including post-processing method and space handling flags.
|
Usage Examples
Basic Encode and Decode
from mlc_llm.tokenizers import Tokenizer
tokenizer = Tokenizer("/path/to/model/tokenizer")
# Encode text to token IDs
token_ids = tokenizer.encode("Hello, world!")
print(token_ids) # e.g., [15043, 29892, 3186, 29991]
# Decode token IDs back to text
text = tokenizer.decode(token_ids)
print(text) # "Hello, world!"
Batch Encoding
from mlc_llm.tokenizers import Tokenizer
tokenizer = Tokenizer("/path/to/model/tokenizer")
texts = ["Hello, world!", "How are you?", "MLC LLM is great."]
batch_ids = tokenizer.encode_batch(texts)
for text, ids in zip(texts, batch_ids):
print(f"{text} -> {ids}")
Detecting Tokenizer Info
from mlc_llm.tokenizers import Tokenizer
from mlc_llm.tokenizers.tokenizers import TokenizerInfo
# Auto-detect the tokenizer's post-processing configuration
info = Tokenizer.detect_tokenizer_info("/path/to/llama3/tokenizer")
print(info.token_postproc_method) # "byte_level"
print(info.prepend_space_in_encode) # False
print(info.strip_space_in_decode) # False
# Serialize to JSON for cross-boundary communication
json_str = info.asjson()
print(json_str) # '{"token_postproc_method": "byte_level", ...}'
# Reconstruct from JSON
info_copy = TokenizerInfo.from_json(json_str)
Using Tokenizer with TextStreamer
from mlc_llm.tokenizers import Tokenizer, TextStreamer
tokenizer = Tokenizer("/path/to/model/tokenizer")
streamer = TextStreamer(tokenizer)
# Feed tokens incrementally and stream decoded text
for token_id in generated_token_ids:
delta_text = streamer.put([token_id])
if delta_text:
print(delta_text, end="", flush=True)
print(streamer.finish())
Implementation Details
TVM FFI Bridge
The Tokenizer class is registered as a TVM runtime object under the name "mlc.Tokenizer". Its constructor and all methods delegate to C++ through named FFI functions:
| Python Method |
FFI Function |
Notes
|
Tokenizer.__init__() |
_ffi_api.Tokenizer |
Creates the C++ tokenizer handle from a directory path.
|
Tokenizer.encode() |
_ffi_api.TokenizerEncode |
Returns a TVM container that is converted to a Python list.
|
Tokenizer.encode_batch() |
_ffi_api.TokenizerEncodeBatch |
Returns a nested TVM container converted to a list of lists.
|
Tokenizer.decode() |
_ffi_api.TokenizerDecode |
Accepts a ShapeTuple (converted from the Python list) and returns a string.
|
Tokenizer.detect_tokenizer_info() |
_ffi_api.DetectTokenizerInfo |
Returns a JSON string that is deserialized into a TokenizerInfo instance.
|
Underlying C++ Backend
The C++ tokenizer implementation is backed by tokenizers-cpp, which provides bindings to both the HuggingFace tokenizers Rust library (for BPE/WordPiece/Unigram tokenizers defined in tokenizer.json) and Google's SentencePiece library (for .model files). The Python Tokenizer class is agnostic to which backend is used -- this is determined automatically based on the files present in the tokenizer directory.
Token Post-Processing Methods
The TokenizerInfo.token_postproc_method field describes how raw token strings are mapped back to their original byte representations:
"byte_fallback" (LLaMA-2, Mixtral): Tokens like <0x1B> are converted to their corresponding hex byte values. The special character ▁ (Unicode U+2581, "Lower One Eighth Block") is replaced with a regular space. This is the SentencePiece byte-fallback convention.
"byte_level" (LLaMA-3, GPT-2, Phi-2): The tokenizer uses a reversible bytes-to-unicode mapping originally introduced in the GPT-2 tokenizer. Characters like Ġ (Unicode U+0120) map back to the space byte (0x20). The decoding process inverts this mapping to recover the original byte sequence.
Related Pages
Implements Principle