Implementation:Mlc ai Mlc llm Tokenizer Py

Knowledge Sources	MLC-LLM
Domains	Deep_Learning, Model_Serving, Tokenization
Last Updated	2026-02-09 00:00 GMT

Overview

Python tokenizer interface and metadata configuration for encoding text to token IDs and decoding token IDs back to text in MLC-LLM.

Description

The tokenizers.py module provides the core tokenization layer for MLC-LLM. It contains two components: the TokenizerInfo dataclass for describing tokenizer post-processing behavior, and the Tokenizer runtime class for performing encode/decode operations.

TokenizerInfo is a pure-Python dataclass that captures metadata about how a tokenizer transforms raw token bytes back into readable strings. Different LLM families use different byte-encoding strategies in their tokenizers, and this class records which strategy is in effect:

token_postproc_method -- either "byte_fallback" (used by LLaMA-2, Mixtral, and similar models that encode special bytes as <0xHH> tokens and use the ▁ character for spaces) or "byte_level" (used by LLaMA-3, GPT-2, Phi-2, and similar models that use the GPT-2 bytes-to-unicode mapping where characters like Ġ represent spaces).
prepend_space_in_encode -- whether the tokenizer prepends a space character before encoding input text.
strip_space_in_decode -- whether the tokenizer strips the leading space character when decoding.

The class provides asjson() for JSON serialization and from_json() for deserialization, enabling this metadata to be passed across the Python/C++ boundary.

Tokenizer is a TVM-registered runtime object ("mlc.Tokenizer") that wraps the C++ tokenizer implementation backed by tokenizers-cpp, which itself binds the HuggingFace tokenizers library and SentencePiece. The Python class is a thin FFI wrapper exposing:

encode(text) -- encodes a single text string into a list of token IDs.
encode_batch(texts) -- encodes multiple text strings into a list of token ID lists.
decode(token_ids) -- decodes a list of token IDs back into a text string.
detect_tokenizer_info(tokenizer_path) -- a static method that auto-detects the TokenizerInfo configuration from a tokenizer directory.

Usage

Use Tokenizer whenever you need to convert between text and token IDs in MLC-LLM. It is a dependency for TextStreamer and StopStrHandler (both in streamer.py). The TokenizerInfo dataclass is used by downstream components that need to understand post-processing semantics, such as grammar-guided generation and structured decoding.

Code Reference

Source Location

Repository: MLC-LLM
File: python/mlc_llm/tokenizers/tokenizers.py (Lines 1-130)

TokenizerInfo Dataclass

@dataclass
class TokenizerInfo:
    token_postproc_method: Literal["byte_fallback", "byte_level"] = "byte_fallback"
    prepend_space_in_encode: bool = False
    strip_space_in_decode: bool = False

    def asjson(self) -> str:
        """Return the config in string of JSON format."""
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(json_str: str) -> "TokenizerInfo":
        """Construct a config from JSON string."""
        return TokenizerInfo(**json.loads(json_str))

Tokenizer Class

@tvm_ffi.register_object("mlc.Tokenizer")
class Tokenizer(Object):
    """The tokenizer class in MLC LLM."""

    def __init__(self, tokenizer_path: str) -> None:
        """Create the tokenizer from tokenizer directory path."""
        self.__init_handle_by_constructor__(
            _ffi_api.Tokenizer,
            tokenizer_path,
        )

    def encode(self, text: str) -> List[int]:
        return list(_ffi_api.TokenizerEncode(self, text))

    def encode_batch(self, texts: List[str]) -> List[List[int]]:
        return list(_ffi_api.TokenizerEncodeBatch(self, texts))

    def decode(self, token_ids: List[int]) -> str:
        return _ffi_api.TokenizerDecode(self, tvm.runtime.ShapeTuple(token_ids))

    @staticmethod
    def detect_tokenizer_info(tokenizer_path: str) -> TokenizerInfo:
        return TokenizerInfo.from_json(_ffi_api.DetectTokenizerInfo(tokenizer_path))

Import

from mlc_llm.tokenizers import Tokenizer
from mlc_llm.tokenizers.tokenizers import TokenizerInfo

I/O Contract

TokenizerInfo

Fields

Name	Type	Default	Description
token_postproc_method	`Literal["byte_fallback", "byte_level"]`	`"byte_fallback"`	The method for post-processing raw token bytes into readable strings. `"byte_fallback"` handles tokens like `<0x1B>` and replaces `▁` with space (LLaMA-2 style). `"byte_level"` inverts the GPT-2 bytes-to-unicode mapping (LLaMA-3/GPT-2 style).
prepend_space_in_encode	`bool`	`False`	Whether the tokenizer prepends a space character to the input text before encoding.
strip_space_in_decode	`bool`	`False`	Whether the tokenizer strips the leading space from the decoded output.

`asjson()` Method

Returns	Type	Description
json_string	`str`	JSON-serialized representation of all fields, suitable for passing across the FFI boundary.

`from_json()` Static Method

Name	Type	Required	Description
json_str	`str`	Yes	A JSON string containing the tokenizer info fields.

Returns	Type	Description
tokenizer_info	`TokenizerInfo`	A reconstructed `TokenizerInfo` instance.

Tokenizer

Constructor Inputs

Name	Type	Required	Description
tokenizer_path	`str`	Yes	Path to the tokenizer directory containing tokenizer model files (e.g., `tokenizer.json`, `tokenizer.model`).

`encode()` Method

Name	Type	Required	Description
text	`str`	Yes	The text string to encode into token IDs.

Returns	Type	Description
token_ids	`List[int]`	The list of integer token IDs corresponding to the input text.

`encode_batch()` Method

Name	Type	Required	Description
texts	`List[str]`	Yes	A list of text strings to encode in batch.

Returns	Type	Description
token_ids_batch	`List[List[int]]`	A list of token ID lists, one per input text string.

`decode()` Method

Name	Type	Required	Description
token_ids	`List[int]`	Yes	The token IDs to decode. Internally converted to `ShapeTuple` before passing to the C++ backend.

Returns	Type	Description
text	`str`	The decoded text string.

`detect_tokenizer_info()` Static Method

Name	Type	Required	Description
tokenizer_path	`str`	Yes	Path to the tokenizer directory to inspect.

Returns	Type	Description
tokenizer_info	`TokenizerInfo`	The auto-detected tokenizer metadata, including post-processing method and space handling flags.

Usage Examples

Basic Encode and Decode

from mlc_llm.tokenizers import Tokenizer

tokenizer = Tokenizer("/path/to/model/tokenizer")

# Encode text to token IDs
token_ids = tokenizer.encode("Hello, world!")
print(token_ids)  # e.g., [15043, 29892, 3186, 29991]

# Decode token IDs back to text
text = tokenizer.decode(token_ids)
print(text)  # "Hello, world!"

Batch Encoding

from mlc_llm.tokenizers import Tokenizer

tokenizer = Tokenizer("/path/to/model/tokenizer")

texts = ["Hello, world!", "How are you?", "MLC LLM is great."]
batch_ids = tokenizer.encode_batch(texts)

for text, ids in zip(texts, batch_ids):
    print(f"{text} -> {ids}")

Detecting Tokenizer Info

from mlc_llm.tokenizers import Tokenizer
from mlc_llm.tokenizers.tokenizers import TokenizerInfo

# Auto-detect the tokenizer's post-processing configuration
info = Tokenizer.detect_tokenizer_info("/path/to/llama3/tokenizer")
print(info.token_postproc_method)   # "byte_level"
print(info.prepend_space_in_encode) # False
print(info.strip_space_in_decode)   # False

# Serialize to JSON for cross-boundary communication
json_str = info.asjson()
print(json_str)  # '{"token_postproc_method": "byte_level", ...}'

# Reconstruct from JSON
info_copy = TokenizerInfo.from_json(json_str)

Using Tokenizer with TextStreamer

from mlc_llm.tokenizers import Tokenizer, TextStreamer

tokenizer = Tokenizer("/path/to/model/tokenizer")
streamer = TextStreamer(tokenizer)

# Feed tokens incrementally and stream decoded text
for token_id in generated_token_ids:
    delta_text = streamer.put([token_id])
    if delta_text:
        print(delta_text, end="", flush=True)
print(streamer.finish())

Implementation Details

TVM FFI Bridge

The Tokenizer class is registered as a TVM runtime object under the name "mlc.Tokenizer". Its constructor and all methods delegate to C++ through named FFI functions:

Python Method	FFI Function	Notes
`Tokenizer.__init__()`	`_ffi_api.Tokenizer`	Creates the C++ tokenizer handle from a directory path.
`Tokenizer.encode()`	`_ffi_api.TokenizerEncode`	Returns a TVM container that is converted to a Python list.
`Tokenizer.encode_batch()`	`_ffi_api.TokenizerEncodeBatch`	Returns a nested TVM container converted to a list of lists.
`Tokenizer.decode()`	`_ffi_api.TokenizerDecode`	Accepts a `ShapeTuple` (converted from the Python list) and returns a string.
`Tokenizer.detect_tokenizer_info()`	`_ffi_api.DetectTokenizerInfo`	Returns a JSON string that is deserialized into a `TokenizerInfo` instance.

Underlying C++ Backend

The C++ tokenizer implementation is backed by tokenizers-cpp, which provides bindings to both the HuggingFace tokenizers Rust library (for BPE/WordPiece/Unigram tokenizers defined in tokenizer.json) and Google's SentencePiece library (for .model files). The Python Tokenizer class is agnostic to which backend is used -- this is determined automatically based on the files present in the tokenizer directory.

Token Post-Processing Methods

The TokenizerInfo.token_postproc_method field describes how raw token strings are mapped back to their original byte representations:

"byte_fallback" (LLaMA-2, Mixtral): Tokens like <0x1B> are converted to their corresponding hex byte values. The special character ▁ (Unicode U+2581, "Lower One Eighth Block") is replaced with a regular space. This is the SentencePiece byte-fallback convention.

"byte_level" (LLaMA-3, GPT-2, Phi-2): The tokenizer uses a reversible bytes-to-unicode mapping originally introduced in the GPT-2 tokenizer. Characters like Ġ (Unicode U+0120) map back to the space byte (0x20). The decoding process inverts this mapping to recover the original byte sequence.

Related Pages

Implements Principle

Principle:Mlc_ai_Mlc_llm_MLC_Configuration_Generation

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment