Implementation:Mlc ai Mlc llm Tokenizer Py

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Serving, Tokenization
Last Updated 2026-02-09 00:00 GMT

Overview

Python tokenizer interface and metadata configuration for encoding text to token IDs and decoding token IDs back to text in MLC-LLM.

Description

The tokenizers.py module provides the core tokenization layer for MLC-LLM. It contains two components: the TokenizerInfo dataclass for describing tokenizer post-processing behavior, and the Tokenizer runtime class for performing encode/decode operations.

TokenizerInfo is a pure-Python dataclass that captures metadata about how a tokenizer transforms raw token bytes back into readable strings. Different LLM families use different byte-encoding strategies in their tokenizers, and this class records which strategy is in effect:

  • token_postproc_method -- either "byte_fallback" (used by LLaMA-2, Mixtral, and similar models that encode special bytes as <0xHH> tokens and use the ▁ character, U+2581, for spaces) or "byte_level" (used by LLaMA-3, GPT-2, Phi-2, and similar models that use the GPT-2 bytes-to-unicode mapping, where characters like Ġ represent spaces).
  • prepend_space_in_encode -- whether the tokenizer prepends a space character before encoding input text.
  • strip_space_in_decode -- whether the tokenizer strips the leading space character when decoding.

The class provides asjson() for JSON serialization and from_json() for deserialization, enabling this metadata to be passed across the Python/C++ boundary.
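The JSON round trip described above can be sketched without installing MLC-LLM. The class below is a hypothetical stand-in named TokenizerInfoSketch (not part of the library) that mirrors the three fields and the two methods listed on this page; it only illustrates the serialization pattern.

```python
import json
from dataclasses import asdict, dataclass
from typing import Literal


@dataclass
class TokenizerInfoSketch:
    """Hypothetical stand-in mirroring the fields documented for TokenizerInfo."""

    token_postproc_method: Literal["byte_fallback", "byte_level"] = "byte_fallback"
    prepend_space_in_encode: bool = False
    strip_space_in_decode: bool = False

    def asjson(self) -> str:
        # asdict() turns the dataclass into a plain dict that json can dump.
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(json_str: str) -> "TokenizerInfoSketch":
        # Keys in the JSON object become keyword arguments of the constructor.
        return TokenizerInfoSketch(**json.loads(json_str))


info = TokenizerInfoSketch(token_postproc_method="byte_level")
payload = info.asjson()
restored = TokenizerInfoSketch.from_json(payload)
print(payload)
print(restored == info)  # True
```

Because from_json() unpacks the parsed object straight into the constructor, a JSON string carrying a key that is not a dataclass field raises TypeError, and a missing key falls back to the field default.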

Tokenizer is a TVM-registered runtime object ("mlc.Tokenizer") that wraps the C++ tokenizer implementation backed by tokenizers-cpp, which itself binds the HuggingFace tokenizers library and SentencePiece. The Python class is a thin FFI wrapper exposing:

  • encode(text) -- encodes a single text string into a list of token IDs.
  • encode_batch(texts) -- encodes multiple text strings into a list of token ID lists.
  • decode(token_ids) -- decodes a list of token IDs back into a text string.
  • detect_tokenizer_info(tokenizer_path) -- a static method that auto-detects the TokenizerInfo configuration from a tokenizer directory.

Usage

Use Tokenizer whenever you need to convert between text and token IDs in MLC-LLM. It is a dependency for TextStreamer and StopStrHandler (both in streamer.py). The TokenizerInfo dataclass is used by downstream components that need to understand post-processing semantics, such as grammar-guided generation and structured decoding.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/tokenizers/tokenizers.py (Lines 1-130)

TokenizerInfo Dataclass

import json
from dataclasses import asdict, dataclass
from typing import Literal

@dataclass
class TokenizerInfo:
    token_postproc_method: Literal["byte_fallback", "byte_level"] = "byte_fallback"
    prepend_space_in_encode: bool = False
    strip_space_in_decode: bool = False

    def asjson(self) -> str:
        """Return the config in string of JSON format."""
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(json_str: str) -> "TokenizerInfo":
        """Construct a config from JSON string."""
        return TokenizerInfo(**json.loads(json_str))

Tokenizer Class

@tvm_ffi.register_object("mlc.Tokenizer")
class Tokenizer(Object):
    """The tokenizer class in MLC LLM."""

    def __init__(self, tokenizer_path: str) -> None:
        """Create the tokenizer from tokenizer directory path."""
        self.__init_handle_by_constructor__(
            _ffi_api.Tokenizer,
            tokenizer_path,
        )

    def encode(self, text: str) -> List[int]:
        return list(_ffi_api.TokenizerEncode(self, text))

    def encode_batch(self, texts: List[str]) -> List[List[int]]:
        return list(_ffi_api.TokenizerEncodeBatch(self, texts))

    def decode(self, token_ids: List[int]) -> str:
        return _ffi_api.TokenizerDecode(self, tvm.runtime.ShapeTuple(token_ids))

    @staticmethod
    def detect_tokenizer_info(tokenizer_path: str) -> TokenizerInfo:
        return TokenizerInfo.from_json(_ffi_api.DetectTokenizerInfo(tokenizer_path))

Import

from mlc_llm.tokenizers import Tokenizer
from mlc_llm.tokenizers.tokenizers import TokenizerInfo

I/O Contract

TokenizerInfo

Fields

Name | Type | Default | Description
token_postproc_method | Literal["byte_fallback", "byte_level"] | "byte_fallback" | The method for post-processing raw token strings into readable text. "byte_fallback" converts tokens like <0x1B> to the byte they name and replaces ▁ (U+2581) with a space (LLaMA-2 style). "byte_level" inverts the GPT-2 bytes-to-unicode mapping (LLaMA-3/GPT-2 style).
prepend_space_in_encode | bool | False | Whether the tokenizer prepends a space character to the input text before encoding.
strip_space_in_decode | bool | False | Whether the tokenizer strips the leading space from the decoded output.

asjson() Method

Returns | Type | Description
json_string | str | JSON-serialized representation of all fields, suitable for passing across the FFI boundary.

from_json() Static Method

Name | Type | Required | Description
json_str | str | Yes | A JSON string containing the tokenizer info fields.

Returns | Type | Description
tokenizer_info | TokenizerInfo | A reconstructed TokenizerInfo instance.

Tokenizer

Constructor Inputs

Name | Type | Required | Description
tokenizer_path | str | Yes | Path to the tokenizer directory containing tokenizer model files (e.g., tokenizer.json, tokenizer.model).

encode() Method

Name | Type | Required | Description
text | str | Yes | The text string to encode into token IDs.

Returns | Type | Description
token_ids | List[int] | The list of integer token IDs corresponding to the input text.

encode_batch() Method

Name | Type | Required | Description
texts | List[str] | Yes | A list of text strings to encode in batch.

Returns | Type | Description
token_ids_batch | List[List[int]] | A list of token ID lists, one per input text string.

decode() Method

Name | Type | Required | Description
token_ids | List[int] | Yes | The token IDs to decode. Internally converted to ShapeTuple before passing to the C++ backend.

Returns | Type | Description
text | str | The decoded text string.

detect_tokenizer_info() Static Method

Name | Type | Required | Description
tokenizer_path | str | Yes | Path to the tokenizer directory to inspect.

Returns | Type | Description
tokenizer_info | TokenizerInfo | The auto-detected tokenizer metadata, including post-processing method and space handling flags.

Usage Examples

Basic Encode and Decode

from mlc_llm.tokenizers import Tokenizer

tokenizer = Tokenizer("/path/to/model/tokenizer")

# Encode text to token IDs
token_ids = tokenizer.encode("Hello, world!")
print(token_ids)  # e.g., [15043, 29892, 3186, 29991]

# Decode token IDs back to text
text = tokenizer.decode(token_ids)
print(text)  # "Hello, world!"

Batch Encoding

from mlc_llm.tokenizers import Tokenizer

tokenizer = Tokenizer("/path/to/model/tokenizer")

texts = ["Hello, world!", "How are you?", "MLC LLM is great."]
batch_ids = tokenizer.encode_batch(texts)

for text, ids in zip(texts, batch_ids):
    print(f"{text} -> {ids}")

Detecting Tokenizer Info

from mlc_llm.tokenizers import Tokenizer
from mlc_llm.tokenizers.tokenizers import TokenizerInfo

# Auto-detect the tokenizer's post-processing configuration
info = Tokenizer.detect_tokenizer_info("/path/to/llama3/tokenizer")
print(info.token_postproc_method)   # "byte_level"
print(info.prepend_space_in_encode) # False
print(info.strip_space_in_decode)   # False

# Serialize to JSON for cross-boundary communication
json_str = info.asjson()
print(json_str)  # '{"token_postproc_method": "byte_level", ...}'

# Reconstruct from JSON
info_copy = TokenizerInfo.from_json(json_str)

Using Tokenizer with TextStreamer

from mlc_llm.tokenizers import Tokenizer, TextStreamer

tokenizer = Tokenizer("/path/to/model/tokenizer")
streamer = TextStreamer(tokenizer)

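# Stand-in for token IDs produced by a model, one generation step at a time
generated_token_ids = tokenizer.encode("Hello, world!")

# Feed tokens incrementally and stream decoded text
for token_id in generated_token_ids: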
    delta_text = streamer.put([token_id])
    if delta_text:
        print(delta_text, end="", flush=True)
print(streamer.finish())
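One reason incremental decoding can return an empty delta is that a single token may carry only part of a multi-byte UTF-8 character. The following is a minimal sketch of that buffering idea using Python's standard incremental decoder. It is not the TextStreamer implementation, and the helper name stream_utf8 is hypothetical.

```python
import codecs
from typing import Iterable, Iterator


def stream_utf8(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text as soon as complete UTF-8 characters are available."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        # Incomplete trailing bytes stay buffered inside the decoder.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is left once the stream ends.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "é" is the two-byte sequence C3 A9, split here across two chunks.
pieces = [b"caf", b"\xc3", b"\xa9", b"!"]
print(list(stream_utf8(pieces)))  # ['caf', 'é', '!']
```

The second chunk produces no output on its own, which mirrors the `if delta_text:` guard in the example above.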

Implementation Details

TVM FFI Bridge

The Tokenizer class is registered as a TVM runtime object under the name "mlc.Tokenizer". Its constructor and all methods delegate to C++ through named FFI functions:

Python Method | FFI Function | Notes
Tokenizer.__init__() | _ffi_api.Tokenizer | Creates the C++ tokenizer handle from a directory path.
Tokenizer.encode() | _ffi_api.TokenizerEncode | Returns a TVM container that is converted to a Python list.
Tokenizer.encode_batch() | _ffi_api.TokenizerEncodeBatch | Returns a nested TVM container converted to a list of lists.
Tokenizer.decode() | _ffi_api.TokenizerDecode | Accepts a ShapeTuple (converted from the Python list) and returns a string.
Tokenizer.detect_tokenizer_info() | _ffi_api.DetectTokenizerInfo | Returns a JSON string that is deserialized into a TokenizerInfo instance.

Underlying C++ Backend

The C++ tokenizer implementation is backed by tokenizers-cpp, which provides bindings to both the HuggingFace tokenizers Rust library (for BPE/WordPiece/Unigram tokenizers defined in tokenizer.json) and Google's SentencePiece library (for .model files). The Python Tokenizer class is agnostic to which backend is used -- this is determined automatically based on the files present in the tokenizer directory.

Token Post-Processing Methods

The TokenizerInfo.token_postproc_method field describes how raw token strings are mapped back to their original byte representations:

  • "byte_fallback" (LLaMA-2, Mixtral): Tokens of the form <0xHH> are converted to the single byte they name (for example, <0x1B> becomes byte 0x1B). The special character ▁ (Unicode U+2581, "Lower One Eighth Block") is replaced with a regular space. This is the SentencePiece byte-fallback convention.
  • "byte_level" (LLaMA-3, GPT-2, Phi-2): The tokenizer uses a reversible bytes-to-unicode mapping originally introduced in the GPT-2 tokenizer. Characters like Ġ (Unicode U+0120) map back to the space byte (0x20). The decoding process inverts this mapping to recover the original byte sequence.
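The two conventions above can be sketched in pure Python. This is an illustration under stated assumptions, not the C++ code MLC-LLM runs: the function names are hypothetical, the byte-level table is rebuilt the way the original GPT-2 tokenizer builds it, and the byte-fallback branch only handles the <0xHH> pattern and the ▁ replacement described in this section.

```python
import re
from typing import Dict


def gpt2_bytes_to_unicode() -> Dict[int, str]:
    """Rebuild the GPT-2 table mapping each byte to a printable character."""
    # Printable bytes map to the character with the same code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (controls, space, ...) are shifted to U+0100 and up.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


UNICODE_TO_BYTE = {ch: b for b, ch in gpt2_bytes_to_unicode().items()}
BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def postproc_byte_level(token: str) -> bytes:
    """Invert the bytes-to-unicode mapping, one character per byte."""
    return bytes(UNICODE_TO_BYTE[ch] for ch in token)


def postproc_byte_fallback(token: str) -> bytes:
    """Turn <0xHH> into one raw byte; otherwise replace U+2581 with a space."""
    match = BYTE_TOKEN.fullmatch(token)
    if match:
        return bytes([int(match.group(1), 16)])
    return token.replace("\u2581", " ").encode("utf-8")


print(postproc_byte_level("Ġworld"))      # b' world'
print(postproc_byte_fallback("<0x1B>"))   # b'\x1b'
print(postproc_byte_fallback("▁Hello"))   # b' Hello'
```

Both helpers return bytes rather than str because a single token may hold only part of a multi-byte character; the bytes of consecutive tokens have to be concatenated before UTF-8 decoding.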

Related Pages

Implements Principle
