Implementation:Mlc ai Mlc llm Text Streamer Py

Knowledge Sources	MLC-LLM
Domains	Deep_Learning, Model_Serving, Tokenization
Last Updated	2026-02-09 00:00 GMT

Overview

Streaming utilities for incrementally decoding tokens into validated UTF-8 text and detecting stop strings during generation in MLC-LLM.

Description

The streamer.py module defines two TVM-registered runtime objects -- TextStreamer and StopStrHandler -- that sit between the token-generation loop and the user-facing output. Both classes extend tvm.runtime.Object and delegate their core logic to C++ implementations via TVM FFI calls, making the Python layer a thin, type-safe wrapper.

TextStreamer accumulates delta tokens produced by the language model and decodes them into UTF-8-valid strings. Because a single token may correspond to only a partial multi-byte character, the streamer buffers tokens internally and releases text only when a complete, valid UTF-8 sequence can be formed. This avoids emitting garbled characters during incremental streaming. It exposes two operations:

put(delta_tokens) -- accepts new token IDs (as a Python list or ShapeTuple), buffers them, and returns the newly decoded text that is valid UTF-8. Tokens that end in an incomplete character stay buffered until a later call completes them.
finish() -- flushes any remaining buffered tokens and returns the final decoded string.

StopStrHandler monitors the generated token stream for occurrences of user-specified stop strings. Because a stop string may span multiple tokens, the handler buffers incoming tokens and only releases those that are guaranteed not to be part of a stop string. It exposes:

put(token_id) -- accepts a single token ID and returns a list of token IDs that are confirmed safe (not part of any stop string).
finish() -- returns any remaining cached token IDs once generation is complete.
stop_triggered -- a boolean property that indicates whether generation was halted because a stop string was fully matched.

Both classes are constructed with a Tokenizer instance (from tokenizers.py in the same package), which the underlying C++ implementation uses for decode operations.

Usage

Use TextStreamer in any token-by-token generation loop where you need to stream partial results back to the caller as valid UTF-8 text. Use StopStrHandler when the generation request includes stop strings that should terminate output. Both are typically used together inside the MLC-LLM serving engine's generation pipeline.

Code Reference

Source Location

Repository: MLC-LLM
File: python/mlc_llm/tokenizers/streamer.py (Lines 1-86)

TextStreamer Class

@tvm_ffi.register_object("mlc.TextStreamer")
class TextStreamer(Object):
    """The class that streams back validated utf-8 text strings
    that generated by tokenizer.
    """

    def __init__(self, tokenizer: Tokenizer) -> None:
        """Create the text streamer from tokenizer"""
        self.__init_handle_by_constructor__(
            _ffi_api.TextStreamer,
            tokenizer,
        )

    def put(self, delta_tokens: Union[List[int], ShapeTuple]) -> str:
        if isinstance(delta_tokens, list):
            delta_tokens = ShapeTuple(delta_tokens)
        return _ffi_api.TextStreamerPut(self, delta_tokens)

    def finish(self) -> str:
        return _ffi_api.TextStreamerFinish(self)

StopStrHandler Class

@tvm_ffi.register_object("mlc.StopStrHandler")
class StopStrHandler(Object):
    """The stop string handler in MLC LLM, which takes input delta tokens
    one at a time, and return the output delta token before stopping due to
    stop strings."""

    def __init__(self, stop_strs: List[str], tokenizer: Tokenizer) -> None:
        self.__init_handle_by_constructor__(
            _ffi_api.StopStrHandler,
            stop_strs,
            tokenizer,
        )

    def put(self, token_id: int) -> List[int]:
        return list(_ffi_api.StopStrHandlerPut(self, token_id))

    def finish(self) -> List[int]:
        return list(_ffi_api.StopStringHandlerFinish(self))

    @property
    def stop_triggered(self) -> bool:
        return _ffi_api.StopStrHandlerStopTriggered(self)

Import

from mlc_llm.tokenizers import TextStreamer, StopStrHandler

I/O Contract

TextStreamer

Constructor Inputs

Name	Type	Required	Description
tokenizer	`Tokenizer`	Yes	The MLC-LLM tokenizer instance used by the underlying C++ streamer for decode operations.

`put()` Method

Name	Type	Required	Description
delta_tokens	`Union[List[int], ShapeTuple]`	Yes	New token IDs to feed into the streamer. A Python list is automatically converted to `ShapeTuple`.

Returns	Type	Description
delta_text	`str`	The newly available UTF-8-valid text: the decoding of the tokens fed so far that has not been returned by an earlier call, excluding any buffered partial characters. May be empty.

`finish()` Method

Returns	Type	Description
remaining_text	`str`	The decoded string from any tokens that were still buffered internally.

StopStrHandler

Constructor Inputs

Name	Type	Required	Description
stop_strs	`List[str]`	Yes	The list of stop strings that should trigger generation termination.
tokenizer	`Tokenizer`	Yes	The MLC-LLM tokenizer instance used for decoding token sequences to check against stop strings.

`put()` Method

Name	Type	Required	Description
token_id	`int`	Yes	A single new token ID from the generation output.

Returns	Type	Description
safe_tokens	`List[int]`	Token IDs that are confirmed not to be part of any stop string. May be empty if the handler is still buffering.

`finish()` Method

Returns	Type	Description
remaining_tokens	`List[int]`	Any token IDs still cached in the handler when generation completes.

`stop_triggered` Property

Returns	Type	Description
stop_triggered	`bool`	`True` if a stop string was fully matched during generation; `False` otherwise.

Usage Examples

Streaming Text from Token IDs

from mlc_llm.tokenizers import Tokenizer, TextStreamer

tokenizer = Tokenizer("/path/to/tokenizer")
streamer = TextStreamer(tokenizer)

# generated_token_batches is a placeholder for the model's output:
# an iterable that yields one List[int] of new token IDs per step.
for token_batch in generated_token_batches:
    delta_text = streamer.put(token_batch)
    if delta_text:
        print(delta_text, end="", flush=True)

# Flush any remaining buffered text
final_text = streamer.finish()
print(final_text)

Stop String Detection During Generation

from mlc_llm.tokenizers import Tokenizer, StopStrHandler, TextStreamer

tokenizer = Tokenizer("/path/to/tokenizer")
stop_handler = StopStrHandler(["<|end|>", "\n\n"], tokenizer)
streamer = TextStreamer(tokenizer)

# generated_tokens is a placeholder: an iterable of token IDs from the model.
for token_id in generated_tokens:
    safe_tokens = stop_handler.put(token_id)
    if safe_tokens:
        delta_text = streamer.put(safe_tokens)
        print(delta_text, end="", flush=True)
    if stop_handler.stop_triggered:
        break

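# Flush the handler's cache only if no stop string matched; after a match,
# the cached tokens may belong to the stop string itself.
if not stop_handler.stop_triggered:
    remaining = stop_handler.finish()
    if remaining:
        print(streamer.put(remaining), end="")
# Flush any text still buffered in the streamer
print(streamer.finish())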
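The same loop can be packaged as a generator so that callers receive text chunks instead of printing them. The sketch below is one possible arrangement, not code from MLC-LLM: stream_until_stop, FakeStopHandler, FakeStreamer, and STOP are hypothetical names, and the two fake classes only imitate the put/finish/stop_triggered interface described on this page so that the example runs without the library. It skips the handler flush after a match, on the assumption that tokens still cached at that point belong to the stop string.

```python
from typing import Iterable, Iterator, List

STOP = -1  # hypothetical token ID that plays the role of a stop string


class FakeStopHandler:
    """Stand-in for StopStrHandler with the same three members."""

    def __init__(self) -> None:
        self.stop_triggered = False

    def put(self, token_id: int) -> List[int]:
        if token_id == STOP:
            self.stop_triggered = True
            return []
        return [token_id]

    def finish(self) -> List[int]:
        return []


class FakeStreamer:
    """Stand-in for TextStreamer: each token ID decodes to chr(token_id)."""

    def put(self, delta_tokens: List[int]) -> str:
        return "".join(chr(t) for t in delta_tokens)

    def finish(self) -> str:
        return ""


def stream_until_stop(token_ids: Iterable[int], stop_handler, streamer) -> Iterator[str]:
    """Yield text chunks until the tokens run out or a stop string matches."""
    for token_id in token_ids:
        safe_tokens = stop_handler.put(token_id)
        if safe_tokens:
            text = streamer.put(safe_tokens)
            if text:
                yield text
        if stop_handler.stop_triggered:
            break
    else:
        # Loop ended without a match: held-back tokens are ordinary output.
        remaining = stop_handler.finish()
        if remaining:
            text = streamer.put(remaining)
            if text:
                yield text
    # Always flush text that the streamer is still buffering.
    tail = streamer.finish()
    if tail:
        yield tail


chunks = list(stream_until_stop([72, 105, STOP, 33], FakeStopHandler(), FakeStreamer()))
print(chunks)
```

With the real classes, the two fakes would be replaced by StopStrHandler and TextStreamer instances built from the same Tokenizer; the generator itself relies only on the interface.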

Implementation Details

TVM FFI Bridge

Both TextStreamer and StopStrHandler are registered as TVM runtime objects via @tvm_ffi.register_object with the names "mlc.TextStreamer" and "mlc.StopStrHandler" respectively. Their constructors use __init_handle_by_constructor__ to create the underlying C++ object handle through TVM's FFI mechanism. All method calls (put, finish, stop_triggered) delegate to named FFI functions:

Python Method	FFI Function
`TextStreamer.put()`	`_ffi_api.TextStreamerPut`
`TextStreamer.finish()`	`_ffi_api.TextStreamerFinish`
`StopStrHandler.put()`	`_ffi_api.StopStrHandlerPut`
`StopStrHandler.finish()`	`_ffi_api.StopStringHandlerFinish`
`StopStrHandler.stop_triggered`	`_ffi_api.StopStrHandlerStopTriggered`

UTF-8 Buffering Strategy

The TextStreamer internally buffers tokens that cannot yet be decoded into complete UTF-8 characters. For example, a multi-byte character (such as a CJK glyph or an emoji) may require two or more tokens to form a valid byte sequence. The streamer only releases text when the accumulated bytes form valid UTF-8, preventing garbled output in streaming scenarios.
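The buffering behaviour can be reproduced in pure Python to see what "releasing only valid UTF-8" means in practice. This is a toy model, not the C++ algorithm: ToyTextStreamer and its byte-level vocab are hypothetical, and the buffering is delegated to the standard library's incremental UTF-8 decoder rather than to a tokenizer.

```python
import codecs
from typing import Dict, List


class ToyTextStreamer:
    """Pure-Python stand-in that mimics the put()/finish() contract."""

    def __init__(self, vocab: Dict[int, bytes]) -> None:
        # Hypothetical vocabulary: token ID -> raw bytes of that token.
        self._vocab = vocab
        # The incremental decoder keeps an incomplete trailing byte
        # sequence in its own state until the missing bytes arrive.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def put(self, delta_tokens: List[int]) -> str:
        data = b"".join(self._vocab[t] for t in delta_tokens)
        return self._decoder.decode(data)

    def finish(self) -> str:
        # final=True flushes leftovers; a dangling partial character
        # becomes U+FFFD because of errors="replace".
        return self._decoder.decode(b"", final=True)


# "é" is the two bytes 0xC3 0xA9 in UTF-8, split here across tokens 1 and 2.
vocab = {0: b"caf", 1: b"\xc3", 2: b"\xa9", 3: b"!"}
streamer = ToyTextStreamer(vocab)
pieces = [streamer.put([t]) for t in [0, 1, 2, 3]]
tail = streamer.finish()
print(pieces, repr(tail))  # ['caf', '', 'é', '!'] ''
```

The empty string returned for token 1 is the point of the exercise: the first byte of "é" is held back, and the full character appears only once token 2 completes it.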

Stop String Matching

The StopStrHandler processes tokens one at a time. It maintains an internal buffer of tokens that might partially match one of the configured stop strings. Only tokens that have been conclusively determined to not be part of any stop string are returned by put(). When a stop string is fully matched, the stop_triggered property becomes True, signaling the generation loop to terminate.
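A simplified model of this hold-back logic is sketched below. It is an illustration under stated assumptions rather than the MLC-LLM implementation: ToyStopStrHandler and its string vocab are hypothetical, the pending text is rescanned on every call for clarity (a production matcher would likely be incremental), and raising on input after a match is a design choice of this sketch.

```python
from typing import Dict, List


class ToyStopStrHandler:
    """Pure-Python stand-in for the put()/finish()/stop_triggered contract."""

    def __init__(self, stop_strs: List[str], vocab: Dict[int, str]) -> None:
        self._stop_strs = [s for s in stop_strs if s]
        self._vocab = vocab  # hypothetical: token ID -> token text
        self._pending: List[int] = []
        self.stop_triggered = False

    def _release(self, safe_chars: int) -> List[int]:
        """Pop leading tokens that end within the first safe_chars characters."""
        released: List[int] = []
        used = 0
        while self._pending:
            length = len(self._vocab[self._pending[0]])
            if used + length > safe_chars:
                break
            used += length
            released.append(self._pending.pop(0))
        return released

    def put(self, token_id: int) -> List[int]:
        if self.stop_triggered:
            raise RuntimeError("cannot put tokens after a stop string matched")
        self._pending.append(token_id)
        text = "".join(self._vocab[t] for t in self._pending)
        hits = [text.find(s) for s in self._stop_strs if s in text]
        if hits:
            # Full match: release only tokens that end before the stop string.
            self.stop_triggered = True
            return self._release(min(hits))
        # Longest suffix of the pending text that is a prefix of a stop string.
        hold = 0
        for s in self._stop_strs:
            for k in range(min(len(s) - 1, len(text)), hold, -1):
                if text.endswith(s[:k]):
                    hold = k
                    break
        return self._release(len(text) - hold)

    def finish(self) -> List[int]:
        remaining, self._pending = self._pending, []
        return remaining


vocab = {0: "Hello", 1: " wor", 2: "ld", 3: "<|", 4: "end", 5: "|>", 6: "<", 7: "b>"}
handler = ToyStopStrHandler(["<|end|>"], vocab)
released = [handler.put(t) for t in [0, 1, 2, 3, 4, 5]]
print(released, handler.stop_triggered)
```

Tokens 3, 4, and 5 are never released because together they spell the stop string. Feeding tokens 6 and 7 to a fresh handler shows the opposite case: "<" is held for one step as a possible prefix, then both tokens are released once "<b>" rules out a match.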

Related Pages

Implements Principle

Principle:Mlc_ai_Mlc_llm_Streaming_Response_Processing
