

Implementation:Mlc ai Mlc llm Text Streamer Py

From Leeroopedia
Revision as of 15:52, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Mlc_ai_Mlc_llm_Text_Streamer_Py.md)


Knowledge Sources
Domains Deep_Learning, Model_Serving, Tokenization
Last Updated 2026-02-09 00:00 GMT

Overview

Streaming utilities for incrementally decoding tokens into validated UTF-8 text and detecting stop strings during generation in MLC-LLM.

Description

The streamer.py module defines two TVM-registered runtime objects -- TextStreamer and StopStrHandler -- that sit between the token-generation loop and the user-facing output. Both classes extend tvm.runtime.Object and delegate their core logic to C++ implementations via TVM FFI calls, making the Python layer a thin, type-safe wrapper.

TextStreamer accumulates delta tokens produced by the language model and decodes them into UTF-8-valid strings. Because a single token may correspond to only a partial multi-byte character, the streamer buffers tokens internally and releases text only when a complete, valid UTF-8 sequence can be formed. This avoids emitting garbled characters during incremental streaming. It exposes two operations:

  • put(delta_tokens) -- accepts new token IDs (as a Python list or ShapeTuple), buffers them, and returns the newly decodable UTF-8-valid text, which may be an empty string while a character is still incomplete.
  • finish() -- flushes any remaining buffered tokens and returns the final decoded string.

StopStrHandler monitors the generated token stream for occurrences of user-specified stop strings. Because a stop string may span multiple tokens, the handler buffers incoming tokens and only releases those that are guaranteed not to be part of a stop string. It exposes:

  • put(token_id) -- accepts a single token ID and returns a list of token IDs that are confirmed safe (not part of any stop string).
  • finish() -- returns any remaining cached token IDs once generation is complete.
  • stop_triggered -- a boolean property that indicates whether generation was halted because a stop string was fully matched.

Both classes are constructed with a Tokenizer instance (from tokenizers.py in the same package), which the underlying C++ implementation uses for decode operations.

Usage

Use TextStreamer in any token-by-token generation loop where you need to stream partial results back to the caller as valid UTF-8 text. Use StopStrHandler when the generation request includes stop strings that should terminate output. Both are typically used together inside the MLC-LLM serving engine's generation pipeline.

Code Reference

Source Location

  • Repository: MLC-LLM
  • File: python/mlc_llm/tokenizers/streamer.py (Lines 1-86)

TextStreamer Class

@tvm_ffi.register_object("mlc.TextStreamer")
class TextStreamer(Object):
    """The class that streams back validated utf-8 text strings
    that generated by tokenizer.
    """

    def __init__(self, tokenizer: Tokenizer) -> None:
        """Create the text streamer from tokenizer"""
        self.__init_handle_by_constructor__(
            _ffi_api.TextStreamer,
            tokenizer,
        )

    def put(self, delta_tokens: Union[List[int], ShapeTuple]) -> str:
        if isinstance(delta_tokens, list):
            delta_tokens = ShapeTuple(delta_tokens)
        return _ffi_api.TextStreamerPut(self, delta_tokens)

    def finish(self) -> str:
        return _ffi_api.TextStreamerFinish(self)

StopStrHandler Class

@tvm_ffi.register_object("mlc.StopStrHandler")
class StopStrHandler(Object):
    """The stop string handler in MLC LLM, which takes input delta tokens
    one at a time, and return the output delta token before stopping due to
    stop strings."""

    def __init__(self, stop_strs: List[str], tokenizer: Tokenizer) -> None:
        self.__init_handle_by_constructor__(
            _ffi_api.StopStrHandler,
            stop_strs,
            tokenizer,
        )

    def put(self, token_id: int) -> List[int]:
        return list(_ffi_api.StopStrHandlerPut(self, token_id))

    def finish(self) -> List[int]:
        return list(_ffi_api.StopStringHandlerFinish(self))

    @property
    def stop_triggered(self) -> bool:
        return _ffi_api.StopStrHandlerStopTriggered(self)

Import

from mlc_llm.tokenizers import TextStreamer, StopStrHandler

I/O Contract

TextStreamer

Constructor Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | Tokenizer | Yes | The MLC-LLM tokenizer instance used by the underlying C++ streamer for decode operations. |

put() Method

| Name | Type | Required | Description |
|---|---|---|---|
| delta_tokens | Union[List[int], ShapeTuple] | Yes | New token IDs to feed into the streamer. A Python list is automatically converted to ShapeTuple. |

| Returns | Type | Description |
|---|---|---|
| delta_text | str | The UTF-8-valid text that became decodable with this call. Tokens that end in an incomplete character stay buffered and are not reflected in the result. |

finish() Method

| Returns | Type | Description |
|---|---|---|
| remaining_text | str | The decoded string from any tokens that were still buffered internally. |

StopStrHandler

Constructor Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| stop_strs | List[str] | Yes | The list of stop strings that should trigger generation termination. |
| tokenizer | Tokenizer | Yes | The MLC-LLM tokenizer instance used for decoding token sequences to check against stop strings. |

put() Method

| Name | Type | Required | Description |
|---|---|---|---|
| token_id | int | Yes | A single new token ID from the generation output. |

| Returns | Type | Description |
|---|---|---|
| safe_tokens | List[int] | Token IDs that are confirmed not to be part of any stop string. May be empty if the handler is still buffering. |

finish() Method

| Returns | Type | Description |
|---|---|---|
| remaining_tokens | List[int] | Any token IDs still cached in the handler when generation completes. |

stop_triggered Property

| Returns | Type | Description |
|---|---|---|
| stop_triggered | bool | True if a stop string was fully matched during generation; False otherwise. |

Usage Examples

Streaming Text from Token IDs

from mlc_llm.tokenizers import Tokenizer, TextStreamer

tokenizer = Tokenizer("/path/to/tokenizer")
streamer = TextStreamer(tokenizer)

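# Simulating incremental token generation; `generated_token_batches` is a
# placeholder for an iterable of token-ID lists produced by your generation loop
for token_batch in generated_token_batches: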
    delta_text = streamer.put(token_batch)
    if delta_text:
        print(delta_text, end="", flush=True)

# Flush any remaining buffered text
final_text = streamer.finish()
print(final_text)

Stop String Detection During Generation

from mlc_llm.tokenizers import Tokenizer, StopStrHandler, TextStreamer

tokenizer = Tokenizer("/path/to/tokenizer")
stop_handler = StopStrHandler(["<|end|>", "\n\n"], tokenizer)
streamer = TextStreamer(tokenizer)

# `generated_tokens` is a placeholder for the token IDs produced by your generation loop
for token_id in generated_tokens:
    safe_tokens = stop_handler.put(token_id)
    if safe_tokens:
        delta_text = streamer.put(safe_tokens)
        print(delta_text, end="", flush=True)
    if stop_handler.stop_triggered:
        break

# Flush remaining tokens and text
remaining = stop_handler.finish()
if remaining:
    print(streamer.put(remaining), end="")
print(streamer.finish())

Implementation Details

TVM FFI Bridge

Both TextStreamer and StopStrHandler are registered as TVM runtime objects via @tvm_ffi.register_object with the names "mlc.TextStreamer" and "mlc.StopStrHandler" respectively. Their constructors use __init_handle_by_constructor__ to create the underlying C++ object handle through TVM's FFI mechanism. All method calls (put, finish, stop_triggered) delegate to named FFI functions:

| Python Method | FFI Function |
|---|---|
| TextStreamer.put() | _ffi_api.TextStreamerPut |
| TextStreamer.finish() | _ffi_api.TextStreamerFinish |
| StopStrHandler.put() | _ffi_api.StopStrHandlerPut |
| StopStrHandler.finish() | _ffi_api.StopStringHandlerFinish |
| StopStrHandler.stop_triggered | _ffi_api.StopStrHandlerStopTriggered |

UTF-8 Buffering Strategy

The TextStreamer internally buffers tokens that cannot yet be decoded into complete UTF-8 characters. For example, a multi-byte character (such as a CJK glyph or an emoji) may require two or more tokens to form a valid byte sequence. The streamer only releases text when the accumulated bytes form valid UTF-8, preventing garbled output in streaming scenarios.
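
The buffering idea can be illustrated without MLC-LLM. The sketch below is not the C++ implementation: it relies on Python's standard incremental UTF-8 decoder and treats each "token" as a chunk of raw bytes. `ByteChunkStreamer` and the byte chunks are hypothetical, and replacing a dangling partial character with U+FFFD at the end is a choice made for this sketch only.

```python
import codecs


class ByteChunkStreamer:
    """Toy stand-in for TextStreamer that works on raw byte chunks.

    Hypothetical illustration only: real tokens are integer IDs that the
    tokenizer decodes, and the real buffering happens in C++.
    """

    def __init__(self) -> None:
        # The incremental decoder keeps an incomplete multi-byte sequence
        # internally until the remaining bytes arrive.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def put(self, chunk: bytes) -> str:
        """Return only the text that is complete after adding `chunk`."""
        return self._decoder.decode(chunk, final=False)

    def finish(self) -> str:
        """Flush the buffer; a dangling partial character becomes U+FFFD."""
        return self._decoder.decode(b"", final=True)


# "é" is two bytes in UTF-8 (0xC3 0xA9); split it across two chunks.
streamer = ByteChunkStreamer()
pieces = [streamer.put(b"caf\xc3"), streamer.put(b"\xa9!"), streamer.finish()]
print(pieces)
```

The first call holds back the lone 0xC3 byte and returns only "caf"; the accented character appears in the second call, once its final byte has arrived.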

Stop String Matching

The StopStrHandler processes tokens one at a time. It maintains an internal buffer of tokens that might partially match one of the configured stop strings. Only tokens that have been conclusively determined to not be part of any stop string are returned by put(). When a stop string is fully matched, the stop_triggered property becomes True, signaling the generation loop to terminate.
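
The hold-back behaviour described above can be sketched on plain text. This is a simplified illustration under stated assumptions, not the C++ algorithm: the real handler receives token IDs and returns token IDs, whereas the hypothetical `StopTextHandler` below receives already-decoded text pieces and returns text.

```python
from typing import List


class StopTextHandler:
    """Toy stop-string matcher over text pieces (hypothetical, not the C++ logic)."""

    def __init__(self, stop_strs: List[str]) -> None:
        self._stop_strs = [s for s in stop_strs if s]
        self._pending = ""  # text that may still turn into a stop string
        self.stop_triggered = False

    def put(self, piece: str) -> str:
        """Add a decoded piece and return the text that is safe to emit."""
        if self.stop_triggered:
            return ""
        self._pending += piece
        # A full match ends generation; emit only the text before the earliest match.
        hits = [i for i in (self._pending.find(s) for s in self._stop_strs) if i != -1]
        if hits:
            self.stop_triggered = True
            safe, self._pending = self._pending[: min(hits)], ""
            return safe
        # Otherwise hold back the longest suffix that is a prefix of a stop string.
        hold = 0
        for s in self._stop_strs:
            for k in range(min(len(s) - 1, len(self._pending)), hold, -1):
                if self._pending.endswith(s[:k]):
                    hold = k
                    break
        cut = len(self._pending) - hold
        safe, self._pending = self._pending[:cut], self._pending[cut:]
        return safe

    def finish(self) -> str:
        """Generation ended without a match: release whatever was held back."""
        safe, self._pending = self._pending, ""
        return safe


handler = StopTextHandler(["END"])
emitted = []
for piece in ["Hello ", "E", "N", "D", " ignored"]:
    emitted.append(handler.put(piece))
    if handler.stop_triggered:
        break
print(emitted, handler.stop_triggered)
```

In this run "E" and "N" are withheld because they could still grow into "END"; once "D" arrives the match is complete, nothing from the stop string is emitted, and the loop exits. Had the next piece been "X" instead of "N", the withheld "E" would have been released together with it.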

Related Pages

Implements Principle
