
Principle:Mlc ai Web llm Cross Thread Streaming

From Leeroopedia


Overview

Cross-Thread Streaming is the technique for streaming LLM-generated tokens across Web Worker thread boundaries using async generators and message-based chunk delivery. It bridges the AsyncGenerator pattern across the postMessage boundary, enabling the standard for await (const chunk of stream) idiom on the main thread while the actual token generation happens inside the worker.

Description

Streaming in web-llm follows the OpenAI streaming protocol: when stream: true is set in a request, the API returns an AsyncIterable of ChatCompletionChunk (or Completion) objects. Each chunk contains the incremental delta text produced by a single decode step.
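The shape of those chunks can be sketched in TypeScript as follows. This is a simplified sketch of the OpenAI wire format consumed by the examples below; web-llm's real ChatCompletionChunk type carries additional fields, and the accumulate helper is purely illustrative:

```typescript
// Minimal sketch of the OpenAI-style streaming chunk shape.
// Field names follow the OpenAI streaming protocol; web-llm's actual
// ChatCompletionChunk type includes more fields (usage, logprobs, etc.).
interface ChunkChoiceDelta {
  role?: string;    // present only on the first chunk
  content?: string; // incremental text from one decode step
}

interface ChunkChoice {
  index: number;
  delta: ChunkChoiceDelta;
  finish_reason: string | null; // non-null on the final content chunk
}

interface ChatCompletionChunkSketch {
  id: string;
  object: "chat.completion.chunk";
  model: string;
  choices: ChunkChoice[];
}

// Assembling the full reply is a fold over the per-chunk deltas.
function accumulate(chunks: ChatCompletionChunkSketch[]): string {
  return chunks.map((c) => c.choices[0]?.delta?.content ?? "").join("");
}
```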

In a direct MLCEngine, streaming is implemented as an async *asyncGenerate() method that yields chunks after each prefill/decode step. However, when the engine runs in a Web Worker, the AsyncGenerator cannot be directly transferred across the thread boundary (generators are not serializable).

The cross-thread streaming solution uses a pull-based protocol with two collaborating generators:

Worker-Side Generator (Real)

The worker maintains the actual AsyncGenerator from MLCEngine.chatCompletion() (or completion()). This generator is stored in the handler's loadedModelIdToAsyncGenerator map, keyed by selectedModelId. When a completionStreamNextChunk message arrives, the handler calls .next() on this generator and sends the yielded value back.

Proxy-Side Generator (Shadow)

The main-thread proxy creates its own async *asyncGenerate(selectedModelId) generator. Each time the caller iterates this generator (via for await or .next()), it:

  1. Constructs a completionStreamNextChunk WorkerRequest with the selectedModelId
  2. Sends it to the worker and awaits the response
  3. If the response is an object (a ChatCompletionChunk or Completion), yields it
  4. If the response is not an object (i.e., void/undefined), the worker's generator is exhausted, so it breaks

This creates a synchronized pull-based stream: the proxy requests exactly one chunk at a time, waits for it, yields it to the caller, then requests the next. This avoids buffering issues and provides natural backpressure.
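Assuming a hypothetical sendToWorker helper that performs one postMessage round trip and resolves with the worker's reply, the proxy-side shadow generator might look like:

```typescript
// Sketch of the proxy-side shadow generator. `sendToWorker` is a hypothetical
// helper that posts a WorkerRequest and resolves with the worker's response.
type WorkerRequest = { kind: "completionStreamNextChunk"; selectedModelId: string };

async function* asyncGenerateSketch<Chunk extends object>(
  selectedModelId: string,
  sendToWorker: (req: WorkerRequest) => Promise<Chunk | undefined>,
): AsyncGenerator<Chunk, void, void> {
  // Pull-based loop: request exactly one chunk per consumer iteration.
  while (true) {
    const ret = await sendToWorker({ kind: "completionStreamNextChunk", selectedModelId });
    if (typeof ret !== "object" || ret === null) {
      break; // void/undefined reply: the worker's generator is exhausted
    }
    yield ret; // hand exactly one chunk to the for await consumer
  }
}
```

Iterating this generator with for await then drives the whole protocol one chunk at a time.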

Initialization Protocol

Before chunk retrieval begins, the proxy sends a one-time initialization message:

  • chatCompletionStreamInit (for chat completions) or completionStreamInit (for text completions)

This message carries the full request object. The worker handler:

  1. Calls reloadIfUnmatched() to ensure the correct model is loaded
  2. Calls this.engine.chatCompletion(request) with stream: true, which returns an AsyncGenerator
  3. Stores the generator in loadedModelIdToAsyncGenerator.set(selectedModelId, generator)
  4. Returns null to confirm initialization
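The four steps above can be sketched as follows (EngineLike, handleStreamInit, and the injected reloadIfUnmatched callback are illustrative stand-ins for the real handler's members):

```typescript
// Stand-ins for web-llm's engine and chunk types, for illustration only.
type ChunkLike = object;
interface EngineLike {
  chatCompletion(request: { stream: boolean }): AsyncGenerator<ChunkLike, void, void>;
}

// Sketch of the worker's chatCompletionStreamInit handling.
async function handleStreamInit(
  engine: EngineLike,
  generators: Map<string, AsyncGenerator<ChunkLike, void, void>>,
  selectedModelId: string,
  request: { stream: boolean },
  reloadIfUnmatched: (modelId: string) => Promise<void>,
): Promise<null> {
  await reloadIfUnmatched(selectedModelId);   // 1. ensure the right model is loaded
  const gen = engine.chatCompletion(request); // 2. stream: true returns an AsyncGenerator
  generators.set(selectedModelId, gen);       // 3. park it for later next-chunk pulls
  return null;                                // 4. confirm initialization
}
```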

Shared Generator for Chat and Text Completions

Notably, ChatCompletion and Completion streaming share the same chunk generator infrastructure. The completionStreamNextChunk message kind is used for both. The only difference is the type of the yielded object (ChatCompletionChunk vs. Completion). This simplification is possible because the handler maintains the generators per model ID, and a single model processes one streaming request at a time (enforced by CustomLock).
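In type terms, the shared plumbing only needs a union of the two chunk shapes, plus a discriminant when a caller cares which kind it received. A sketch (the object tags follow the OpenAI protocol; the *Like types are illustrative):

```typescript
// Both streaming kinds flow through the same next-chunk machinery,
// differing only in the element type; the "object" tag discriminates them.
interface ChatCompletionChunkLike {
  object: "chat.completion.chunk";
}

interface CompletionLike {
  object: "text_completion";
}

type AnyStreamChunk = ChatCompletionChunkLike | CompletionLike;

// A single generator store serves both request kinds, keyed by model ID.
const generators = new Map<string, AsyncGenerator<AnyStreamChunk, void, void>>();

function isChatChunk(c: AnyStreamChunk): c is ChatCompletionChunkLike {
  return c.object === "chat.completion.chunk";
}
```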

Usage

The streaming pattern is used automatically when stream: true is set:

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Tell me a story." }],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  stream: true,
});

let reply = "";
for await (const chunk of stream) {
  // Each chunk's delta carries the text produced by one decode step.
  reply += chunk.choices[0]?.delta?.content || "";
  console.log(reply); // web-llm runs in the browser, where process.stdout is unavailable
}

The same pattern works for text completions:

const stream = await engine.completions.create({
  prompt: "Once upon a time",
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  stream: true,
});

let text = "";
for await (const chunk of stream) {
  text += chunk.choices[0]?.text || "";
  console.log(text); // again, console output instead of Node's process.stdout
}

Theoretical Basis

The design follows the Iterator pattern adapted for asynchronous cross-thread communication:

  • Pull semantics: Unlike push-based approaches (e.g., ReadableStream or event-based callbacks), this uses pull semantics where the consumer explicitly requests each chunk. This is natural for for await...of loops and provides implicit backpressure.
  • Generator bridging: The shadow generator on the proxy side mirrors the real generator on the worker side. Each yield on the worker side corresponds to a yield on the proxy side, mediated by a single postMessage round trip.
  • Termination signaling: When the worker's real generator returns (i.e., .next() yields { done: true, value: undefined }), the handleTask method returns void (undefined) as the content. The proxy detects this by checking typeof ret !== "object" and breaks out of its loop.

The message sequence for a complete streaming session:

Main Thread                                Worker
     |                                          |
     |  chatCompletionStreamInit                 |
     |  {request, selectedModelId, modelId}     |
     |----------------------------------------->|
     |                                          | engine.chatCompletion(request) -> generator
     |  return: null                             |
     |<-----------------------------------------|
     |                                          |
     |  completionStreamNextChunk               |
     |  {selectedModelId}                       |
     |----------------------------------------->|
     |                                          | generator.next() -> {value: chunk1}
     |  return: ChatCompletionChunk             |
     |<-----------------------------------------|
     |  yield chunk1                            |
     |                                          |
     |  completionStreamNextChunk               |
     |----------------------------------------->|
     |                                          | generator.next() -> {value: chunk2}
     |  return: ChatCompletionChunk             |
     |<-----------------------------------------|
     |  yield chunk2                            |
     |                                          |
     |  ...                                     |
     |                                          |
     |  completionStreamNextChunk               |
     |----------------------------------------->|
     |                                          | generator.next() -> {done: true}
     |  return: void                            |
     |<-----------------------------------------|
     |  break (generator ends)                  |

Performance Characteristics

Each token adds one round-trip postMessage latency (typically sub-millisecond). This overhead is negligible compared to the GPU inference time per token (typically 10-100ms+). The pull-based approach also means the main thread is never flooded with messages faster than it can process them.

Related Pages

Implementation:Mlc_ai_Web_llm_Async_Generate
