
Principle:Mlc ai Web llm Cross Thread Streaming

From Leeroopedia


Overview

Cross-Thread Streaming is the technique for streaming LLM-generated tokens across Web Worker thread boundaries using async generators and message-based chunk delivery. It bridges the AsyncGenerator pattern across the postMessage boundary, enabling the standard for await (const chunk of stream) idiom on the main thread while the actual token generation happens inside the worker.

Description

Streaming in web-llm follows the OpenAI streaming protocol: when stream: true is set in a request, the API returns an AsyncIterable of ChatCompletionChunk (or Completion) objects. Each chunk contains the incremental delta text produced by a single decode step.
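The shape of those chunks can be sketched in TypeScript as follows. This is a simplified sketch of the OpenAI wire format consumed by the examples below; web-llm's real ChatCompletionChunk type carries additional fields, and the accumulate helper is purely illustrative:

```typescript
// Minimal sketch of the OpenAI-style streaming chunk shape.
// Field names follow the OpenAI streaming protocol; web-llm's actual
// ChatCompletionChunk type includes more fields (usage, logprobs, etc.).
interface ChunkChoiceDelta {
  role?: string;    // present only on the first chunk
  content?: string; // incremental text from one decode step
}

interface ChunkChoice {
  index: number;
  delta: ChunkChoiceDelta;
  finish_reason: string | null; // non-null on the final content chunk
}

interface ChatCompletionChunkSketch {
  id: string;
  object: "chat.completion.chunk";
  model: string;
  choices: ChunkChoice[];
}

// Assembling the full reply is a fold over the per-chunk deltas.
function accumulate(chunks: ChatCompletionChunkSketch[]): string {
  return chunks.map((c) => c.choices[0]?.delta?.content ?? "").join("");
}
```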

In a direct MLCEngine, streaming is implemented as an async *asyncGenerate() method that yields chunks after each prefill/decode step. However, when the engine runs in a Web Worker, the AsyncGenerator cannot be directly transferred across the thread boundary (generators are not serializable).

The cross-thread streaming solution uses a pull-based protocol with two collaborating generators:

Worker-Side Generator (Real)

The worker maintains the actual AsyncGenerator from MLCEngine.chatCompletion() (or completion()). This generator is stored in the handler's loadedModelIdToAsyncGenerator map, keyed by selectedModelId. When a completionStreamNextChunk message arrives, the handler calls .next() on this generator and sends the yielded value back.

Proxy-Side Generator (Shadow)

The main-thread proxy creates its own async *asyncGenerate(selectedModelId) generator. Each time the caller iterates this generator (via for await or .next()), it:

  1. Constructs a completionStreamNextChunk WorkerRequest with the selectedModelId
  2. Sends it to the worker and awaits the response
  3. If the response is an object (a ChatCompletionChunk or Completion), yields it
  4. If the response is not an object (i.e., void/undefined), the worker's generator is exhausted, so it breaks

This creates a synchronized pull-based stream: the proxy requests exactly one chunk at a time, waits for it, yields it to the caller, then requests the next. This avoids buffering issues and provides natural backpressure.
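Assuming a hypothetical sendToWorker helper that performs one postMessage round trip and resolves with the worker's reply, the proxy-side shadow generator might look like:

```typescript
// Sketch of the proxy-side shadow generator. `sendToWorker` is a hypothetical
// helper that posts a WorkerRequest and resolves with the worker's response.
type WorkerRequest = { kind: "completionStreamNextChunk"; selectedModelId: string };

async function* asyncGenerateSketch<Chunk extends object>(
  selectedModelId: string,
  sendToWorker: (req: WorkerRequest) => Promise<Chunk | undefined>,
): AsyncGenerator<Chunk, void, void> {
  // Pull-based loop: request exactly one chunk per consumer iteration.
  while (true) {
    const ret = await sendToWorker({ kind: "completionStreamNextChunk", selectedModelId });
    if (typeof ret !== "object" || ret === null) {
      break; // void/undefined reply: the worker's generator is exhausted
    }
    yield ret; // hand exactly one chunk to the for await consumer
  }
}
```

Iterating this generator with for await then drives the whole protocol one chunk at a time.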

Initialization Protocol

Before chunk retrieval begins, the proxy sends a one-time initialization message:

  • chatCompletionStreamInit (for chat completions) or completionStreamInit (for text completions)

This message carries the full request object. The worker handler:

  1. Calls reloadIfUnmatched() to ensure the correct model is loaded
  2. Calls this.engine.chatCompletion(request) with stream: true, which returns an AsyncGenerator
  3. Stores the generator in loadedModelIdToAsyncGenerator.set(selectedModelId, generator)
  4. Returns null to confirm initialization
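The four steps above can be sketched as follows (EngineLike, handleStreamInit, and the injected reloadIfUnmatched callback are illustrative stand-ins for the real handler's members):

```typescript
// Stand-ins for web-llm's engine and chunk types, for illustration only.
type ChunkLike = object;
interface EngineLike {
  chatCompletion(request: { stream: boolean }): AsyncGenerator<ChunkLike, void, void>;
}

// Sketch of the worker's chatCompletionStreamInit handling.
async function handleStreamInit(
  engine: EngineLike,
  generators: Map<string, AsyncGenerator<ChunkLike, void, void>>,
  selectedModelId: string,
  request: { stream: boolean },
  reloadIfUnmatched: (modelId: string) => Promise<void>,
): Promise<null> {
  await reloadIfUnmatched(selectedModelId);   // 1. ensure the right model is loaded
  const gen = engine.chatCompletion(request); // 2. stream: true returns an AsyncGenerator
  generators.set(selectedModelId, gen);       // 3. park it for later next-chunk pulls
  return null;                                // 4. confirm initialization
}
```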

Shared Generator for Chat and Text Completions

Notably, ChatCompletion and Completion streaming share the same chunk generator infrastructure. The completionStreamNextChunk message kind is used for both. The only difference is the type of the yielded object (ChatCompletionChunk vs. Completion). This simplification is possible because the handler maintains the generators per model ID, and a single model processes one streaming request at a time (enforced by CustomLock).
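In type terms, the shared plumbing only needs a union of the two chunk shapes, plus a discriminant when a caller cares which kind it received. A sketch (the object tags follow the OpenAI protocol; the *Like types are illustrative):

```typescript
// Both streaming kinds flow through the same next-chunk machinery,
// differing only in the element type; the "object" tag discriminates them.
interface ChatCompletionChunkLike {
  object: "chat.completion.chunk";
}

interface CompletionLike {
  object: "text_completion";
}

type AnyStreamChunk = ChatCompletionChunkLike | CompletionLike;

// A single generator store serves both request kinds, keyed by model ID.
const generators = new Map<string, AsyncGenerator<AnyStreamChunk, void, void>>();

function isChatChunk(c: AnyStreamChunk): c is ChatCompletionChunkLike {
  return c.object === "chat.completion.chunk";
}
```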

Usage

The streaming pattern is used automatically when stream: true is set:

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Tell me a story." }],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  stream: true,
});

let reply = "";
for await (const chunk of stream) {
  // Each chunk's delta carries the text produced by one decode step.
  reply += chunk.choices[0]?.delta?.content || "";
  console.log(reply); // web-llm runs in the browser, where process.stdout is unavailable
}

The same pattern works for text completions:

const stream = await engine.completions.create({
  prompt: "Once upon a time",
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  stream: true,
});

let text = "";
for await (const chunk of stream) {
  text += chunk.choices[0]?.text || "";
  console.log(text); // again, console output instead of Node's process.stdout
}

Theoretical Basis

The design follows the Iterator pattern adapted for asynchronous cross-thread communication:

  • Pull semantics: Unlike push-based approaches (e.g., ReadableStream or event-based callbacks), this uses pull semantics where the consumer explicitly requests each chunk. This is natural for for await...of loops and provides implicit backpressure.
  • Generator bridging: The shadow generator on the proxy side mirrors the real generator on the worker side. Each yield on the worker side corresponds to a yield on the proxy side, mediated by a single postMessage round trip.
  • Termination signaling: When the worker's real generator returns (i.e., .next() yields { done: true, value: undefined }), the handleTask method returns void (undefined) as the content. The proxy detects this by checking typeof ret !== "object" and breaks out of its loop.

The message sequence for a complete streaming session:

Main Thread                                Worker
     |                                          |
     |  chatCompletionStreamInit                 |
     |  {request, selectedModelId, modelId}     |
     |----------------------------------------->|
     |                                          | engine.chatCompletion(request) -> generator
     |  return: null                             |
     |<-----------------------------------------|
     |                                          |
     |  completionStreamNextChunk               |
     |  {selectedModelId}                       |
     |----------------------------------------->|
     |                                          | generator.next() -> {value: chunk1}
     |  return: ChatCompletionChunk             |
     |<-----------------------------------------|
     |  yield chunk1                            |
     |                                          |
     |  completionStreamNextChunk               |
     |----------------------------------------->|
     |                                          | generator.next() -> {value: chunk2}
     |  return: ChatCompletionChunk             |
     |<-----------------------------------------|
     |  yield chunk2                            |
     |                                          |
     |  ...                                     |
     |                                          |
     |  completionStreamNextChunk               |
     |----------------------------------------->|
     |                                          | generator.next() -> {done: true}
     |  return: void                            |
     |<-----------------------------------------|
     |  break (generator ends)                  |

Performance Characteristics

Each token adds one round-trip postMessage latency (typically sub-millisecond). This overhead is negligible compared to the GPU inference time per token (typically 10-100ms+). The pull-based approach also means the main thread is never flooded with messages faster than it can process them.

Related Pages

Implementation:Mlc_ai_Web_llm_Async_Generate
