Principle:Mlc_ai_Web_llm_Cross_Thread_Streaming
Overview
Cross-Thread Streaming is the technique for streaming LLM-generated tokens across Web Worker thread boundaries using async generators and message-based chunk delivery. It bridges the AsyncGenerator pattern across the postMessage boundary, enabling the standard for await (const chunk of stream) idiom on the main thread while the actual token generation happens inside the worker.
Description
Streaming in web-llm follows the OpenAI streaming protocol: when stream: true is set in a request, the API returns an AsyncIterable of ChatCompletionChunk (or Completion) objects. Each chunk contains the incremental delta text produced by a single decode step.
In a direct MLCEngine, streaming is implemented as an async *asyncGenerate() method that yields chunks after each prefill/decode step. However, when the engine runs in a Web Worker, the AsyncGenerator cannot be directly transferred across the thread boundary (generators are not serializable).
The cross-thread streaming solution uses a pull-based protocol with two collaborating generators:
Worker-Side Generator (Real)
The worker maintains the actual AsyncGenerator from MLCEngine.chatCompletion() (or completion()). This generator is stored in the handler's loadedModelIdToAsyncGenerator map, keyed by selectedModelId. When a completionStreamNextChunk message arrives, the handler calls .next() on this generator and sends the yielded value back.
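A minimal sketch of this worker-side lookup-and-pull step (the handler function and chunk type here are illustrative stand-ins, not web-llm's actual signatures):

```typescript
// Illustrative chunk type; the real values are ChatCompletionChunk / Completion.
type Chunk = { text: string };

// Real generators, keyed by model ID, as in the handler's map described above.
const loadedModelIdToAsyncGenerator = new Map<string, AsyncGenerator<Chunk, void>>();

// Serves one completionStreamNextChunk request: pull a single value from the
// stored generator and return it, or undefined once the generator is exhausted.
async function handleNextChunk(selectedModelId: string): Promise<Chunk | void> {
  const gen = loadedModelIdToAsyncGenerator.get(selectedModelId);
  if (gen === undefined) throw new Error(`No generator for ${selectedModelId}`);
  const { value, done } = await gen.next();
  if (done) {
    loadedModelIdToAsyncGenerator.delete(selectedModelId);
    return; // undefined signals exhaustion to the proxy
  }
  return value;
}
```

Returning `undefined` rather than an object is what lets the proxy distinguish "here is a chunk" from "the stream is over" with a single type check.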
Proxy-Side Generator (Shadow)
The main-thread proxy creates its own async *asyncGenerate(selectedModelId) generator. Each time the caller iterates this generator (via for await or .next()), it:
- Constructs a completionStreamNextChunk WorkerRequest carrying the selectedModelId
- Sends it to the worker and awaits the response
- If the response is an object (a ChatCompletionChunk or Completion), yields it
- If the response is not an object (i.e., void/undefined), the worker's generator is exhausted, so it breaks
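The steps above can be sketched as a shadow generator, with the postMessage round trip abstracted behind a hypothetical sendRequest callback (a stand-in for the proxy's real messaging helper):

```typescript
type NextChunkRequest = { kind: "completionStreamNextChunk"; selectedModelId: string };

// Shadow generator on the main thread: each iteration performs exactly one
// request/response round trip, then yields the received chunk.
async function* asyncGenerate(
  selectedModelId: string,
  sendRequest: (msg: NextChunkRequest) => Promise<unknown>,
): AsyncGenerator<object, void> {
  while (true) {
    const ret = await sendRequest({ kind: "completionStreamNextChunk", selectedModelId });
    // A non-object response (undefined) means the worker's generator returned.
    if (typeof ret !== "object" || ret === null) break;
    yield ret;
  }
}
```

Because the next request is only sent after the previous chunk has been consumed, the loop itself enforces the one-chunk-in-flight discipline.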
This creates a synchronized pull-based stream: the proxy requests exactly one chunk at a time, waits for it, yields it to the caller, then requests the next. This avoids buffering issues and provides natural backpressure.
Initialization Protocol
Before chunk retrieval begins, the proxy sends a one-time initialization message:
chatCompletionStreamInit (for chat completions) or completionStreamInit (for text completions)
This message carries the full request object. The worker handler:
- Calls reloadIfUnmatched() to ensure the correct model is loaded
- Calls this.engine.chatCompletion(request) with stream: true, which returns an AsyncGenerator
- Stores the generator in loadedModelIdToAsyncGenerator.set(selectedModelId, generator)
- Returns null to confirm initialization
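These initialization steps can be sketched as follows; the engine interface and the reloadIfUnmatched stub are simplified stand-ins for the real handler's members, not web-llm's actual API:

```typescript
type ChatRequest = { stream?: boolean; [key: string]: unknown };

interface StreamEngine {
  chatCompletion(request: ChatRequest): AsyncGenerator<object, void>;
}

class StreamHandler {
  private loadedModelIdToAsyncGenerator = new Map<string, AsyncGenerator<object, void>>();

  constructor(private engine: StreamEngine) {}

  // Placeholder: the real handler reloads the engine if the loaded model
  // does not match the requested one.
  private async reloadIfUnmatched(_modelId: string): Promise<void> {}

  async handleStreamInit(
    request: ChatRequest,
    selectedModelId: string,
    modelId: string,
  ): Promise<null> {
    await this.reloadIfUnmatched(modelId);
    // With stream: true, chatCompletion returns an AsyncGenerator of chunks.
    const generator = this.engine.chatCompletion({ ...request, stream: true });
    this.loadedModelIdToAsyncGenerator.set(selectedModelId, generator);
    return null; // null confirms initialization to the proxy
  }
}
```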
Notably, ChatCompletion and Completion streaming share the same chunk generator infrastructure. The completionStreamNextChunk message kind is used for both. The only difference is the type of the yielded object (ChatCompletionChunk vs. Completion). This simplification is possible because the handler maintains the generators per model ID, and a single model processes one streaming request at a time (enforced by CustomLock).
Usage
The streaming pattern is used automatically when stream: true is set:
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Tell me a story." }],
model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || "";
process.stdout.write(content);
}
The same pattern works for text completions:
const stream = await engine.completions.create({
prompt: "Once upon a time",
model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.text || "");
}
Theoretical Basis
The design follows the Iterator pattern adapted for asynchronous cross-thread communication:
- Pull semantics: Unlike push-based approaches (e.g., ReadableStream or event-based callbacks), this uses pull semantics where the consumer explicitly requests each chunk. This is natural for for await...of loops and provides implicit backpressure.
- Generator bridging: The shadow generator on the proxy side mirrors the real generator on the worker side. Each yield on the worker side corresponds to a yield on the proxy side, mediated by a single postMessage round trip.
- Termination signaling: When the worker's real generator returns (i.e., .next() resolves to { done: true, value: undefined }), the handleTask method returns void (undefined) as the content. The proxy detects this by checking typeof ret !== "object" and breaks out of its loop.
The message sequence for a complete streaming session:
Main Thread Worker
| |
| chatCompletionStreamInit |
| {request, selectedModelId, modelId} |
|----------------------------------------->|
| | engine.chatCompletion(request) -> generator
| return: null |
|<-----------------------------------------|
| |
| completionStreamNextChunk |
| {selectedModelId} |
|----------------------------------------->|
| | generator.next() -> {value: chunk1}
| return: ChatCompletionChunk |
|<-----------------------------------------|
| yield chunk1 |
| |
| completionStreamNextChunk |
|----------------------------------------->|
| | generator.next() -> {value: chunk2}
| return: ChatCompletionChunk |
|<-----------------------------------------|
| yield chunk2 |
| |
| ... |
| |
| completionStreamNextChunk |
|----------------------------------------->|
| | generator.next() -> {done: true}
| return: void |
|<-----------------------------------------|
| break (generator ends) |
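The session above can be simulated in-process by replacing the worker with a plain async function standing in for its onmessage handler (all names and the fake chunk shape are illustrative):

```typescript
type Msg =
  | { kind: "chatCompletionStreamInit"; selectedModelId: string }
  | { kind: "completionStreamNextChunk"; selectedModelId: string };

// Fake "worker": init stores a generator; each next-chunk message pulls once.
function makeWorker() {
  const generators = new Map<string, AsyncGenerator<{ delta: string }, void>>();
  async function* fakeChatCompletion(): AsyncGenerator<{ delta: string }, void> {
    yield { delta: "Hello" };
    yield { delta: ", world" };
  }
  return async function handle(msg: Msg): Promise<unknown> {
    if (msg.kind === "chatCompletionStreamInit") {
      generators.set(msg.selectedModelId, fakeChatCompletion());
      return null; // confirms initialization
    }
    const { value, done } = await generators.get(msg.selectedModelId)!.next();
    return done ? undefined : value; // undefined terminates the proxy loop
  };
}

// Main-thread side: init once, then pull chunks until a non-object arrives.
async function streamAll(): Promise<string> {
  const send = makeWorker();
  await send({ kind: "chatCompletionStreamInit", selectedModelId: "m" });
  let out = "";
  while (true) {
    const ret = await send({ kind: "completionStreamNextChunk", selectedModelId: "m" });
    if (typeof ret !== "object" || ret === null) break;
    out += (ret as { delta: string }).delta;
  }
  return out;
}
```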
Performance Characteristics
Each token adds one round-trip postMessage latency (typically sub-millisecond). This overhead is negligible compared to the GPU inference time per token (typically 10-100ms+). The pull-based approach also means the main thread is never flooded with messages faster than it can process them.
Related Pages
- Implementation:Mlc_ai_Web_llm_Async_Generate -- Concrete implementation of the proxy-side generator
- Principle:Mlc_ai_Web_llm_Cross_Thread_Request_Forwarding -- The general request forwarding this builds upon
- Principle:Mlc_ai_Web_llm_Web_Worker_Engine_Handler -- The worker-side handler that serves chunk requests
- Principle:Mlc_ai_Web_llm_Web_Worker_Engine_Proxy -- The proxy that hosts the shadow generator
- Principle:Mlc_ai_Web_llm_Multi_Model_Routing -- How per-model generators are managed