Principle:Mlc_ai_Web_llm_Web_Worker_Engine_Handler
Overview
Web Worker Engine Handler is the architectural pattern for hosting LLM inference in a dedicated Web Worker thread to prevent blocking the main UI thread. By moving all computationally intensive operations -- model loading, prefill, and decode -- off the main thread, the browser's rendering pipeline remains unblocked and the application stays responsive to user input even during long-running inference tasks.
Description
Web Worker engine handling follows the Command pattern over postMessage channels. A handler class running inside a Web Worker receives serialized `WorkerRequest` messages, routes them to an internal `MLCEngine` instance, and posts `WorkerResponse` results back to the main thread.
The handler acts as a message router: it inspects the `kind` field of each incoming `WorkerRequest` and dispatches to the corresponding `MLCEngine` method. Supported request kinds include:
- `reload` -- Load or switch models in the worker-side engine
- `chatCompletionNonStreaming` -- Complete a chat request and return the full result
- `chatCompletionStreamInit` -- Initialize a streaming chat completion generator
- `completionStreamNextChunk` -- Retrieve the next token chunk from an active streaming generator
- `completionNonStreaming` -- Complete a text completion request (non-streaming)
- `completionStreamInit` -- Initialize a streaming text completion generator
- `embedding` -- Generate embeddings for input text
- `getMessage` -- Retrieve the current generated message
- `runtimeStatsText` -- Retrieve runtime performance statistics
- `interruptGenerate` -- Signal the engine to stop generation
- `resetChat` -- Reset the chat session state
- `unload` -- Unload the current model and free resources
- `forwardTokensAndSample` -- Low-level token forwarding and sampling
- `setLogLevel` -- Adjust logging verbosity
- `setAppConfig` -- Update the application configuration
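The routing described above can be sketched as a `switch` on the request's `kind`. The snippet below is a simplified illustration, not the library's actual implementation: the request/response shapes are reduced stand-ins, and a stub object takes the place of the real `MLCEngine`.

```typescript
// Simplified stand-ins for the worker message types (assumed shapes).
type WorkerRequest = { kind: string; uuid: string; content: any };
type WorkerResponse = { kind: string; uuid: string; content: any };

// Stub engine exposing a few methods analogous to MLCEngine's.
const engine = {
  reload: async (modelId: string) => `loaded ${modelId}`,
  runtimeStatsText: async () => "prefill: 100 tok/s, decode: 50 tok/s",
  resetChat: async () => "ok",
};

// The handler inspects `kind` and routes to the matching engine method,
// echoing back the request's uuid so the caller can correlate the reply.
async function handleRequest(req: WorkerRequest): Promise<WorkerResponse> {
  switch (req.kind) {
    case "reload":
      return { kind: "return", uuid: req.uuid, content: await engine.reload(req.content.modelId) };
    case "runtimeStatsText":
      return { kind: "return", uuid: req.uuid, content: await engine.runtimeStatsText() };
    case "resetChat":
      return { kind: "return", uuid: req.uuid, content: await engine.resetChat() };
    default:
      // Errors and unknown commands come back as "throw"-kind responses.
      return { kind: "throw", uuid: req.uuid, content: `Unknown kind: ${req.kind}` };
  }
}
```

In the real handler the response is posted back via `postMessage` rather than returned, but the dispatch structure is the same.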
Each response is wrapped in a `WorkerResponse` with a matching `uuid` so the main thread can correlate responses to their originating requests. Errors are caught and returned as `throw`-kind responses.
For streaming, the handler uses a generator pattern: the main thread first sends a `chatCompletionStreamInit` (or `completionStreamInit`) message to create an `AsyncGenerator` on the worker side, then requests chunks one at a time via `completionStreamNextChunk` messages. The worker maintains a `Map` of per-model async generators (keyed by `selectedModelId`) so that multiple loaded models can stream concurrently.
Usage
Use this pattern when deploying web-llm in production applications where UI responsiveness is critical. This is the recommended deployment pattern for any user-facing application. The pattern is also suitable for:
- Applications that need to display loading progress bars during model initialization
- Chat interfaces that must remain interactive during inference
- Applications loading multiple models concurrently (e.g., an LLM and an embedding model)
Do not use this pattern when:
- Running inference in a Node.js server environment (no Web Worker API)
- Building a simple demo where the main-thread `MLCEngine` suffices
- The application has no UI (e.g., a headless script)
Theoretical Basis
Web Workers provide true OS-level thread separation in browsers. Unlike JavaScript's single-threaded event loop, a Web Worker runs in its own thread with its own event loop, heap, and global scope. Communication between the main thread and the worker thread occurs exclusively through the postMessage API, which uses the structured clone algorithm to serialize data across thread boundaries.
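The structured clone constraint matters in practice: only plain data crosses the thread boundary, so every request and response must be serializable (no functions, no live class instances). The global `structuredClone` available in modern browsers and Node.js (17+) implements the same algorithm `postMessage` uses, so it can demonstrate the boundary's rules directly:

```typescript
// structuredClone applies the same algorithm postMessage uses to
// serialize data across thread boundaries.
const request = { kind: "reload", uuid: "abc-123", content: { modelId: "some-model" } };
const cloned = structuredClone(request); // deep copy; safe to send to a worker

// Functions cannot be cloned -- putting one in a message would make
// postMessage throw a DataCloneError. structuredClone fails the same way.
let cloneFailed = false;
try {
  structuredClone({ callback: () => {} });
} catch {
  cloneFailed = true;
}
```

This is why the handler protocol is built from plain `kind`/`uuid`/`content` records rather than passing engine objects or callbacks between threads.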
The design follows several established patterns:
- Command Pattern: Each `WorkerRequest` encapsulates a command with its parameters. The `kind` field identifies the command, and the `content` field carries the parameters.
- Promise-based RPC: Each request carries a `uuid`. The main thread creates a `Promise` and stores its resolver in a `pendingPromise` map keyed by `uuid`. When the worker responds with the same `uuid`, the promise is resolved.
- Async Generator Bridging: For streaming, the handler maintains an `AsyncGenerator` per loaded model. The main thread proxy creates its own `AsyncGenerator` that, on each `next()` call, sends a `completionStreamNextChunk` message and awaits the worker's response.
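The promise-based RPC pattern above can be sketched with an in-memory transport standing in for `postMessage`. The `pendingPromise` map mirrors the one described in the list; the `sendRequest`/`onWorkerResponse` helper names and the sequential uuid are illustrative assumptions, not the library's code.

```typescript
type WorkerResponseMsg = { uuid: string; kind: "return" | "throw"; content: unknown };

// Pending resolvers keyed by uuid, as the main-thread proxy keeps them.
const pendingPromise = new Map<
  string,
  { resolve: (v: unknown) => void; reject: (e: unknown) => void }
>();

let counter = 0;
function sendRequest(post: (uuid: string) => void): Promise<unknown> {
  const uuid = `req-${counter++}`; // real code would use a proper UUID
  const promise = new Promise<unknown>((resolve, reject) => {
    pendingPromise.set(uuid, { resolve, reject });
  });
  post(uuid); // stand-in for worker.postMessage({ kind, uuid, content })
  return promise;
}

// What the proxy's message listener does when a WorkerResponse arrives:
// look up the resolver by uuid and settle the matching promise.
function onWorkerResponse(msg: WorkerResponseMsg): void {
  const pending = pendingPromise.get(msg.uuid);
  if (pending === undefined) return; // unknown uuid: ignore
  pendingPromise.delete(msg.uuid);
  if (msg.kind === "throw") pending.reject(msg.content);
  else pending.resolve(msg.content);
}
```

Because each promise is settled and removed by its `uuid`, responses can arrive out of order (or interleaved with streaming chunks) without being misattributed.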
The message protocol is defined in `src/message.ts`:

```typescript
type RequestKind =
  | "reload"
  | "runtimeStatsText"
  | "interruptGenerate"
  | "unload"
  | "resetChat"
  | "getMaxStorageBufferBindingSize"
  | "getGPUVendor"
  | "forwardTokensAndSample"
  | "chatCompletionNonStreaming"
  | "completionNonStreaming"
  | "embedding"
  | "getMessage"
  | "chatCompletionStreamInit"
  | "completionStreamInit"
  | "completionStreamNextChunk"
  | "customRequest"
  | "keepAlive"
  | "setLogLevel"
  | "setAppConfig";

export type WorkerRequest = {
  kind: RequestKind;
  uuid: string;
  content: MessageContent;
};

export type WorkerResponse =
  | OneTimeWorkerResponse
  | InitProgressWorkerResponse
  | HeartbeatWorkerResponse;
```
Related Pages
- Implementation:Mlc_ai_Web_llm_Web_Worker_MLC_Engine_Handler -- Concrete implementation of this pattern
- Principle:Mlc_ai_Web_llm_Web_Worker_Engine_Proxy -- The main-thread proxy that communicates with this handler
- Principle:Mlc_ai_Web_llm_Cross_Thread_Request_Forwarding -- Detailed request forwarding mechanism
- Principle:Mlc_ai_Web_llm_Cross_Thread_Streaming -- Streaming across thread boundaries
- Principle:Mlc_ai_Web_llm_Multi_Model_Routing -- Multi-model request routing