Principle:Mlc_ai_Web_llm_Web_Worker_Engine_Handler
Overview
Web Worker Engine Handler is the architectural pattern for hosting LLM inference in a dedicated Web Worker thread to prevent blocking the main UI thread. By moving all computationally intensive operations -- model loading, prefill, and decode -- off the main thread, the browser's rendering pipeline remains unblocked and the application stays responsive to user input even during long-running inference tasks.
Description
Web Worker engine handling follows the Command pattern over postMessage channels. A handler class running inside a Web Worker receives serialized `WorkerRequest` messages, routes them to an internal `MLCEngine` instance, and posts `WorkerResponse` results back to the main thread.
The handler acts as a message router: it inspects the `kind` field of each incoming `WorkerRequest` and dispatches to the corresponding `MLCEngine` method. Supported request kinds include:
- `reload` -- Load or switch models in the worker-side engine
- `chatCompletionNonStreaming` -- Complete a chat request and return the full result
- `chatCompletionStreamInit` -- Initialize a streaming chat completion generator
- `completionStreamNextChunk` -- Retrieve the next token chunk from an active streaming generator
- `completionNonStreaming` -- Complete a text completion request (non-streaming)
- `completionStreamInit` -- Initialize a streaming text completion generator
- `embedding` -- Generate embeddings for input text
- `getMessage` -- Retrieve the current generated message
- `runtimeStatsText` -- Retrieve runtime performance statistics
- `interruptGenerate` -- Signal the engine to stop generation
- `resetChat` -- Reset the chat session state
- `unload` -- Unload the current model and free resources
- `forwardTokensAndSample` -- Low-level token forwarding and sampling
- `setLogLevel` -- Adjust logging verbosity
- `setAppConfig` -- Update the application configuration
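The routing described above can be sketched as a `switch` on the request's `kind`. The snippet below is a simplified illustration, not the library's actual implementation: the request/response shapes are reduced stand-ins, and a stub object takes the place of the real `MLCEngine`.

```typescript
// Simplified stand-ins for the worker message types (assumed shapes).
type WorkerRequest = { kind: string; uuid: string; content: any };
type WorkerResponse = { kind: string; uuid: string; content: any };

// Stub engine exposing a few methods analogous to MLCEngine's.
const engine = {
  reload: async (modelId: string) => `loaded ${modelId}`,
  runtimeStatsText: async () => "prefill: 100 tok/s, decode: 50 tok/s",
  resetChat: async () => "ok",
};

// The handler inspects `kind` and routes to the matching engine method,
// echoing back the request's uuid so the caller can correlate the reply.
async function handleRequest(req: WorkerRequest): Promise<WorkerResponse> {
  switch (req.kind) {
    case "reload":
      return { kind: "return", uuid: req.uuid, content: await engine.reload(req.content.modelId) };
    case "runtimeStatsText":
      return { kind: "return", uuid: req.uuid, content: await engine.runtimeStatsText() };
    case "resetChat":
      return { kind: "return", uuid: req.uuid, content: await engine.resetChat() };
    default:
      // Errors and unknown commands come back as "throw"-kind responses.
      return { kind: "throw", uuid: req.uuid, content: `Unknown kind: ${req.kind}` };
  }
}
```

In the real handler the response is posted back via `postMessage` rather than returned, but the dispatch structure is the same.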
Each response is wrapped in a `WorkerResponse` with a matching `uuid` so the main thread can correlate responses to their originating requests. Errors are caught and returned as `throw`-kind responses.
For streaming, the handler uses a generator pattern: the main thread first sends a `chatCompletionStreamInit` (or `completionStreamInit`) message to create an `AsyncGenerator` on the worker side, then requests chunks one at a time via `completionStreamNextChunk` messages. The worker maintains a `Map` of per-model async generators (keyed by `selectedModelId`) so that multiple loaded models can stream concurrently.
Usage
Use this pattern when deploying web-llm in production applications where UI responsiveness is critical. This is the recommended deployment pattern for any user-facing application. The pattern is also suitable for:
- Applications that need to display loading progress bars during model initialization
- Chat interfaces that must remain interactive during inference
- Applications loading multiple models concurrently (e.g., an LLM and an embedding model)
Do not use this pattern when:
- Running inference in a Node.js server environment (no Web Worker API)
- Building a simple demo where the main-thread `MLCEngine` suffices
- The application has no UI (e.g., a headless script)
Theoretical Basis
Web Workers provide true OS-level thread separation in browsers. Unlike JavaScript's single-threaded event loop, a Web Worker runs in its own thread with its own event loop, heap, and global scope. Communication between the main thread and the worker thread occurs exclusively through the postMessage API, which uses the structured clone algorithm to serialize data across thread boundaries.
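The structured clone constraint matters in practice: only plain data crosses the thread boundary, so every request and response must be serializable (no functions, no live class instances). The global `structuredClone` available in modern browsers and Node.js (17+) implements the same algorithm `postMessage` uses, so it can demonstrate the boundary's rules directly:

```typescript
// structuredClone applies the same algorithm postMessage uses to
// serialize data across thread boundaries.
const request = { kind: "reload", uuid: "abc-123", content: { modelId: "some-model" } };
const cloned = structuredClone(request); // deep copy; safe to send to a worker

// Functions cannot be cloned -- putting one in a message would make
// postMessage throw a DataCloneError. structuredClone fails the same way.
let cloneFailed = false;
try {
  structuredClone({ callback: () => {} });
} catch {
  cloneFailed = true;
}
```

This is why the handler protocol is built from plain `kind`/`uuid`/`content` records rather than passing engine objects or callbacks between threads.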
The design follows several established patterns:
- Command Pattern: Each `WorkerRequest` encapsulates a command with its parameters. The `kind` field identifies the command, and the `content` field carries the parameters.
- Promise-based RPC: Each request carries a `uuid`. The main thread creates a `Promise` and stores its resolver in a `pendingPromise` map keyed by `uuid`. When the worker responds with the same `uuid`, the promise is resolved.
- Async Generator Bridging: For streaming, the handler maintains an `AsyncGenerator` per loaded model. The main thread proxy creates its own `AsyncGenerator` that, on each `next()` call, sends a `completionStreamNextChunk` message and awaits the worker's response.
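The promise-based RPC pattern above can be sketched with an in-memory transport standing in for `postMessage`. The `pendingPromise` map mirrors the one described in the list; the `sendRequest`/`onWorkerResponse` helper names and the sequential uuid are illustrative assumptions, not the library's code.

```typescript
type WorkerResponseMsg = { uuid: string; kind: "return" | "throw"; content: unknown };

// Pending resolvers keyed by uuid, as the main-thread proxy keeps them.
const pendingPromise = new Map<
  string,
  { resolve: (v: unknown) => void; reject: (e: unknown) => void }
>();

let counter = 0;
function sendRequest(post: (uuid: string) => void): Promise<unknown> {
  const uuid = `req-${counter++}`; // real code would use a proper UUID
  const promise = new Promise<unknown>((resolve, reject) => {
    pendingPromise.set(uuid, { resolve, reject });
  });
  post(uuid); // stand-in for worker.postMessage({ kind, uuid, content })
  return promise;
}

// What the proxy's message listener does when a WorkerResponse arrives:
// look up the resolver by uuid and settle the matching promise.
function onWorkerResponse(msg: WorkerResponseMsg): void {
  const pending = pendingPromise.get(msg.uuid);
  if (pending === undefined) return; // unknown uuid: ignore
  pendingPromise.delete(msg.uuid);
  if (msg.kind === "throw") pending.reject(msg.content);
  else pending.resolve(msg.content);
}
```

Because each promise is settled and removed by its `uuid`, responses can arrive out of order (or interleaved with streaming chunks) without being misattributed.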
The message protocol is defined in `src/message.ts`:

```typescript
type RequestKind =
  | "reload"
  | "runtimeStatsText"
  | "interruptGenerate"
  | "unload"
  | "resetChat"
  | "getMaxStorageBufferBindingSize"
  | "getGPUVendor"
  | "forwardTokensAndSample"
  | "chatCompletionNonStreaming"
  | "completionNonStreaming"
  | "embedding"
  | "getMessage"
  | "chatCompletionStreamInit"
  | "completionStreamInit"
  | "completionStreamNextChunk"
  | "customRequest"
  | "keepAlive"
  | "setLogLevel"
  | "setAppConfig";

export type WorkerRequest = {
  kind: RequestKind;
  uuid: string;
  content: MessageContent;
};

export type WorkerResponse =
  | OneTimeWorkerResponse
  | InitProgressWorkerResponse
  | HeartbeatWorkerResponse;
```
Related Pages
- Implementation:Mlc_ai_Web_llm_Web_Worker_MLC_Engine_Handler -- Concrete implementation of this pattern
- Principle:Mlc_ai_Web_llm_Web_Worker_Engine_Proxy -- The main-thread proxy that communicates with this handler
- Principle:Mlc_ai_Web_llm_Cross_Thread_Request_Forwarding -- Detailed request forwarding mechanism
- Principle:Mlc_ai_Web_llm_Cross_Thread_Streaming -- Streaming across thread boundaries
- Principle:Mlc_ai_Web_llm_Multi_Model_Routing -- Multi-model request routing