Principle:Mlc ai Web llm Web Worker Engine Handler

From Leeroopedia

Overview

Web Worker Engine Handler is the architectural pattern for hosting LLM inference in a dedicated Web Worker thread to prevent blocking the main UI thread. By moving all computationally intensive operations -- model loading, prefill, and decode -- off the main thread, the browser's rendering pipeline remains unblocked and the application stays responsive to user input even during long-running inference tasks.

Description

Web Worker engine handling follows the Command pattern over postMessage channels. A handler class running inside a Web Worker receives serialized WorkerRequest messages, routes them to an internal MLCEngine instance, and posts WorkerResponse results back to the main thread.

The handler acts as a message router: it inspects the kind field of each incoming WorkerRequest and dispatches to the corresponding MLCEngine method. Supported request kinds include:

  • reload -- Load or switch models in the worker-side engine
  • chatCompletionNonStreaming -- Complete a chat request and return the full result
  • chatCompletionStreamInit -- Initialize a streaming chat completion generator
  • completionStreamNextChunk -- Retrieve the next token chunk from an active streaming generator
  • completionNonStreaming -- Complete a text completion request (non-streaming)
  • completionStreamInit -- Initialize a streaming text completion generator
  • embedding -- Generate embeddings for input text
  • getMessage -- Retrieve the current generated message
  • runtimeStatsText -- Retrieve runtime performance statistics
  • interruptGenerate -- Signal the engine to stop generation
  • resetChat -- Reset the chat session state
  • unload -- Unload the current model and free resources
  • forwardTokensAndSample -- Low-level token forwarding and sampling
  • setLogLevel -- Adjust logging verbosity
  • setAppConfig -- Update the application configuration
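
The dispatch described above can be sketched as a switch over the kind field, with each reply correlated back by uuid and exceptions converted into throw-kind responses. The names below (HandlerSketch, EngineLike) are illustrative, not the actual web-llm classes, and only three of the request kinds are shown:

```typescript
// Worker-side message router sketch: route by `kind`, reply with the
// request's `uuid`, and turn exceptions into "throw"-kind responses.
type WorkerRequest = { kind: string; uuid: string; content: unknown };
type WorkerResponse =
  | { kind: "return"; uuid: string; content: unknown }
  | { kind: "throw"; uuid: string; content: string };

interface EngineLike {
  reload(modelId: string): Promise<void>;
  runtimeStatsText(): Promise<string>;
  resetChat(): Promise<void>;
}

class HandlerSketch {
  private engine: EngineLike;
  private post: (msg: WorkerResponse) => void;

  constructor(engine: EngineLike, post: (msg: WorkerResponse) => void) {
    this.engine = engine;
    this.post = post;
  }

  async onmessage(req: WorkerRequest): Promise<void> {
    try {
      let result: unknown;
      switch (req.kind) {
        case "reload":
          result = await this.engine.reload(req.content as string);
          break;
        case "runtimeStatsText":
          result = await this.engine.runtimeStatsText();
          break;
        case "resetChat":
          result = await this.engine.resetChat();
          break;
        default:
          throw new Error(`Unknown request kind: ${req.kind}`);
      }
      this.post({ kind: "return", uuid: req.uuid, content: result });
    } catch (err) {
      this.post({ kind: "throw", uuid: req.uuid, content: String(err) });
    }
  }
}
```

In the real handler, `post` is the worker's postMessage; injecting it here keeps the routing logic testable without a Worker.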

Each response is wrapped in a WorkerResponse with a matching uuid so the main thread can correlate responses to their originating requests. Errors are caught and returned as throw-kind responses.

For streaming, the handler uses a generator pattern: the main thread first sends a chatCompletionStreamInit (or completionStreamInit) message to create an AsyncGenerator on the worker side, then requests chunks one at a time via completionStreamNextChunk messages. The worker maintains a Map of per-model async generators (keyed by selectedModelId) so that multiple loaded models can stream concurrently.
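
A minimal sketch of that per-model bookkeeping, with a counting generator standing in for the real chat-completion chunk generator (the class and method names are illustrative):

```typescript
// The worker keeps one active AsyncGenerator per loaded model, keyed by
// selectedModelId, so two models can stream interleaved chunks.
async function* countTo(n: number): AsyncGenerator<number> {
  for (let i = 1; i <= n; i++) yield i;
}

class StreamRegistry {
  private generators = new Map<string, AsyncGenerator<number>>();

  // "...StreamInit": create and store a generator for this model.
  init(selectedModelId: string, n: number): void {
    this.generators.set(selectedModelId, countTo(n));
  }

  // "completionStreamNextChunk": pull one chunk from the stored generator.
  async nextChunk(selectedModelId: string): Promise<IteratorResult<number>> {
    const gen = this.generators.get(selectedModelId);
    if (gen === undefined) {
      throw new Error(`No active stream for ${selectedModelId}`);
    }
    const step = await gen.next();
    if (step.done) this.generators.delete(selectedModelId); // stream finished
    return step;
  }
}
```

Because each model id owns its own generator, chunk requests for an LLM and an embedding model can interleave without stepping on each other.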

Usage

Use this pattern when deploying web-llm in production applications where UI responsiveness is critical; it is the recommended deployment pattern for any user-facing application. It is particularly well suited to:

  • Applications that need to display loading progress bars during model initialization
  • Chat interfaces that must remain interactive during inference
  • Applications loading multiple models concurrently (e.g., an LLM and an embedding model)

Do not use this pattern when:

  • Running inference in a Node.js server environment (no Web Worker API)
  • Building a simple demo where main-thread MLCEngine suffices
  • The application has no UI (e.g., a headless script)
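
In web-llm itself, adopting the pattern takes two small files. The sketch below follows the shape of the web-llm README's Web Worker example; it only runs in a browser with a bundler, so treat it as orientation rather than drop-in code, and verify the export names and model id against your installed version:

```typescript
// worker.ts -- runs inside the Web Worker
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);
```

```typescript
// main.ts -- runs on the main (UI) thread
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  "Llama-3.1-8B-Instruct-q4f32_1-MLC", // example prebuilt model id
  { initProgressCallback: (report) => console.log(report.text) },
);

// The proxy engine exposes the same OpenAI-style API as a local MLCEngine.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});
```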

Theoretical Basis

Web Workers provide true OS-level thread separation in browsers. Unlike JavaScript's single-threaded event loop, a Web Worker runs in its own thread with its own event loop, heap, and global scope. Communication between the main thread and the worker thread occurs exclusively through the postMessage API, which uses the structured clone algorithm to serialize data across thread boundaries.

The design follows several established patterns:

  • Command Pattern: Each WorkerRequest encapsulates a command with its parameters. The kind field identifies the command, and the content field carries the parameters.
  • Promise-based RPC: Each request carries a uuid. The main thread creates a Promise and stores its resolver in a pendingPromise map keyed by uuid. When the worker responds with the same uuid, the promise is resolved.
  • Async Generator Bridging: For streaming, the handler maintains an AsyncGenerator per loaded model. The main thread proxy creates its own AsyncGenerator that, on each next() call, sends a completionStreamNextChunk message and awaits the worker's response.
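
The Promise-based RPC half can be sketched independently of the Worker API by injecting the post function; pendingPromise and the uuid correlation follow the description above, while RpcClient itself is a hypothetical name:

```typescript
// Main-thread side of the uuid-correlated RPC. `post` stands in for
// worker.postMessage; `onResponse` would be wired to worker.onmessage.
type Outgoing = { kind: string; uuid: string; content: unknown };
type Incoming = { kind: "return" | "throw"; uuid: string; content: unknown };

class RpcClient {
  private pendingPromise = new Map<
    string,
    { resolve: (v: unknown) => void; reject: (e: Error) => void }
  >();
  private nextId = 0;
  private post: (msg: Outgoing) => void;

  constructor(post: (msg: Outgoing) => void) {
    this.post = post;
  }

  // Create a Promise, park its resolver under a fresh uuid, send the request.
  request(kind: string, content: unknown): Promise<unknown> {
    const uuid = `req-${this.nextId++}`;
    return new Promise((resolve, reject) => {
      this.pendingPromise.set(uuid, { resolve, reject });
      this.post({ kind, uuid, content });
    });
  }

  // Settle the parked Promise whose uuid matches the response.
  onResponse(msg: Incoming): void {
    const pending = this.pendingPromise.get(msg.uuid);
    if (pending === undefined) return;
    this.pendingPromise.delete(msg.uuid);
    if (msg.kind === "throw") pending.reject(new Error(String(msg.content)));
    else pending.resolve(msg.content);
  }
}
```

Because responses are matched by uuid rather than arrival order, a fast runtimeStatsText reply overtaking a slow reload still resolves the right caller.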

The message protocol is defined in src/message.ts:

type RequestKind =
  | "reload"
  | "runtimeStatsText"
  | "interruptGenerate"
  | "unload"
  | "resetChat"
  | "getMaxStorageBufferBindingSize"
  | "getGPUVendor"
  | "forwardTokensAndSample"
  | "chatCompletionNonStreaming"
  | "completionNonStreaming"
  | "embedding"
  | "getMessage"
  | "chatCompletionStreamInit"
  | "completionStreamInit"
  | "completionStreamNextChunk"
  | "customRequest"
  | "keepAlive"
  | "setLogLevel"
  | "setAppConfig";

export type WorkerRequest = {
  kind: RequestKind;
  uuid: string;
  content: MessageContent;
};

export type WorkerResponse =
  | OneTimeWorkerResponse
  | InitProgressWorkerResponse
  | HeartbeatWorkerResponse;

Related Pages

Implementation:Mlc_ai_Web_llm_Web_Worker_MLC_Engine_Handler
