Workflow:Mlc ai Web llm Web Worker Deployment

Knowledge Sources	web-llm WebLLM Docs
Domains	LLMs, WebGPU, Web_Workers, Browser_Inference
Last Updated	2026-02-14 22:00 GMT

Overview

End-to-end process for offloading LLM inference to a Web Worker thread to keep the browser UI responsive during model loading and generation.

Description

This workflow demonstrates the recommended production deployment pattern for web-llm: running the inference engine inside a Web Worker. The main thread creates a WebWorkerMLCEngine proxy that communicates with a WebWorkerMLCEngineHandler running in the worker thread via postMessage. This architecture prevents the computationally intensive model loading and token generation from blocking the main thread, ensuring the UI remains interactive. The API surface is identical to the main-thread MLCEngine, so switching between deployment modes requires minimal code changes.

Usage

Execute this workflow when building a production web application that uses web-llm and needs a responsive UI during model loading and inference. This is the primary recommended deployment pattern for any user-facing application, because main-thread inference can freeze the UI while the model loads and generates tokens.

Execution Steps

Step 1: Create the Worker Script

Create a dedicated worker file that imports and instantiates the WebWorkerMLCEngineHandler. The handler wraps an internal MLCEngine instance and routes incoming messages from the main thread to the appropriate engine methods. The worker script is minimal: instantiate the handler and set self.onmessage to forward each incoming message to the handler's onmessage method.

Key considerations:

The worker script must live in its own file and is typically loaded as a module worker (by passing type: "module" in the Worker constructor options)
The handler automatically creates an internal MLCEngine when it receives a reload request
Custom logit processors can optionally be registered in the worker via the handler constructor
No model selection or configuration happens in the worker script itself
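
The wiring described in this step can be sketched as follows. This is a minimal illustration, not the library's own code: the file name worker.ts, the helper wireHandler, and the types MessageLike, HandlerLike, and WorkerScopeLike are hypothetical names introduced here so the sketch type-checks outside a browser. The comment at the top shows the form the script would likely take in a real worker.

```typescript
// worker.ts (hypothetical file name)
//
// In a real worker file the wiring would look roughly like this:
//   import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
//   const handler = new WebWorkerMLCEngineHandler();
//   self.onmessage = (msg) => handler.onmessage(msg);
//
// The code below expresses the same pattern against structural stand-ins.

// Assumed minimal shape of a message event: only the payload matters here.
type MessageLike = { data: unknown };

// Anything that can consume a message, e.g. an engine handler.
interface HandlerLike {
  onmessage(event: MessageLike): void;
}

// The part of the worker global scope (self) that the script touches.
interface WorkerScopeLike {
  onmessage: ((event: MessageLike) => void) | null;
}

// Forward every message posted by the main thread to the handler.
function wireHandler(scope: WorkerScopeLike, handler: HandlerLike): void {
  scope.onmessage = (event) => handler.onmessage(event);
}
```

Note that the script names no model and sets no configuration; both arrive later in messages from the main thread.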

Step 2: Initialize the Engine from Main Thread

In the main thread, construct a Worker that points to the worker script and pass it to the CreateWebWorkerMLCEngine factory function. The factory wraps the worker in a WebWorkerMLCEngine proxy and triggers model loading. Besides the worker, it accepts the same model ID, engine config (including initProgressCallback), and chat options as the main-thread CreateMLCEngine.

What happens:

A new Worker is instantiated from the worker script URL and handed to the factory
The WebWorkerMLCEngine proxy is created, which serializes API calls into WorkerRequest messages
The proxy sends a reload request to the worker, which triggers model download and initialization
Progress updates are forwarded from the worker back to the main thread via WorkerResponse messages
The initProgressCallback fires on the main thread with loading status
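
A sketch of how the progress callback might feed the UI. The ProgressReport fields below are assumed to mirror the report object web-llm passes to initProgressCallback, and progressLabel is a hypothetical helper. The factory call is shown only in a comment because it needs a browser with WebGPU; modelId and statusEl in that comment are placeholders.

```typescript
// Assumed shape of the report passed to initProgressCallback:
// a completion fraction in [0, 1], the elapsed time, and a status message.
interface ProgressReport {
  progress: number;
  timeElapsed: number;
  text: string;
}

// Hypothetical helper: turn a report into a short label for a status line.
function progressLabel(report: ProgressReport): string {
  const percent = Math.round(report.progress * 100);
  return `${percent}% - ${report.text}`;
}

// In the browser, engine creation would look roughly like this:
//   import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";
//   const engine = await CreateWebWorkerMLCEngine(
//     new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
//     modelId,
//     { initProgressCallback: (r) => { statusEl.textContent = progressLabel(r); } },
//   );
```

Because the callback runs on the main thread, it can update the DOM directly while the download and initialization continue in the worker.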

Step 3: Send Chat Completion Requests

Use the same engine.chat.completions.create() API as in the main-thread workflow. The WebWorkerMLCEngine proxy transparently serializes the request, sends it to the worker via postMessage, and deserializes the response. Both streaming and non-streaming modes work identically to the main-thread API.

What happens:

The proxy generates a unique request ID and serializes the ChatCompletionRequest
The message is sent to the worker via postMessage
The worker handler routes it to the internal engine's chatCompletion method
For streaming, the worker sends back a sequence of WorkerResponse messages (one per chunk)
The proxy reconstructs the AsyncGenerator interface on the main thread side
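
The request ID pairing described above can be illustrated with a small stand-alone proxy. This is a sketch of the general technique, not web-llm's implementation: RequestProxy, WorkerRequestLike, WorkerResponseLike, and their field names are assumptions made for illustration, and the post callback stands in for worker.postMessage.

```typescript
// Hypothetical message shapes; the real WorkerRequest/WorkerResponse may differ.
interface WorkerRequestLike {
  uuid: string;
  kind: string;
  content: unknown;
}
interface WorkerResponseLike {
  uuid: string;
  kind: "return" | "throw";
  content: unknown;
}

type Pending = { resolve: (value: unknown) => void; reject: (reason: unknown) => void };

class RequestProxy {
  private pending = new Map<string, Pending>();
  private nextId = 0;
  private post: (req: WorkerRequestLike) => void;

  constructor(post: (req: WorkerRequestLike) => void) {
    this.post = post;
  }

  // Turn an API call into a message and return a promise for its reply.
  call(kind: string, content: unknown): Promise<unknown> {
    const uuid = `req-${this.nextId++}`;
    return new Promise((resolve, reject) => {
      this.pending.set(uuid, { resolve, reject });
      this.post({ uuid, kind, content });
    });
  }

  // Invoked for every message that comes back from the worker.
  onResponse(res: WorkerResponseLike): void {
    const entry = this.pending.get(res.uuid);
    if (!entry) return; // unknown or already settled request
    this.pending.delete(res.uuid);
    if (res.kind === "return") entry.resolve(res.content);
    else entry.reject(res.content);
  }
}
```

Keying pending promises by ID lets replies arrive in any order, which matters once several requests are in flight at the same time.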

Step 4: Process Streaming Responses

Iterate over the streaming response using a for await loop on the main thread. Each chunk arrives via the worker message channel and is yielded by the proxy's AsyncGenerator. Update the UI incrementally as each chunk arrives, maintaining responsiveness throughout the generation process.

Key considerations:

The for-await loop runs on the main thread but doesn't block it between chunks
engine.getMessage() can retrieve the full response after streaming completes
engine.interruptGenerate() can cancel in-progress generation
Multiple models can be loaded simultaneously by passing an array of model IDs (see Step 5)
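
A sketch of the streaming loop from this step. StreamingEngineLike and ChunkLike are simplified, hypothetical stand-ins for the engine and chunk types (a real engine object may need a cast to fit them), and streamReply is a helper name introduced here. The chunk layout with choices[0].delta.content is assumed to follow the OpenAI-style streaming format.

```typescript
// Simplified chunk shape: each chunk carries an optional text delta.
interface ChunkLike {
  choices: Array<{ delta: { content?: string | null } }>;
}

// Simplified engine surface: only the streaming chat call is modelled.
interface StreamingEngineLike {
  chat: {
    completions: {
      create(request: {
        messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
        stream: true;
      }): Promise<AsyncIterable<ChunkLike>>;
    };
  };
}

// Stream one reply, reporting the accumulated text after every chunk.
async function streamReply(
  engine: StreamingEngineLike,
  prompt: string,
  onText: (textSoFar: string) => void,
): Promise<string> {
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  let reply = "";
  for await (const chunk of chunks) {
    // A chunk may carry no choices or no text, so fall back to "".
    reply += chunk.choices[0]?.delta.content ?? "";
    onText(reply); // e.g. write the text into a DOM element
  }
  return reply;
}
```

Each await inside the loop yields control back to the event loop, so the page can repaint and handle input between chunks. A stop button could call engine.interruptGenerate(), as noted above.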

Step 5: Handle Multi-model Scenarios

Optionally load multiple models into the same worker engine by passing an array of model IDs to CreateWebWorkerMLCEngine. When multiple models are loaded, specify the model parameter in each request to indicate which model should handle it. The engine uses per-model async locks to serialize requests to the same model, while requests to different models can proceed concurrently.

Key considerations:

Pass model IDs as an array: [modelId1, modelId2]
Each request must include a model field to disambiguate
engine.getMessage(modelId) requires the model ID parameter when multiple models are loaded
Requests to different models can run concurrently; requests to the same model are queued
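
The queuing behaviour described above can be illustrated with a small per-key lock. This is a sketch of the general technique, not web-llm's internal lock: KeyedLock is a hypothetical class, and the key plays the role of the model ID. The comment at the top shows how a multi-model engine might be created and addressed, with modelId1, modelId2, worker, and messages as placeholders.

```typescript
// With the real library, multi-model use would look roughly like this:
//   const engine = await CreateWebWorkerMLCEngine(worker, [modelId1, modelId2]);
//   await engine.chat.completions.create({ model: modelId1, messages });

// Hypothetical per-key async lock: tasks sharing a key run one at a time,
// while tasks with different keys are free to overlap.
class KeyedLock {
  // For each key, a promise that settles when its latest task has finished.
  private tails = new Map<string, Promise<void>>();

  run<T>(key: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    // Start the task only after the previous task for this key has finished.
    const result = previous.then(task);
    // Swallow rejections in the chain so one failure does not block the queue.
    this.tails.set(
      key,
      result.then(
        () => undefined,
        () => undefined,
      ),
    );
    return result;
  }
}
```

Chaining each task onto the previous promise for the same key gives FIFO ordering per model without any shared global queue.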
