Workflow:Mlc ai Web llm Web Worker Deployment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, WebGPU, Web_Workers, Browser_Inference |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
End-to-end process for offloading LLM inference to a Web Worker thread to keep the browser UI responsive during model loading and generation.
Description
This workflow demonstrates the recommended production deployment pattern for web-llm: running the inference engine inside a Web Worker. The main thread creates a WebWorkerMLCEngine proxy that communicates with a WebWorkerMLCEngineHandler running in the worker thread via postMessage. This architecture prevents the computationally intensive model loading and token generation from blocking the main thread, ensuring the UI remains interactive. The API surface is identical to the main-thread MLCEngine, so switching between deployment modes requires minimal code changes.
Usage
Use this workflow when building a production web application that uses web-llm and needs a responsive UI during model loading and inference. This is the primary recommended deployment pattern for any user-facing application, because main-thread inference can freeze the UI for as long as model loading or token generation is running.
Execution Steps
Step 1: Create the Worker Script
Create a dedicated worker file that imports and initializes the WebWorkerMLCEngineHandler. The handler wraps an internal MLCEngine instance and routes incoming messages from the main thread to the appropriate engine methods. The worker script is minimal: instantiate the handler and wire its onmessage to the worker's self.onmessage.
Key considerations:
- The worker script must live in its own file and is typically loaded as a module worker, by passing { type: "module" } to the Worker constructor
- The handler automatically creates an internal MLCEngine when it receives a reload request
- Custom logit processors can optionally be registered in the worker via the handler constructor
- No model selection or configuration happens in the worker script itself
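The wiring described in this step can be sketched as follows. The commented lines show how the worker file would look with the library's WebWorkerMLCEngineHandler; the interfaces and the wireHandler helper are hypothetical stand-ins, introduced only so the forwarding logic can run without a browser or the library.

```typescript
// In a real worker file the handler would come from the library:
//   import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
//   const handler = new WebWorkerMLCEngineHandler();
//   self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);
//
// Everything below is a hypothetical stand-in for illustration only.

interface MessageLike {
  data: unknown; // stand-in for MessageEvent.data
}

interface HandlerLike {
  onmessage(msg: MessageLike): void;
}

interface WorkerScopeLike {
  onmessage: ((msg: MessageLike) => void) | null; // stand-in for `self`
}

// Forward every message received by the worker scope to the handler.
function wireHandler(scope: WorkerScopeLike, handler: HandlerLike): void {
  scope.onmessage = (msg) => handler.onmessage(msg);
}
```

Note that nothing in the worker names a model: the handler only reacts to the requests it receives from the main thread.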
Step 2: Initialize the Engine from Main Thread
In the main thread, use the CreateWebWorkerMLCEngine factory function. This creates a Worker instance pointing to the worker script, wraps it in a WebWorkerMLCEngine proxy, and triggers model loading. The factory accepts the same model ID, engine config (including initProgressCallback), and chat options as the main-thread CreateMLCEngine.
What happens:
- A new Worker is instantiated from the worker script URL
- The WebWorkerMLCEngine proxy is created, which serializes API calls into WorkerRequest messages
- The proxy sends a reload request to the worker, which triggers model download and initialization
- Progress updates are forwarded from the worker back to the main thread via WorkerResponse messages
- The initProgressCallback fires on the main thread with loading status
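A minimal sketch of the main-thread side, assuming the progress report carries a progress fraction (0 to 1) and a text field; check these field names against the installed web-llm version. The formatProgress helper is hypothetical, and the commented factory call uses a placeholder worker path and model ID.

```typescript
// Assumed shape of the report passed to initProgressCallback.
interface ProgressReportLike {
  progress: number; // fraction in [0, 1]
  text: string; // human-readable status message
}

// Turn a progress report into a short status line for the UI.
function formatProgress(report: ProgressReportLike): string {
  const clamped = Math.min(1, Math.max(0, report.progress));
  return `${Math.round(clamped * 100)}% ${report.text}`;
}

// Hypothetical main-thread usage (not executed here; it needs a browser and
// the library, and "<model-id>" and "./worker.ts" are placeholders):
//   import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";
//   const engine = await CreateWebWorkerMLCEngine(
//     new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
//     "<model-id>",
//     { initProgressCallback: (r) => { statusEl.textContent = formatProgress(r); } },
//   );
```

Because the callback fires on the main thread, it can update the DOM directly while the download and initialization run in the worker.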
Step 3: Send Chat Completion Requests
Use the same engine.chat.completions.create() API as in the main-thread workflow. The WebWorkerMLCEngine proxy transparently serializes the request, sends it to the worker via postMessage, and deserializes the response. Both streaming and non-streaming modes work identically to the main-thread API.
What happens:
- The proxy generates a unique request ID and serializes the ChatCompletionRequest
- The message is sent to the worker via postMessage
- The worker handler routes it to the internal engine's chatCompletion method
- For streaming, the worker sends back a sequence of WorkerResponse messages (one per chunk)
- The proxy reconstructs the AsyncGenerator interface on the main thread side
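The request/response correlation above can be illustrated with a simplified sketch. This is not web-llm's implementation: RequestProxy, the message shapes, and the counter-based IDs are hypothetical, and only the non-streaming case is shown. The idea is that each call stores a pending promise under its ID and the matching response settles it, so replies may arrive in any order.

```typescript
// Hypothetical, simplified message shapes for illustration.
interface RequestMsg {
  uuid: string;
  kind: string;
  content: unknown;
}

interface ResponseMsg {
  uuid: string;
  kind: "return" | "throw";
  content: unknown;
}

interface PendingEntry {
  resolve: (value: unknown) => void;
  reject: (error: Error) => void;
}

class RequestProxy {
  private nextId = 0;
  private pending = new Map<string, PendingEntry>();
  private send: (msg: RequestMsg) => void;

  // `send` stands in for worker.postMessage.
  constructor(send: (msg: RequestMsg) => void) {
    this.send = send;
  }

  // Assign an ID, remember the pending promise, and post the request.
  request(kind: string, content: unknown): Promise<unknown> {
    const uuid = `req-${this.nextId++}`;
    return new Promise((resolve, reject) => {
      this.pending.set(uuid, { resolve, reject });
      this.send({ uuid, kind, content });
    });
  }

  // Called for every response message; settles the matching promise.
  onmessage(msg: ResponseMsg): void {
    const entry = this.pending.get(msg.uuid);
    if (entry === undefined) return; // unknown or already-settled ID
    this.pending.delete(msg.uuid);
    if (msg.kind === "return") entry.resolve(msg.content);
    else entry.reject(new Error(String(msg.content)));
  }
}
```

A streaming variant would keep the entry alive and push each chunk into a queue that an async generator drains, instead of resolving once.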
Step 4: Process Streaming Responses
Iterate over the streaming response using a for await loop on the main thread. Each chunk arrives via the worker message channel and is yielded by the proxy's AsyncGenerator. Update the UI incrementally as each chunk arrives, maintaining responsiveness throughout the generation process.
Key considerations:
- The for await loop runs on the main thread but suspends while waiting for each chunk, so the event loop stays free to handle rendering and input between chunks
- engine.getMessage() can retrieve the full response after streaming completes
- engine.interruptGenerate() can cancel in-progress generation
- Multiple models can be loaded simultaneously by passing an array of model IDs (see Step 5)
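The consumption loop for this step might look like the sketch below. It assumes an OpenAI-style chunk shape in which the new text sits at choices[0].delta.content; the ChunkLike type and the renderStream helper are hypothetical names, and the commented usage is not executed here.

```typescript
// Assumed minimal shape of a streamed chunk.
interface ChunkLike {
  choices: { delta: { content?: string | null } }[];
}

// Consume a stream chunk by chunk, reporting the accumulated text after
// every chunk that carries new content.
async function renderStream(
  chunks: AsyncIterable<ChunkLike>,
  onUpdate: (textSoFar: string) => void,
): Promise<string> {
  let text = "";
  for await (const chunk of chunks) {
    // Some chunks may carry no choices or no text; skip those.
    const piece = chunk.choices[0]?.delta.content ?? "";
    if (piece.length > 0) {
      text += piece;
      onUpdate(text);
    }
  }
  return text;
}

// Hypothetical usage with a real engine (needs a browser and the library):
//   const chunks = await engine.chat.completions.create({ messages, stream: true });
//   const reply = await renderStream(chunks, (t) => { outputEl.textContent = t; });
```

Passing the accumulated text rather than the delta keeps the UI callback idempotent: it can simply overwrite the output element on every update.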
Step 5: Handle Multi-model Scenarios
Optionally load multiple models into the same worker engine by passing an array of model IDs to CreateWebWorkerMLCEngine. When using multiple models, specify the model parameter in each request to indicate which model should handle it. The engine uses per-model async locks to serialize access to each model, so requests to different models can proceed concurrently.
Key considerations:
- Pass model IDs as an array: [modelId1, modelId2]
- Each request must include a model field to disambiguate
- engine.getMessage(modelId) requires the model ID parameter when multiple models are loaded
- Concurrent requests to different models run in parallel; requests to the same model are queued
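The queueing behavior described above can be illustrated with a small per-key lock. This is a sketch of the general technique, not web-llm's actual lock: KeyedLock is a hypothetical class that chains each task onto the previous task for the same key, so same-key tasks run in order while tasks for other keys never wait on them.

```typescript
// Hypothetical per-key lock, keyed here by model ID.
class KeyedLock {
  // For each key, a promise that settles when the latest queued task is done.
  private tails = new Map<string, Promise<void>>();

  run<T>(key: string, task: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    // Start the task only after the previous task for this key has settled.
    const result = previous.then(task);
    // Store a never-rejecting tail so one failure does not block the queue.
    this.tails.set(
      key,
      result.then(
        () => undefined,
        () => undefined,
      ),
    );
    return result;
  }
}
```

With a loaded engine, the corresponding requests would each carry a model field, for example engine.chat.completions.create({ model: modelId1, messages }), so the engine knows which model's queue the request belongs to.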