# Principle: Mlc ai Web llm Extension Service Worker

## Overview
Pattern for hosting persistent LLM inference in a Chrome Extension service worker with port-based communication and model caching. The extension service worker acts as the backend that holds the actual MLCEngine instance, receives requests from the popup (or other extension pages) via chrome.runtime.Port, and returns inference results through the same port channel.
## Description
The extension service worker pattern extends the Web Worker pattern for Chrome Extensions. Instead of using `postMessage` on a `Worker`, it uses a `chrome.runtime.Port` for communication between the popup and the background script. The handler class (`ServiceWorkerMLCEngineHandler`) extends `WebWorkerMLCEngineHandler` and overrides the communication layer while inheriting all message routing and task handling logic.
Key architectural differences from the Web Worker pattern:

- **Communication channel:** Uses `chrome.runtime.Port` instead of the Web Worker `postMessage` API. The port is established when the popup calls `chrome.runtime.connect()` and the background script receives it via `chrome.runtime.onConnect`.
- **Model caching logic:** When the handler receives a `reload` message, it checks whether the same model is already loaded (matching `modelId` and `chatOpts` via `areArraysEqual` and `areChatOptionsListEqual`). If so, it skips the full reload and immediately reports completion with 100% progress. This optimization is critical for the extension experience because:
  - The popup is destroyed and recreated every time the user clicks the extension icon
  - Each popup creation sends a new `reload` request
  - Without caching, the model would be re-downloaded and re-compiled on every popup open
- **Port lifecycle management:** The handler tracks the current port and handles disconnection events. When a port disconnects (popup closes), the handler sets its port reference to `null` but keeps the engine alive. When a new port connects (popup reopens), `setPort()` updates the reference.
- **Keep-alive message filtering:** The `onmessage` handler filters out `keepAlive` heartbeat messages sent by the client to prevent Chrome from killing the idle service worker.
Inheritance chain: `ServiceWorkerMLCEngineHandler` extends `WebWorkerMLCEngineHandler`, which creates an internal `MLCEngine` and routes all message types (chat completion, embedding, reset, unload, etc.) to the appropriate engine methods. The extension handler only overrides `postMessage` and `onmessage` (to add caching logic for `reload`), and adds port management methods.
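The override structure can be sketched as follows. This is an illustrative reduction, not the library's actual source: `Port` here is a minimal stand-in for `chrome.runtime.Port`, and `handled` stands in for the message routing inherited from `WebWorkerMLCEngineHandler`.

```typescript
// Minimal stand-in for the parts of chrome.runtime.Port used here.
interface Port {
  postMessage(msg: unknown): void;
  onDisconnect: { addListener(cb: () => void): void };
}

// Sketch of the subclass shape described above (not web-llm's source).
class SketchHandler {
  port: Port | null;
  handled: unknown[] = []; // stands in for the inherited message routing

  constructor(port: Port) {
    this.port = port;
    port.onDisconnect.addListener(() => {
      // Popup closed: drop the port reference but keep the engine alive.
      this.port = null;
    });
  }

  setPort(port: Port): void {
    // A new popup connected: reuse the existing engine, swap the port.
    this.port = port;
    port.onDisconnect.addListener(() => {
      this.port = null;
    });
  }

  // Overridden channel: replies go out through the port,
  // not the worker-global postMessage.
  postMessage(msg: unknown): void {
    this.port?.postMessage(msg);
  }

  onmessage(msg: { type?: string }): void {
    // Heartbeats exist only to keep the service worker alive; drop them.
    if (msg.type === "keepAlive") return;
    this.handled.push(msg); // the real class delegates to the parent handler
  }
}
```

The key design point is that the engine's lifetime is tied to the handler instance, not to any one port, which is what lets the model survive popup close/reopen cycles.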
## Usage
Use this for the background script of a Chrome extension that runs LLM inference. The service worker persists the model in memory and serves multiple popup connections.
When to apply:

- Building a Chrome extension with in-browser LLM inference via `@mlc-ai/web-llm`
- The extension needs the model to remain loaded across popup open/close cycles
- The extension requires WebGPU access from the background context

When not to apply:

- Standard web applications (use `WebWorkerMLCEngineHandler` instead)
- Extensions that run inference only in the popup (no background persistence needed)
- Server-side or Node.js contexts
Typical setup pattern in the background script:

- Declare a module-level `handler` variable (initially `undefined`)
- Listen for `chrome.runtime.onConnect` events
- On first connection, create a new `ServiceWorkerMLCEngineHandler` with the port
- On subsequent connections, call `handler.setPort(port)` to update the port
- Always bind `port.onMessage.addListener(handler.onmessage.bind(handler))`
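The popup side is the mirror image of this setup. The sketch below shows only the raw handshake; `connect` is injected so the logic can be exercised without the `chrome.*` APIs (in a real popup you would pass `chrome.runtime.connect`), and the heartbeat interval value is an assumption, not a library default. In practice the popup usually constructs the library's client-side engine proxy, which performs this handshake internally.

```typescript
// Minimal stand-in for the client side of chrome.runtime.Port.
interface ClientPort {
  postMessage(msg: unknown): void;
}

// Popup-side handshake sketch. The port name must match the one the
// background script asserts ("web_llm_service_worker").
function connectToBackground(
  connect: (info: { name: string }) => ClientPort,
  heartbeatMs = 10_000, // illustrative interval, not a library constant
): { port: ClientPort; stopHeartbeat: () => void } {
  const port = connect({ name: "web_llm_service_worker" });
  // Periodic keepAlive heartbeats stop Chrome from terminating the idle
  // service worker (and its in-memory model) after ~30 s of inactivity.
  const timer = setInterval(
    () => port.postMessage({ type: "keepAlive" }),
    heartbeatMs,
  );
  return { port, stopHeartbeat: () => clearInterval(timer) };
}
```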
## Theoretical Basis
The service worker lifecycle in Chrome Extensions is fundamentally different from Web Workers:
- Web Workers are created by a page and live as long as that page is open. They use `postMessage`/`onmessage` for bidirectional communication.
- Extension service workers are event-driven and can be terminated by Chrome after approximately 30 seconds of inactivity. They communicate with extension pages via `chrome.runtime.Port` (for long-lived connections) or `chrome.runtime.sendMessage` (for one-shot messages).
The web-llm library bridges this gap by using the `chrome.runtime.Port` API as a drop-in replacement for the Worker message channel. `ServiceWorkerMLCEngineHandler` overrides `postMessage` to call `this.port?.postMessage(msg)` instead of the global `postMessage`, and its `onmessage` handler receives events from the port's message listener rather than the global `onmessage`.
The model caching optimization (skip reload if model is already loaded) is essential because Chrome's service worker lifecycle means:

- User clicks extension icon -> popup opens -> sends `reload` request
- User closes popup -> service worker may or may not be killed
- User clicks again -> popup opens -> sends another `reload` request
- If the service worker was NOT killed, the model is still in memory and the reload can be skipped
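The skip decision reduces to a structural comparison of the requested model configuration against what is already loaded. A simplified sketch, where `sameStringArrays` mimics the role of web-llm's `areArraysEqual` helper (the real handler also compares chat options via `areChatOptionsListEqual`):

```typescript
// Mimics the role of web-llm's areArraysEqual helper (assumption:
// element-wise string equality; not the library's exact source).
function sameStringArrays(a?: string[], b?: string[]): boolean {
  if (a === undefined || b === undefined) return a === b;
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

// Simplified reload-skip decision: equal configuration means the engine
// already holds this model, so the handler can report progress: 1
// immediately instead of re-downloading and re-compiling.
function shouldSkipReload(
  loadedModelId: string[] | undefined,
  requestedModelId: string[],
): boolean {
  return sameStringArrays(loadedModelId, requestedModelId);
}
```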
## I/O Contract
Input:

- A `chrome.runtime.Port` from `chrome.runtime.onConnect`
- Messages conforming to the `WorkerRequest` protocol (same as `WebWorkerMLCEngineHandler`)
- Special message type `{ type: "keepAlive" }` for heartbeat filtering

Output:

- Messages conforming to the `WorkerResponse` protocol, sent via `port.postMessage()`
- `initProgressCallback` messages during model loading
- `return` messages with inference results
- `throw` messages with error information
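A hedged sketch of the outbound message shapes implied by this contract. Only the `type` discriminants come from the list above; the payload field names are assumptions, not web-llm's exact `WorkerResponse` definition.

```typescript
// Illustrative response shapes; payload fields are assumptions.
type OutboundMsg =
  | { type: "initProgressCallback"; progress: number; text: string }
  | { type: "return"; content: unknown }
  | { type: "throw"; error: string };

// Client-side dispatcher sketch: route each response kind by discriminant.
function describeResponse(msg: OutboundMsg): string {
  switch (msg.type) {
    case "initProgressCallback":
      return `loading ${Math.round(msg.progress * 100)}%`;
    case "return":
      return "result ready";
    case "throw":
      return `error: ${msg.error}`;
  }
}
```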
Reload caching behavior:

| Condition | Behavior |
|---|---|
| `modelId` matches AND `chatOpts` match | Skip reload; emit progress callback with `progress: 1` and GPU label |
| `modelId` differs OR `chatOpts` differ | Perform full `engine.reload()` |
| WebGPU not available (during skip-reload path) | Throw `WebGPUNotFoundError` |
## Usage Examples
Background script setup (from the repository example):
```typescript
import { ExtensionServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

// Hookup an engine to a service worker handler
let handler;

chrome.runtime.onConnect.addListener(function (port) {
  console.assert(port.name === "web_llm_service_worker");
  if (handler === undefined) {
    handler = new ExtensionServiceWorkerMLCEngineHandler(port);
  } else {
    handler.setPort(port);
  }
  port.onMessage.addListener(handler.onmessage.bind(handler));
});
```
Note on exported names: The library exports `ServiceWorkerMLCEngineHandler` as the canonical name from `src/extension_service_worker.ts`. It is also re-exported as `ExtensionServiceWorkerMLCEngineHandler` from the package index for backward compatibility. Both names reference the same class.
How the caching logic works internally when the popup reconnects:
```typescript
// Inside ServiceWorkerMLCEngineHandler.onmessage(), when a "reload"
// message arrives (excerpt, lightly adapted):
if (
  areArraysEqual(this.modelId, params.modelId) &&
  areChatOptionsListEqual(this.chatOpts, params.chatOpts)
) {
  // Model is already loaded with the same configuration.
  // Skip the expensive reload and just report completion.
  log.info("Already loaded the model. Skip loading");
  const gpuDetectOutput = await tvmjs.detectGPUDevice();
  if (gpuDetectOutput == undefined) {
    throw new WebGPUNotFoundError();
  }
  // Derive a human-readable GPU label from the detection result
  // (sketched here from the adapter info; exact wording may differ)
  let gpuLabel = "WebGPU";
  if (gpuDetectOutput.adapterInfo.description.length != 0) {
    gpuLabel += " - " + gpuDetectOutput.adapterInfo.description;
  }
  // Report 100% progress with GPU info
  this.engine.getInitProgressCallback()?.({
    progress: 1,
    timeElapsed: 0,
    text: "Finish loading on " + gpuLabel,
  });
  return null;
}
// Otherwise, perform the full model reload
await this.engine.reload(params.modelId, params.chatOpts);
```
## Related Pages

- Implementation: Mlc_ai_Web_llm_Service_Worker_MLC_Engine_Handler
- Mlc_ai_Web_llm_Chrome_Extension_Manifest - Manifest configuration that registers the service worker
- Mlc_ai_Web_llm_Extension_Client_Engine - The popup-side proxy that connects to this service worker
- Mlc_ai_Web_llm_Page_Content_Access - Content script pattern that can send page data to this service worker