# Principle: Mlc ai Web llm Extension Service Worker

## Overview
Pattern for hosting persistent LLM inference in a Chrome Extension service worker with port-based communication and model caching. The extension service worker acts as the backend that holds the actual MLCEngine instance, receives requests from the popup (or other extension pages) via chrome.runtime.Port, and returns inference results through the same port channel.
## Description
The extension service worker pattern extends the Web Worker pattern for Chrome Extensions. Instead of using `postMessage` on a `Worker`, it uses a `chrome.runtime.Port` for communication between the popup and the background script. The handler class (`ServiceWorkerMLCEngineHandler`) extends `WebWorkerMLCEngineHandler` and overrides the communication layer while inheriting all message routing and task handling logic.
Key architectural differences from the Web Worker pattern:

- **Communication channel:** Uses `chrome.runtime.Port` instead of the Web Worker `postMessage` API. The port is established when the popup calls `chrome.runtime.connect()` and the background script receives it via `chrome.runtime.onConnect`.
- **Model caching logic:** When the handler receives a `reload` message, it checks whether the same model is already loaded (matching `modelId` and `chatOpts` via `areArraysEqual` and `areChatOptionsListEqual`). If so, it skips the full reload and immediately reports completion with 100% progress. This optimization is critical for the extension experience because:
  - The popup is destroyed and recreated every time the user clicks the extension icon
  - Each popup creation sends a new `reload` request
  - Without caching, the model would be re-downloaded and re-compiled on every popup open
- **Port lifecycle management:** The handler tracks the current port and handles disconnection events. When a port disconnects (popup closes), the handler sets its port reference to `null` but keeps the engine alive. When a new port connects (popup reopens), `setPort()` updates the reference.
- **Keep-alive message filtering:** The `onmessage` handler filters out `keepAlive` heartbeat messages sent by the client to prevent Chrome from killing the idle service worker.
Inheritance chain: `ServiceWorkerMLCEngineHandler` extends `WebWorkerMLCEngineHandler`, which creates an internal `MLCEngine` and routes all message types (chat completion, embedding, reset, unload, etc.) to the appropriate engine methods. The extension handler only overrides `postMessage` and `onmessage` (to add caching logic for `reload`), and adds port management methods.
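The override structure can be sketched as follows. This is an illustrative reduction, not the library's actual source: `Port` here is a minimal stand-in for `chrome.runtime.Port`, and `handled` stands in for the message routing inherited from `WebWorkerMLCEngineHandler`.

```typescript
// Minimal stand-in for the parts of chrome.runtime.Port used here.
interface Port {
  postMessage(msg: unknown): void;
  onDisconnect: { addListener(cb: () => void): void };
}

// Sketch of the subclass shape described above (not web-llm's source).
class SketchHandler {
  port: Port | null;
  handled: unknown[] = []; // stands in for the inherited message routing

  constructor(port: Port) {
    this.port = port;
    port.onDisconnect.addListener(() => {
      // Popup closed: drop the port reference but keep the engine alive.
      this.port = null;
    });
  }

  setPort(port: Port): void {
    // A new popup connected: reuse the existing engine, swap the port.
    this.port = port;
    port.onDisconnect.addListener(() => {
      this.port = null;
    });
  }

  // Overridden channel: replies go out through the port,
  // not the worker-global postMessage.
  postMessage(msg: unknown): void {
    this.port?.postMessage(msg);
  }

  onmessage(msg: { type?: string }): void {
    // Heartbeats exist only to keep the service worker alive; drop them.
    if (msg.type === "keepAlive") return;
    this.handled.push(msg); // the real class delegates to the parent handler
  }
}
```

The key design point is that the engine's lifetime is tied to the handler instance, not to any one port, which is what lets the model survive popup close/reopen cycles.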
## Usage
Use this for the background script of a Chrome extension that runs LLM inference. The service worker persists the model in memory and serves multiple popup connections.
When to apply:

- Building a Chrome extension with in-browser LLM inference via `@mlc-ai/web-llm`
- The extension needs the model to remain loaded across popup open/close cycles
- The extension requires WebGPU access from the background context

When not to apply:

- Standard web applications (use `WebWorkerMLCEngineHandler` instead)
- Extensions that run inference only in the popup (no background persistence needed)
- Server-side or Node.js contexts
Typical setup pattern in the background script:

- Declare a module-level `handler` variable (initially `undefined`)
- Listen for `chrome.runtime.onConnect` events
- On first connection, create a new `ServiceWorkerMLCEngineHandler` with the port
- On subsequent connections, call `handler.setPort(port)` to update the port
- Always bind `port.onMessage.addListener(handler.onmessage.bind(handler))`
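The popup side is the mirror image of this setup. The sketch below shows only the raw handshake; `connect` is injected so the logic can be exercised without the `chrome.*` APIs (in a real popup you would pass `chrome.runtime.connect`), and the heartbeat interval value is an assumption, not a library default. In practice the popup usually constructs the library's client-side engine proxy, which performs this handshake internally.

```typescript
// Minimal stand-in for the client side of chrome.runtime.Port.
interface ClientPort {
  postMessage(msg: unknown): void;
}

// Popup-side handshake sketch. The port name must match the one the
// background script asserts ("web_llm_service_worker").
function connectToBackground(
  connect: (info: { name: string }) => ClientPort,
  heartbeatMs = 10_000, // illustrative interval, not a library constant
): { port: ClientPort; stopHeartbeat: () => void } {
  const port = connect({ name: "web_llm_service_worker" });
  // Periodic keepAlive heartbeats stop Chrome from terminating the idle
  // service worker (and its in-memory model) after ~30 s of inactivity.
  const timer = setInterval(
    () => port.postMessage({ type: "keepAlive" }),
    heartbeatMs,
  );
  return { port, stopHeartbeat: () => clearInterval(timer) };
}
```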
## Theoretical Basis
The service worker lifecycle in Chrome Extensions is fundamentally different from Web Workers:
- Web Workers are created by a page and live as long as that page is open. They use `postMessage`/`onmessage` for bidirectional communication.
- Extension service workers are event-driven and can be terminated by Chrome after approximately 30 seconds of inactivity. They communicate with extension pages via `chrome.runtime.Port` (for long-lived connections) or `chrome.runtime.sendMessage` (for one-shot messages).
The web-llm library bridges this gap by using the `chrome.runtime.Port` API as a drop-in replacement for the Worker message channel. `ServiceWorkerMLCEngineHandler` overrides `postMessage` to call `this.port?.postMessage(msg)` instead of the global `postMessage`, and its `onmessage` handler receives events from the port's message listener rather than the global `onmessage`.
The model caching optimization (skip reload if model is already loaded) is essential because Chrome's service worker lifecycle means:

- User clicks extension icon -> popup opens -> sends `reload` request
- User closes popup -> service worker may or may not be killed
- User clicks again -> popup opens -> sends another `reload` request
- If the service worker was NOT killed, the model is still in memory and the reload can be skipped
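The skip decision reduces to a structural comparison of the requested model configuration against what is already loaded. A simplified sketch, where `sameStringArrays` mimics the role of web-llm's `areArraysEqual` helper (the real handler also compares chat options via `areChatOptionsListEqual`):

```typescript
// Mimics the role of web-llm's areArraysEqual helper (assumption:
// element-wise string equality; not the library's exact source).
function sameStringArrays(a?: string[], b?: string[]): boolean {
  if (a === undefined || b === undefined) return a === b;
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

// Simplified reload-skip decision: equal configuration means the engine
// already holds this model, so the handler can report progress: 1
// immediately instead of re-downloading and re-compiling.
function shouldSkipReload(
  loadedModelId: string[] | undefined,
  requestedModelId: string[],
): boolean {
  return sameStringArrays(loadedModelId, requestedModelId);
}
```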
## I/O Contract
Input:

- A `chrome.runtime.Port` from `chrome.runtime.onConnect`
- Messages conforming to the `WorkerRequest` protocol (same as `WebWorkerMLCEngineHandler`)
- Special message type `{ type: "keepAlive" }` for heartbeat filtering

Output:

- Messages conforming to the `WorkerResponse` protocol, sent via `port.postMessage()`
- `initProgressCallback` messages during model loading
- `return` messages with inference results
- `throw` messages with error information
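A hedged sketch of the outbound message shapes implied by this contract. Only the `type` discriminants come from the list above; the payload field names are assumptions, not web-llm's exact `WorkerResponse` definition.

```typescript
// Illustrative response shapes; payload fields are assumptions.
type OutboundMsg =
  | { type: "initProgressCallback"; progress: number; text: string }
  | { type: "return"; content: unknown }
  | { type: "throw"; error: string };

// Client-side dispatcher sketch: route each response kind by discriminant.
function describeResponse(msg: OutboundMsg): string {
  switch (msg.type) {
    case "initProgressCallback":
      return `loading ${Math.round(msg.progress * 100)}%`;
    case "return":
      return "result ready";
    case "throw":
      return `error: ${msg.error}`;
  }
}
```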
Reload caching behavior:

| Condition | Behavior |
|---|---|
| `modelId` matches AND `chatOpts` match | Skip reload; emit progress callback with `progress: 1` and GPU label |
| `modelId` differs OR `chatOpts` differ | Perform full `engine.reload()` |
| WebGPU not available (during skip-reload path) | Throw `WebGPUNotFoundError` |
## Usage Examples
Background script setup (from the repository example):
```typescript
import { ExtensionServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

// Hookup an engine to a service worker handler
let handler;

chrome.runtime.onConnect.addListener(function (port) {
  console.assert(port.name === "web_llm_service_worker");
  if (handler === undefined) {
    handler = new ExtensionServiceWorkerMLCEngineHandler(port);
  } else {
    handler.setPort(port);
  }
  port.onMessage.addListener(handler.onmessage.bind(handler));
});
```
Note on exported names: The library exports `ServiceWorkerMLCEngineHandler` as the canonical name from `src/extension_service_worker.ts`. It is also re-exported as `ExtensionServiceWorkerMLCEngineHandler` from the package index for backward compatibility. Both names reference the same class.
How the caching logic works internally when the popup reconnects:
```typescript
// Inside ServiceWorkerMLCEngineHandler.onmessage(), when a "reload"
// message arrives (excerpt, lightly adapted):
if (
  areArraysEqual(this.modelId, params.modelId) &&
  areChatOptionsListEqual(this.chatOpts, params.chatOpts)
) {
  // Model is already loaded with the same configuration.
  // Skip the expensive reload and just report completion.
  log.info("Already loaded the model. Skip loading");
  const gpuDetectOutput = await tvmjs.detectGPUDevice();
  if (gpuDetectOutput == undefined) {
    throw new WebGPUNotFoundError();
  }
  // Derive a human-readable GPU label from the detection result
  // (sketched here from the adapter info; exact wording may differ)
  let gpuLabel = "WebGPU";
  if (gpuDetectOutput.adapterInfo.description.length != 0) {
    gpuLabel += " - " + gpuDetectOutput.adapterInfo.description;
  }
  // Report 100% progress with GPU info
  this.engine.getInitProgressCallback()?.({
    progress: 1,
    timeElapsed: 0,
    text: "Finish loading on " + gpuLabel,
  });
  return null;
}
// Otherwise, perform the full model reload
await this.engine.reload(params.modelId, params.chatOpts);
```
## Related Pages

- Implementation: Mlc_ai_Web_llm_Service_Worker_MLC_Engine_Handler
- Mlc_ai_Web_llm_Chrome_Extension_Manifest - Manifest configuration that registers the service worker
- Mlc_ai_Web_llm_Extension_Client_Engine - The popup-side proxy that connects to this service worker
- Mlc_ai_Web_llm_Page_Content_Access - Content script pattern that can send page data to this service worker