Principle:Mlc ai Web llm Cross Thread Request Forwarding

From Leeroopedia

Overview

Cross-thread request forwarding transparently relays OpenAI-compatible API calls across Web Worker thread boundaries via message serialization. It ensures that ChatCompletionRequest, CompletionCreateParams, and EmbeddingCreateParams objects are faithfully transmitted from the main-thread proxy to the worker-side engine, and that results are correctly returned.

Description

Cross-thread request forwarding bridges the gap between the main thread's synchronous-looking API calls and the worker's asynchronous message-based interface. The forwarding mechanism handles two distinct flows:

Non-Streaming Flow

For non-streaming requests, the flow is straightforward:

  1. The main-thread proxy's chatCompletion() method validates that a model is loaded
  2. It uses getModelIdToUse() to resolve which model should handle the request
  3. It constructs a WorkerRequest with kind: "chatCompletionNonStreaming" and sends it via getPromise()
  4. The worker handler receives the message, calls reloadIfUnmatched() to ensure the correct model is loaded, then calls this.engine.chatCompletion(request)
  5. The result (ChatCompletion) is serialized back as a "return" response
  6. The proxy resolves the pending promise with the result

The request content for non-streaming chat completion includes:

interface ChatCompletionNonStreamingParams {
  request: ChatCompletionRequestNonStreaming;
  modelId: string[];       // Expected loaded models (for reloadIfUnmatched)
  chatOpts?: ChatOptions[]; // Expected chat options
}

Streaming Flow

For streaming requests, the forwarding uses a two-phase protocol:

Phase 1 -- Initialization:

  1. The proxy sends a chatCompletionStreamInit message containing the full request, the resolved selectedModelId, and the expected model state
  2. The worker handler creates an AsyncGenerator by calling this.engine.chatCompletion(request) with stream: true
  3. The generator is stored in loadedModelIdToAsyncGenerator keyed by selectedModelId
  4. A null return confirms initialization

Phase 2 -- Chunk Retrieval:

  1. The proxy returns its own asyncGenerate(selectedModelId) generator to the caller
  2. Each time the caller iterates (e.g., for await (const chunk of stream)), the proxy's generator sends a completionStreamNextChunk message
  3. The worker looks up the generator for selectedModelId and calls .next()
  4. The yielded ChatCompletionChunk is returned to the proxy
  5. When the generator is exhausted, it returns void, and the proxy breaks out of its loop

The stream initialization parameters:

interface ChatCompletionStreamInitParams {
  request: ChatCompletionRequestStreaming;
  selectedModelId: string;   // Which model's generator to create
  modelId: string[];          // Expected loaded models
  chatOpts?: ChatOptions[];   // Expected chat options
}

Serialization

All message content is serialized using the browser's structured clone algorithm (the default for postMessage). This means:

  • Plain objects, arrays, strings, numbers, and booleans are deeply cloned
  • Float32Array and other typed arrays are cloned as well (and can be transferred when passed in a transfer list)
  • Functions, DOM nodes, and class instances with methods are not serializable -- this is why logitProcessorRegistry cannot be used across the worker boundary
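These constraints can be observed directly with the global structuredClone(), which applies the same algorithm postMessage uses:

```typescript
// Demonstrating the structured clone rules.

const payload = {
  text: "hi",
  logits: new Float32Array([0.1, 0.9]),
  nested: { n: 1 },
};

const copy = structuredClone(payload);
// Deep clone: typed arrays survive, and nested objects are new instances.
console.log(copy.logits instanceof Float32Array); // true
console.log(copy.nested !== payload.nested); // true

// Functions cannot be cloned -- the reason an object with methods, such as
// a logit processor, cannot be sent across the worker boundary.
let threw = false;
try {
  structuredClone({ processor: (x: number) => x });
} catch {
  threw = true; // DataCloneError
}
console.log(threw); // true
```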

State Consistency

The proxy sends its expected modelId[] and chatOpts[] with every inference request. The worker handler's reloadIfUnmatched() compares these against its current state. If they differ (e.g., a service worker was killed and restarted, losing its in-memory engine state), the handler automatically reloads the expected model before processing the request. The worker therefore recovers its expected state transparently, without the caller having to detect the restart or retry.

Usage

This pattern is used implicitly whenever you call any inference API on a WebWorkerMLCEngine:

// Non-streaming -- forwarded as chatCompletionNonStreaming
const result = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain quantum computing" }],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
});

// Streaming -- forwarded as chatCompletionStreamInit + completionStreamNextChunk
const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Write a poem" }],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  stream: true,
});
let reply = "";
for await (const chunk of stream) {
  reply += chunk.choices[0]?.delta?.content || "";
}
console.log(reply);

Theoretical Basis

The forwarding mechanism implements a form of Remote Procedure Call (RPC) over the postMessage channel. Key properties:

  • Correlation: Each request gets a unique UUID via crypto.randomUUID(), enabling multiplexed request-response matching over a single bidirectional channel.
  • Transparency: The caller cannot distinguish between a local MLCEngine call and a proxied WebWorkerMLCEngine call -- the interface is identical.
  • Error propagation: Exceptions thrown in the worker are caught, stringified, and sent back as "throw"-kind responses. The proxy then rejects the corresponding promise with the error message.
  • Ordering: Requests are sent in order, but responses may arrive out of order (each is correlated by UUID). The worker engine internally uses per-model CustomLock to serialize requests to the same model.
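The correlation and error-propagation properties can be sketched together as a pending-request table keyed by UUID (PendingRequests and RpcResponse are illustrative names, not the library's actual symbols):

```typescript
// Sketch of UUID-correlated request/response matching with "throw"
// error propagation over a single bidirectional channel.

type RpcResponse =
  | { kind: "return"; uuid: string; content: unknown }
  | { kind: "throw"; uuid: string; content: string };

class PendingRequests {
  private map = new Map<
    string,
    { resolve: (v: unknown) => void; reject: (e: Error) => void }
  >();

  // Register a pending request under its uuid and hand back the promise.
  add(uuid: string): Promise<unknown> {
    return new Promise((resolve, reject) => {
      this.map.set(uuid, { resolve, reject });
    });
  }

  // Responses may arrive in any order; the uuid picks the right promise.
  settle(res: RpcResponse): void {
    const entry = this.map.get(res.uuid);
    if (entry === undefined) return;
    this.map.delete(res.uuid);
    if (res.kind === "return") {
      entry.resolve(res.content);
    } else {
      entry.reject(new Error(res.content)); // stringified worker exception
    }
  }
}
```

Settling responses out of order still resolves or rejects the right caller, which is what allows many in-flight requests to share one postMessage channel.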

The forwarding also works for non-chat APIs:

  • engine.chat.completions.create({stream: false}) -- kind chatCompletionNonStreaming, returns ChatCompletion
  • engine.chat.completions.create({stream: true}) -- kinds chatCompletionStreamInit + completionStreamNextChunk, returns AsyncGenerator<ChatCompletionChunk>
  • engine.completions.create({stream: false}) -- kind completionNonStreaming, returns Completion
  • engine.completions.create({stream: true}) -- kinds completionStreamInit + completionStreamNextChunk, returns AsyncGenerator<Completion>
  • engine.embeddings.create() -- kind embedding, returns CreateEmbeddingResponse

Related Pages

Implementation:Mlc_ai_Web_llm_Web_Worker_Chat_Completion
