Principle: Mlc_ai_Web_llm_Cross_Thread_Request_Forwarding
Overview
Cross-Thread Request Forwarding is the technique for transparently forwarding OpenAI-compatible API calls across Web Worker thread boundaries via message serialization. This ensures that ChatCompletionRequest, CompletionCreateParams, and EmbeddingCreateParams objects are faithfully transmitted from the main thread proxy to the worker-side engine, and that results are correctly returned.
Description
Cross-thread request forwarding bridges the gap between the main thread's synchronous-looking API calls and the worker's asynchronous message-based interface. The forwarding mechanism handles two distinct flows:
Non-Streaming Flow
For non-streaming requests, the flow is straightforward:
- The main-thread proxy's `chatCompletion()` method validates that a model is loaded
- It uses `getModelIdToUse()` to resolve which model should handle the request
- It constructs a `WorkerRequest` with `kind: "chatCompletionNonStreaming"` and sends it via `getPromise()`
- The worker handler receives the message, calls `reloadIfUnmatched()` to ensure the correct model is loaded, then calls `this.engine.chatCompletion(request)`
- The result (`ChatCompletion`) is serialized back as a `"return"` response
- The proxy resolves the pending promise with the result
The request content for non-streaming chat completion includes:
```typescript
interface ChatCompletionNonStreamingParams {
  request: ChatCompletionRequestNonStreaming;
  modelId: string[]; // Expected loaded models (for reloadIfUnmatched)
  chatOpts?: ChatOptions[]; // Expected chat options
}
```
Streaming Flow
For streaming requests, the forwarding uses a two-phase protocol:
Phase 1 -- Initialization:
- The proxy sends a `chatCompletionStreamInit` message containing the full request, the resolved `selectedModelId`, and the expected model state
- The worker handler creates an `AsyncGenerator` by calling `this.engine.chatCompletion(request)` with `stream: true`
- The generator is stored in `loadedModelIdToAsyncGenerator`, keyed by `selectedModelId`
- A `null` return confirms initialization
Phase 2 -- Chunk Retrieval:
- The proxy returns its own `asyncGenerate(selectedModelId)` generator to the caller
- Each time the caller iterates (e.g., `for await (const chunk of stream)`), the proxy's generator sends a `completionStreamNextChunk` message
- The worker looks up the generator for `selectedModelId` and calls `.next()`
- The yielded `ChatCompletionChunk` is returned to the proxy
- When the generator is exhausted, it returns `void`, and the proxy breaks out of its loop
The stream initialization parameters:
```typescript
interface ChatCompletionStreamInitParams {
  request: ChatCompletionRequestStreaming;
  selectedModelId: string; // Which model's generator to create
  modelId: string[]; // Expected loaded models
  chatOpts?: ChatOptions[]; // Expected chat options
}
```
Serialization
All message content is serialized using the browser's structured clone algorithm (the default for postMessage). This means:
- Plain objects, arrays, strings, numbers, and booleans are deeply cloned
- `Float32Array` and other typed arrays are cloned as well (and can be moved zero-copy when passed as transferables)
- Functions, DOM nodes, and class instances with methods are not serializable -- this is why a `logitProcessorRegistry` cannot be used across the worker boundary
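These rules can be checked directly with `structuredClone()`, which applies the same algorithm `postMessage` uses for its payload:

```typescript
// What the structured clone algorithm does and does not accept.

const payload = {
  messages: [{ role: "user", content: "hi" }],
  temperature: 0.7,
  embedding: new Float32Array([0.1, 0.2]),
};

// Plain data and typed arrays come through as a deep copy.
const cloned = structuredClone(payload);

// A function-bearing object (like a logit processor) cannot cross the
// boundary: cloning it throws a DataCloneError.
let cloneFailed = false;
try {
  structuredClone({ processLogits: (x: number[]) => x });
} catch {
  cloneFailed = true;
}
```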
State Consistency
The proxy sends its expected `modelId[]` and `chatOpts[]` with every inference request. The worker handler's `reloadIfUnmatched()` compares these against its current state. If they differ (e.g., a service worker was killed and restarted), the handler automatically reloads the expected model before processing the request. Because every request carries the state it expects, the worker can recover automatically even after losing all of its in-memory state.
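The comparison can be sketched as follows; the `reload()` stub and the exact signature are assumptions standing in for the real `engine.reload()`:

```typescript
// Sketch of the state-consistency check: compare the request's expected
// modelId[] against the worker's currently loaded models and reload on
// any mismatch.

let loadedModels: string[] = [];

async function reload(modelId: string[]): Promise<void> {
  loadedModels = [...modelId]; // stub: real code downloads and initializes
}

async function reloadIfUnmatched(expected: string[]): Promise<boolean> {
  const matched =
    expected.length === loadedModels.length &&
    expected.every((id, i) => id === loadedModels[i]);
  if (!matched) await reload(expected); // e.g. service worker was restarted
  return !matched; // true if a reload was needed
}

// First request after a restart triggers a reload; the next one does not.
const reloaded = await reloadIfUnmatched(["Llama-3.1-8B-Instruct-q4f16_1-MLC"]);
const reloadedAgain = await reloadIfUnmatched(["Llama-3.1-8B-Instruct-q4f16_1-MLC"]);
```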
Usage
This pattern is used implicitly whenever you call any inference API on a WebWorkerMLCEngine:
```typescript
// Non-streaming -- forwarded as chatCompletionNonStreaming
const result = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain quantum computing" }],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
});

// Streaming -- forwarded as chatCompletionStreamInit + completionStreamNextChunk
const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Write a poem" }],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  stream: true,
});
for await (const chunk of stream) {
  console.log(chunk.choices[0]?.delta?.content || "");
}
```
Theoretical Basis
The forwarding mechanism implements a form of Remote Procedure Call (RPC) over the postMessage channel. Key properties:
- Correlation: Each request gets a unique UUID via `crypto.randomUUID()`, enabling multiplexed request-response matching over a single bidirectional channel.
- Transparency: The caller cannot distinguish between a local `MLCEngine` call and a proxied `WebWorkerMLCEngine` call -- the interface is identical.
- Error propagation: Exceptions thrown in the worker are caught, stringified, and sent back as `"throw"`-kind responses. The proxy then rejects the corresponding promise with the error message.
- Ordering: Requests are sent in order, but responses may arrive out of order (each is correlated by UUID). The worker engine internally uses a per-model `CustomLock` to serialize requests to the same model.
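The per-model serialization can be illustrated with a promise-chain lock. `SimpleLock` below is a minimal sketch in the spirit of the `CustomLock` mentioned above, not the library's implementation:

```typescript
// A promise-chain mutex: each task is queued behind the previous one,
// so two concurrent requests to the same model run strictly in order.

class SimpleLock {
  private tail: Promise<void> = Promise.resolve();

  run<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    // Advance the tail whether fn resolves or rejects, so one failed
    // request does not wedge the queue.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const perModelLock = new Map<string, SimpleLock>();

function withModelLock<T>(modelId: string, fn: () => Promise<T>): Promise<T> {
  let lock = perModelLock.get(modelId);
  if (!lock) {
    lock = new SimpleLock();
    perModelLock.set(modelId, lock);
  }
  return lock.run(fn);
}

// The slow first request still finishes before the fast second one.
const order: number[] = [];
await Promise.all([
  withModelLock("m", async () => {
    await new Promise((r) => setTimeout(r, 10));
    order.push(1);
  }),
  withModelLock("m", async () => {
    order.push(2);
  }),
]);
```

Keeping one lock per model (rather than one global lock) is what allows requests to different loaded models to proceed concurrently.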
The forwarding also works for non-chat APIs:
| API Call | Request Kind | Response Type |
|---|---|---|
| `engine.chat.completions.create({stream: false})` | `chatCompletionNonStreaming` | `ChatCompletion` |
| `engine.chat.completions.create({stream: true})` | `chatCompletionStreamInit` + `completionStreamNextChunk` | `AsyncGenerator<ChatCompletionChunk>` |
| `engine.completions.create({stream: false})` | `completionNonStreaming` | `Completion` |
| `engine.completions.create({stream: true})` | `completionStreamInit` + `completionStreamNextChunk` | `AsyncGenerator<Completion>` |
| `engine.embeddings.create()` | `embedding` | `CreateEmbeddingResponse` |
Related Pages
- Implementation:Mlc_ai_Web_llm_Web_Worker_Chat_Completion -- Concrete implementation of chat completion forwarding
- Principle:Mlc_ai_Web_llm_Web_Worker_Engine_Proxy -- The proxy pattern that uses this forwarding mechanism
- Principle:Mlc_ai_Web_llm_Web_Worker_Engine_Handler -- The handler that receives forwarded requests
- Principle:Mlc_ai_Web_llm_Cross_Thread_Streaming -- Streaming-specific forwarding details
- Principle:Mlc_ai_Web_llm_Multi_Model_Routing -- How model selection works during forwarding