Implementation:Mlc_ai_Web_llm_Web_Worker_Chat_Completion
Overview
WebWorkerMLCEngine.chatCompletion() is the proxy-side implementation of chat completion that transparently forwards OpenAI-compatible ChatCompletionRequest objects across the Web Worker thread boundary. It handles both streaming and non-streaming requests, returning either a ChatCompletion object or an AsyncIterable<ChatCompletionChunk>.
Description
The chatCompletion() method on WebWorkerMLCEngine implements the same overloaded signatures as MLCEngine.chatCompletion():
chatCompletion(request: ChatCompletionRequestNonStreaming): Promise<ChatCompletion>;
chatCompletion(request: ChatCompletionRequestStreaming): Promise<AsyncIterable<ChatCompletionChunk>>;
chatCompletion(request: ChatCompletionRequestBase): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
The method first validates that a model is loaded (throwing WorkerEngineModelNotLoadedError if not), then resolves the target model using getModelIdToUse(). Based on the stream flag, it follows one of two paths:
Non-streaming path:
- Constructs a WorkerRequest with kind: "chatCompletionNonStreaming"
- Packs the request, expected modelId[], and chatOpts[] into the content
- Sends via getPromise<ChatCompletion>() and returns the promise
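The forwarding mechanism behind getPromise() can be pictured as a uuid-keyed pending map: each outgoing message carries a fresh uuid, and the matching reply from the worker resolves the stored promise. The sketch below is a simplified, hypothetical stand-in (the real WorkerRequest type and getPromise() live in src/web_worker.ts), shown only to make the correlation pattern concrete:

```typescript
// Hypothetical, simplified shape of the message envelope.
interface WorkerRequest {
  kind: string;
  uuid: string;
  content: unknown;
}

// Sketch of the uuid-to-promise correlation that getPromise() relies on.
class RequestBroker {
  private pending = new Map<string, (value: unknown) => void>();

  constructor(private post: (msg: WorkerRequest) => void) {}

  // Register the uuid, post the message, and await the matching reply.
  getPromise<T>(msg: WorkerRequest): Promise<T> {
    return new Promise<T>((resolve) => {
      this.pending.set(msg.uuid, resolve as (value: unknown) => void);
      this.post(msg);
    });
  }

  // Invoked when the worker replies; resolves the promise registered
  // under the same uuid and clears the pending entry.
  onReply(uuid: string, value: unknown): void {
    const resolve = this.pending.get(uuid);
    if (resolve) {
      this.pending.delete(uuid);
      resolve(value);
    }
  }
}
```

This is why every WorkerRequest in the code reference below includes a crypto.randomUUID() field: it is the correlation key for the round trip.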
Streaming path:
- Constructs a WorkerRequest with kind: "chatCompletionStreamInit"
- Includes the selectedModelId so the worker knows which generator to create
- Awaits the initialization response (null)
- Returns this.asyncGenerate(selectedModelId) -- a local AsyncGenerator that fetches chunks from the worker one by one
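The chunk loop inside asyncGenerate() follows a pull model: after the stream is initialized, the client requests one chunk per round trip until the worker signals completion. The sketch below captures that loop with a hypothetical fetchNextChunk callback standing in for the real per-chunk message exchange; the Chunk shape is a simplified stand-in for ChatCompletionChunk:

```typescript
// Simplified chunk shape; the real type is ChatCompletionChunk.
interface Chunk {
  content: string;
  done: boolean;
}

// Pull-based generator sketch: one worker round trip per yielded chunk,
// terminating when the worker reports the stream is finished.
async function* chunkGenerator(
  fetchNextChunk: () => Promise<Chunk>,
): AsyncGenerator<Chunk, void, void> {
  while (true) {
    const chunk = await fetchNextChunk();
    if (chunk.done) return;
    yield chunk;
  }
}
```

Because the generator lives on the client side, backpressure is natural: the worker only produces the next chunk when the consumer's for await loop asks for it.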
The same pattern also applies to completion() (L727-784) for text completions and embedding() (L786-802) for embeddings, with corresponding request kinds.
Code Reference
Source: src/web_worker.ts, Lines 668-725
async chatCompletion(
request: ChatCompletionRequestNonStreaming,
): Promise<ChatCompletion>;
async chatCompletion(
request: ChatCompletionRequestStreaming,
): Promise<AsyncIterable<ChatCompletionChunk>>;
async chatCompletion(
request: ChatCompletionRequestBase,
): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
async chatCompletion(
request: ChatCompletionRequest,
): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion> {
if (this.modelId === undefined) {
throw new WorkerEngineModelNotLoadedError(this.constructor.name);
}
const selectedModelId = getModelIdToUse(
this.modelId ? this.modelId : [],
request.model,
"ChatCompletionRequest",
);
if (request.stream) {
// First let worker instantiate a generator
const msg: WorkerRequest = {
kind: "chatCompletionStreamInit",
uuid: crypto.randomUUID(),
content: {
request: request,
selectedModelId: selectedModelId,
modelId: this.modelId,
chatOpts: this.chatOpts,
},
};
await this.getPromise<null>(msg);
// Then return an async chunk generator that resides on the client side
return this.asyncGenerate(selectedModelId) as AsyncGenerator<
ChatCompletionChunk,
void,
void
>;
}
// Non streaming case is more straightforward
const msg: WorkerRequest = {
kind: "chatCompletionNonStreaming",
uuid: crypto.randomUUID(),
content: {
request: request,
modelId: this.modelId,
chatOpts: this.chatOpts,
},
};
return await this.getPromise<ChatCompletion>(msg);
}
I/O Contract
Input: A ChatCompletionRequest object following the OpenAI API format, containing:
- messages -- Array of ChatCompletionMessageParam (system, user, assistant, tool messages)
- model (optional) -- Target model ID; required when multiple models are loaded
- stream (optional) -- If true, returns an async iterable of chunks
- temperature, top_p, max_tokens, stop, etc. -- Generation parameters
- tools (optional) -- For function calling support
- stream_options (optional) -- E.g., { include_usage: true } for usage stats in streaming
Output (non-streaming): A Promise<ChatCompletion> containing:
- choices[].message.content -- The generated text
- choices[].message.tool_calls -- Function calls (if tools were provided)
- choices[].finish_reason -- Why generation stopped ("stop", "length", "tool_calls")
- usage -- Token counts and performance metrics
Output (streaming): A Promise<AsyncIterable<ChatCompletionChunk>> yielding:
- choices[].delta.content -- Incremental text for each chunk
- choices[].delta.role -- "assistant"
- choices[].finish_reason -- null until the final chunk
Error Conditions:
- WorkerEngineModelNotLoadedError -- If reload() was not called
- SpecifiedModelNotFoundError -- If the requested model is not among the loaded models
- UnclearModelToUseError -- If multiple models are loaded but request.model is not specified
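The model-resolution rules that produce these errors can be summarized as a small pure function. This is a hypothetical re-implementation for illustration only (the real logic lives in getModelIdToUse(), and the real errors are typed classes rather than plain Error):

```typescript
// Hedged sketch of the resolution rules behind getModelIdToUse().
// Error messages name the corresponding real error classes.
function resolveModelId(
  loadedModelIds: string[],
  requestedModel: string | undefined,
): string {
  if (loadedModelIds.length === 0) {
    // Corresponds to WorkerEngineModelNotLoadedError.
    throw new Error("WorkerEngineModelNotLoadedError: call reload() first");
  }
  if (requestedModel !== undefined) {
    if (!loadedModelIds.includes(requestedModel)) {
      // Corresponds to SpecifiedModelNotFoundError.
      throw new Error(`SpecifiedModelNotFoundError: ${requestedModel}`);
    }
    return requestedModel;
  }
  if (loadedModelIds.length > 1) {
    // Corresponds to UnclearModelToUseError.
    throw new Error("UnclearModelToUseError: specify request.model");
  }
  // Exactly one model loaded and none requested: use it.
  return loadedModelIds[0];
}
```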
Import
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";
// chatCompletion is accessed via the engine instance
const engine = await CreateWebWorkerMLCEngine(worker, modelId);
const result = await engine.chatCompletion(request);
// Or equivalently:
const result = await engine.chat.completions.create(request);
Usage Examples
Non-streaming chat completion:
const response = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is the capital of France?" },
],
model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
temperature: 0.7,
max_tokens: 256,
});
console.log(response.choices[0].message.content);
// "The capital of France is Paris."
console.log(response.usage);
// { prompt_tokens: 28, completion_tokens: 8, total_tokens: 36, extra: {...} }
Streaming chat completion:
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Write a haiku about programming." }],
model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
stream: true,
stream_options: { include_usage: true },
});
let fullResponse = "";
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content || "";
fullResponse += delta;
// Update UI incrementally
  document.getElementById("output")!.textContent = fullResponse;
}
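When stream_options: { include_usage: true } is set, the usage record arrives on the final chunk, whose choices array may be empty. A small helper that accumulates text and captures that trailing usage can be sketched as follows (SimpleChunk is a simplified stand-in for ChatCompletionChunk):

```typescript
// Simplified chunk shape for illustration; mirrors the fields used below.
interface SimpleChunk {
  choices: { delta: { content?: string } }[];
  usage?: { total_tokens: number };
}

// Accumulate streamed deltas and pick up the usage record from the
// final chunk (present only when include_usage was requested).
async function collectStream(
  stream: AsyncIterable<SimpleChunk>,
): Promise<{ text: string; totalTokens?: number }> {
  let text = "";
  let totalTokens: number | undefined;
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content ?? "";
    if (chunk.usage) totalTokens = chunk.usage.total_tokens;
  }
  return { text, totalTokens };
}
```

Guarding with choices[0]?. matters because the usage-bearing chunk carries no choices.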
Function calling through the proxy:
const response = await engine.chat.completions.create({
messages: [{ role: "user", content: "What's the weather in Paris?" }],
model: "Hermes-2-Pro-Llama-3-8B-q4f16_1-MLC",
tools: [{
type: "function",
function: {
name: "get_weather",
description: "Get the weather for a city",
parameters: {
type: "object",
properties: { city: { type: "string" } },
required: ["city"],
},
},
}],
});
if (response.choices[0].message.tool_calls) {
console.log(response.choices[0].message.tool_calls);
// [{ id: "0", type: "function", function: { name: "get_weather", arguments: '{"city":"Paris"}' } }]
}
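Note that function.arguments arrives as a JSON-encoded string, so it must be parsed before dispatching to application code. The helper below is a hypothetical sketch of that dispatch step, with simplified stand-in types for the OpenAI-style shapes:

```typescript
// Simplified stand-in for the tool_calls entries shown above.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

// Look up a handler by function name, parse the JSON-string arguments,
// and invoke it; returns one result per tool call.
function dispatchToolCalls(
  toolCalls: ToolCall[],
  handlers: Record<string, (args: Record<string, unknown>) => string>,
): string[] {
  return toolCalls.map((call) => {
    const handler = handlers[call.function.name];
    if (!handler) throw new Error(`No handler for ${call.function.name}`);
    // arguments is a JSON string, e.g. '{"city":"Paris"}'.
    return handler(JSON.parse(call.function.arguments));
  });
}
```

The results would typically be fed back as role: "tool" messages in a follow-up chatCompletion() call.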
Related Pages
- Principle:Mlc_ai_Web_llm_Cross_Thread_Request_Forwarding -- The principle this implements
- Implementation:Mlc_ai_Web_llm_Web_Worker_MLC_Engine_Handler -- Worker-side handler that processes these requests
- Implementation:Mlc_ai_Web_llm_Async_Generate -- The streaming generator used by the streaming path
- Implementation:Mlc_ai_Web_llm_Create_Web_Worker_MLC_Engine -- Factory function that creates the engine proxy
- Principle:Mlc_ai_Web_llm_Cross_Thread_Streaming -- Streaming-specific design