
Implementation:Mlc ai Web llm Web Worker Chat Completion

From Leeroopedia


Overview

WebWorkerMLCEngine.chatCompletion() is the proxy-side implementation of chat completion that transparently forwards OpenAI-compatible ChatCompletionRequest objects across the Web Worker thread boundary. It handles both streaming and non-streaming requests, returning either a ChatCompletion object or an AsyncIterable<ChatCompletionChunk>.

Description

The chatCompletion() method on WebWorkerMLCEngine implements the same overloaded signatures as MLCEngine.chatCompletion():

  • chatCompletion(request: ChatCompletionRequestNonStreaming): Promise<ChatCompletion>
  • chatCompletion(request: ChatCompletionRequestStreaming): Promise<AsyncIterable<ChatCompletionChunk>>
  • chatCompletion(request: ChatCompletionRequestBase): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>

The method first validates that a model is loaded (throwing WorkerEngineModelNotLoadedError if not), then resolves the target model using getModelIdToUse(). Based on the stream flag, it follows one of two paths:
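The resolution step can be sketched as follows. This is an illustrative stand-in for getModelIdToUse(), not web-llm's actual implementation; the function name and error messages are hypothetical, but the branching mirrors the error conditions documented below.

```typescript
// Hypothetical sketch of getModelIdToUse()'s resolution logic.
// Names and messages are illustrative, not web-llm's actual code.
function resolveModelId(
  loadedModelIds: string[],
  requestedModel: string | undefined,
): string {
  if (requestedModel !== undefined) {
    // An explicitly requested model must be among the loaded ones.
    if (!loadedModelIds.includes(requestedModel)) {
      throw new Error(`SpecifiedModelNotFoundError: ${requestedModel}`);
    }
    return requestedModel;
  }
  // Without an explicit model, resolution is unambiguous only when
  // exactly one model is loaded.
  if (loadedModelIds.length === 1) {
    return loadedModelIds[0];
  }
  throw new Error("UnclearModelToUseError: set request.model");
}
```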

Non-streaming path:

  1. Constructs a WorkerRequest with kind: "chatCompletionNonStreaming"
  2. Packs the request, the loaded modelId[], and chatOpts[] into the message content
  3. Sends via getPromise<ChatCompletion>() and returns the promise

Streaming path:

  1. Constructs a WorkerRequest with kind: "chatCompletionStreamInit"
  2. Includes the selectedModelId so the worker knows which generator to create
  3. Awaits the initialization response (null)
  4. Returns this.asyncGenerate(selectedModelId) -- a local AsyncGenerator that fetches chunks from the worker one by one

The same pattern also applies to completion() (L727-784) for text completions and embedding() (L786-802) for embeddings, with corresponding request kinds.
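The client-side streaming pattern can be sketched in isolation. In the real code, asyncGenerate() issues one WorkerRequest per chunk via getPromise(); the fake transport and Chunk type below are illustrative assumptions standing in for that worker round trip, not web-llm APIs.

```typescript
// Simplified chunk shape; the real type is ChatCompletionChunk.
type Chunk = { content: string; done: boolean };

// Stand-in for getPromise(): each call "asks the worker" for the next chunk.
function makeFakeTransport(chunks: string[]) {
  let i = 0;
  return async (): Promise<Chunk> =>
    i < chunks.length
      ? { content: chunks[i++], done: false }
      : { content: "", done: true };
}

// Client-side generator: repeatedly request the next chunk until the
// worker signals completion, mirroring asyncGenerate(selectedModelId).
async function* asyncGenerate(
  nextChunk: () => Promise<Chunk>,
): AsyncGenerator<string, void, void> {
  for (;;) {
    const chunk = await nextChunk();
    if (chunk.done) return;
    yield chunk.content;
  }
}
```

The key point is that the generator lives on the main thread while generation happens in the worker: each `yield` corresponds to one request/response round trip across the thread boundary.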

Code Reference

Source: src/web_worker.ts, Lines 668-725

async chatCompletion(
  request: ChatCompletionRequestNonStreaming,
): Promise<ChatCompletion>;
async chatCompletion(
  request: ChatCompletionRequestStreaming,
): Promise<AsyncIterable<ChatCompletionChunk>>;
async chatCompletion(
  request: ChatCompletionRequestBase,
): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
async chatCompletion(
  request: ChatCompletionRequest,
): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion> {
  if (this.modelId === undefined) {
    throw new WorkerEngineModelNotLoadedError(this.constructor.name);
  }
  const selectedModelId = getModelIdToUse(
    this.modelId ? this.modelId : [],
    request.model,
    "ChatCompletionRequest",
  );

  if (request.stream) {
    // First let worker instantiate a generator
    const msg: WorkerRequest = {
      kind: "chatCompletionStreamInit",
      uuid: crypto.randomUUID(),
      content: {
        request: request,
        selectedModelId: selectedModelId,
        modelId: this.modelId,
        chatOpts: this.chatOpts,
      },
    };
    await this.getPromise<null>(msg);

    // Then return an async chunk generator that resides on the client side
    return this.asyncGenerate(selectedModelId) as AsyncGenerator<
      ChatCompletionChunk,
      void,
      void
    >;
  }

  // Non streaming case is more straightforward
  const msg: WorkerRequest = {
    kind: "chatCompletionNonStreaming",
    uuid: crypto.randomUUID(),
    content: {
      request: request,
      modelId: this.modelId,
      chatOpts: this.chatOpts,
    },
  };
  return await this.getPromise<ChatCompletion>(msg);
}
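For context, the worker-side counterpart dispatches on the request kind and echoes the uuid back so the proxy can resolve the matching pending promise. The sketch below is a simplified assumption about that handler's shape; the real WebWorkerMLCEngineHandler in web-llm differs in detail.

```typescript
// Hypothetical, simplified worker-side dispatch. Illustrative only.
type WorkerRequest = { kind: string; uuid: string; content: any };
type WorkerResponse = { kind: "return" | "throw"; uuid: string; content: any };

async function handleRequest(
  msg: WorkerRequest,
  engine: { chatCompletion(req: any): Promise<any> },
): Promise<WorkerResponse> {
  try {
    switch (msg.kind) {
      case "chatCompletionNonStreaming": {
        // Run the actual completion on the worker thread's engine.
        const reply = await engine.chatCompletion(msg.content.request);
        // Echo the uuid so the proxy resolves the right promise.
        return { kind: "return", uuid: msg.uuid, content: reply };
      }
      default:
        throw new Error(`Unknown request kind: ${msg.kind}`);
    }
  } catch (err) {
    // Errors travel back as "throw" responses and reject the proxy promise.
    return { kind: "throw", uuid: msg.uuid, content: String(err) };
  }
}
```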

I/O Contract

Input: A ChatCompletionRequest object following the OpenAI API format, containing:

  • messages -- Array of ChatCompletionMessageParam (system, user, assistant, tool messages)
  • model (optional) -- Target model ID; required when multiple models are loaded
  • stream (optional) -- If true, returns an async iterable of chunks
  • temperature, top_p, max_tokens, stop, etc. -- Generation parameters
  • tools (optional) -- For function calling support
  • stream_options (optional) -- E.g., { include_usage: true } for usage stats in streaming

Output (non-streaming): A Promise<ChatCompletion> containing:

  • choices[].message.content -- The generated text
  • choices[].message.tool_calls -- Function calls (if tools were provided)
  • choices[].finish_reason -- Why generation stopped ("stop", "length", "tool_calls")
  • usage -- Token counts and performance metrics

Output (streaming): A Promise<AsyncIterable<ChatCompletionChunk>> yielding:

  • choices[].delta.content -- Incremental text for each chunk
  • choices[].delta.role -- "assistant"
  • choices[].finish_reason -- null until the final chunk
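Consumers typically fold these chunks back into a full message. A minimal sketch, assuming the simplified chunk shape below rather than the full ChatCompletionChunk type:

```typescript
// Simplified per-chunk shape matching the fields listed above.
type DeltaChunk = {
  choices: { delta: { content?: string }; finish_reason: string | null }[];
};

// Accumulate incremental deltas and track the final finish_reason.
function accumulate(chunks: DeltaChunk[]): {
  text: string;
  finishReason: string | null;
} {
  let text = "";
  let finishReason: string | null = null;
  for (const chunk of chunks) {
    text += chunk.choices[0]?.delta?.content ?? "";
    // finish_reason stays null until the final chunk.
    finishReason = chunk.choices[0]?.finish_reason ?? finishReason;
  }
  return { text, finishReason };
}
```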

Error Conditions:

  • WorkerEngineModelNotLoadedError -- If no model has been loaded via reload()
  • SpecifiedModelNotFoundError -- If the requested model is not among the loaded models
  • UnclearModelToUseError -- If multiple models are loaded but request.model is not specified

Import

import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

// chatCompletion is accessed via the engine instance
const engine = await CreateWebWorkerMLCEngine(worker, modelId);
const result = await engine.chatCompletion(request);
// Or equivalently, via the OpenAI-style alias:
const sameResult = await engine.chat.completions.create(request);

Usage Examples

Non-streaming chat completion:

const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is the capital of France?" },
  ],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  temperature: 0.7,
  max_tokens: 256,
});

console.log(response.choices[0].message.content);
// "The capital of France is Paris."
console.log(response.usage);
// { prompt_tokens: 28, completion_tokens: 8, total_tokens: 36, extra: {...} }

Streaming chat completion:

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Write a haiku about programming." }],
  model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  stream: true,
  stream_options: { include_usage: true },
});

let fullResponse = "";
const output = document.getElementById("output");
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content || "";
  fullResponse += delta;
  // Update UI incrementally (guard against a missing element)
  if (output) output.textContent = fullResponse;
}

Function calling through the proxy:

const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "What's the weather in Paris?" }],
  model: "Hermes-2-Pro-Llama-3-8B-q4f16_1-MLC",
  tools: [{
    type: "function",
    function: {
      name: "get_weather",
      description: "Get the weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  }],
});

if (response.choices[0].message.tool_calls) {
  console.log(response.choices[0].message.tool_calls);
  // [{ id: "0", type: "function", function: { name: "get_weather", arguments: '{"city":"Paris"}' } }]
}

Related Pages

Principle:Mlc_ai_Web_llm_Cross_Thread_Request_Forwarding
