Implementation:Mlc_ai_Web_llm_Async_Generate
Overview
WebWorkerMLCEngine.asyncGenerate() is the proxy-side async generator method that bridges token streaming across the Web Worker thread boundary. It provides the AsyncGenerator<ChatCompletionChunk | Completion, void, void> returned to callers when they make streaming requests through the WebWorkerMLCEngine proxy.
Description
The asyncGenerate() method is an async * generator function on the WebWorkerMLCEngine class. It does not perform any inference itself. Instead, it acts as a pull-based bridge: each iteration sends a completionStreamNextChunk message to the worker, awaits the response, and yields the result to the caller.
The method is parameterized by selectedModelId, which identifies which model's generator to pull from on the worker side. This is essential for multi-model scenarios where the handler maintains separate generators for each loaded model.
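For illustration, here is a hedged multi-model sketch (the model IDs and worker URL are placeholders; loading multiple models by passing an array of IDs follows the library's multi-model support):

import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

// Sketch only: model IDs and the worker URL are placeholders. With two models loaded,
// the worker keeps one generator per model; the request's `model` field is resolved to
// selectedModelId, which routes each chunk pull to the matching generator.
const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  ["Llama-3.1-8B-Instruct-q4f16_1-MLC", "Phi-3-mini-4k-instruct-q4f16_1-MLC"],
);

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hi" }],
  model: "Phi-3-mini-4k-instruct-q4f16_1-MLC", // picks which worker-side generator to pull from
  stream: true,
});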
The generator loop:
- Constructs a `WorkerRequest` with `kind: "completionStreamNextChunk"` and the `selectedModelId`
- Sends it via `getPromise<ChatCompletionChunk>()` (the uuid-keyed request/response correlation this relies on is sketched after this list)
- Checks the return type: if the result is an object, it is a valid chunk to yield; if not (i.e., `void`), the worker's generator is done
- Breaks when the worker signals completion
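The awaited response comes back through `getPromise()`, which correlates each worker reply with its request by uuid. The following is a minimal sketch of how such correlation can work; it is a simplification, not web-llm's actual implementation, and the class name and response fields (`kind`, `uuid`, `content`) are illustrative:

// Minimal sketch of uuid-keyed request/response correlation over postMessage.
// Not web-llm's actual getPromise(); names and response fields are illustrative.
class RequestBridge {
  private pending = new Map<
    string,
    { resolve: (value: any) => void; reject: (reason: any) => void }
  >();

  constructor(private worker: Worker) {
    worker.onmessage = (ev: MessageEvent) => {
      const { uuid, kind, content } = ev.data;
      const entry = this.pending.get(uuid);
      if (entry === undefined) return; // not a reply to a pending request
      this.pending.delete(uuid);
      if (kind === "throw") entry.reject(content); // worker-side error
      else entry.resolve(content); // a chunk, or undefined when the stream is done
    };
  }

  // Register a pending promise keyed by the request's uuid, then post the request.
  getPromise<T>(msg: { uuid: string; kind: string; content?: unknown }): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.pending.set(msg.uuid, { resolve, reject });
      this.worker.postMessage(msg);
    });
  }
}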
The method is shared by the streaming paths of both `chatCompletion()` and `completion()`, since both use the same `completionStreamNextChunk` protocol. The actual type of each yielded value depends on which API initiated the stream, and the proxy casts the generator's type accordingly when returning it to the caller:
- `chatCompletion()` casts to `AsyncGenerator<ChatCompletionChunk, void, void>`
- `completion()` casts to `AsyncGenerator<Completion, void, void>`
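A hedged sketch of how the streaming branch of `chatCompletion()` can return the shared generator under the narrower type (simplified; request validation and the stream-init message are omitted):

// Simplified sketch of the streaming branch inside the proxy's chatCompletion().
// The real method first asks the worker to instantiate its generator
// (chatCompletionStreamInit) before handing this back to the caller.
if (request.stream) {
  return this.asyncGenerate(
    selectedModelId,
  ) as AsyncGenerator<ChatCompletionChunk, void, void>;
}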
Code Reference
Source: src/web_worker.ts, Lines 647-666
/**
* Every time the generator is called, we post a message to the worker asking it to
* decode one step, and we expect to receive a message of `ChatCompletionChunk` from
* the worker which we yield. The last message is `void`, meaning the generator has nothing
* to yield anymore.
*
* @param selectedModelId: The model of whose async generator to call next() to get next chunk.
* Needed because an engine can load multiple models.
*
* @note ChatCompletion and Completion share the same chunk generator.
*/
async *asyncGenerate(
selectedModelId: string,
): AsyncGenerator<ChatCompletionChunk | Completion, void, void> {
// Every time it gets called, sends message to worker, asking for the next chunk
while (true) {
const msg: WorkerRequest = {
kind: "completionStreamNextChunk",
uuid: crypto.randomUUID(),
content: {
selectedModelId: selectedModelId,
} as CompletionStreamNextChunkParams,
};
const ret = await this.getPromise<ChatCompletionChunk>(msg);
// If the worker's generator reached the end, it would return a `void`
if (typeof ret !== "object") {
break;
}
yield ret;
}
}
I/O Contract
Input (parameter):
| Parameter | Type | Description |
|---|---|---|
| `selectedModelId` | `string` | The model ID identifying which worker-side generator to pull chunks from. Resolved by `getModelIdToUse()` before calling this method. |
Output (yielded values):
Each iteration yields one of:
- `ChatCompletionChunk` -- For chat completion streaming. Contains `choices[].delta.content` (incremental text), `choices[].delta.role`, `choices[].finish_reason`, and optionally `usage`.
- `Completion` -- For text completion streaming. Contains `choices[].text` (incremental text), `choices[].finish_reason`.
- The generator returns `void` when the stream is exhausted (no more chunks).
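Because both shapes flow through the same generator, a consumer working against the union type can distinguish them by their choice fields. A small illustrative helper (not part of web-llm):

import type { ChatCompletionChunk, Completion } from "@mlc-ai/web-llm";

// Illustrative helper, not part of web-llm: extract the incremental text from
// either chunk shape yielded by the shared generator.
function chunkText(chunk: ChatCompletionChunk | Completion): string {
  const choice = chunk.choices[0];
  if (choice === undefined) return ""; // e.g. a usage-only final chunk
  if ("delta" in choice) return choice.delta.content ?? ""; // ChatCompletionChunk
  return choice.text; // Completion
}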
Messages sent to worker:
Each iteration sends a WorkerRequest:
{
kind: "completionStreamNextChunk",
uuid: "<random-uuid>",
content: {
selectedModelId: "Llama-3.1-8B-Instruct-q4f16_1-MLC"
}
}
Worker-side handling (for reference):
The worker handler's completionStreamNextChunk case:
case "completionStreamNextChunk": {
this.handleTask(msg.uuid, async () => {
const params = msg.content as CompletionStreamNextChunkParams;
const curGenerator = this.loadedModelIdToAsyncGenerator.get(
params.selectedModelId,
);
if (curGenerator === undefined) {
throw Error(
"InternalError: Chunk generator in worker should be instantiated by now.",
);
}
const { value } = await curGenerator.next();
return value;
});
return;
}
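For context, a plausible sketch of the init side that populates `loadedModelIdToAsyncGenerator`. This is not verbatim source; the params type name and its fields are assumptions inferred from the error condition below:

// Plausible sketch, not verbatim source: how the handler's "chatCompletionStreamInit"
// case could register a per-model generator for later "completionStreamNextChunk" pulls.
// `ChatCompletionStreamInitParams` and its fields are assumed names.
case "chatCompletionStreamInit": {
  this.handleTask(msg.uuid, async () => {
    const params = msg.content as ChatCompletionStreamInitParams;
    const generator = (await this.engine.chatCompletion(
      params.request, // a ChatCompletionRequest with stream: true
    )) as AsyncGenerator<ChatCompletionChunk, void, void>;
    this.loadedModelIdToAsyncGenerator.set(params.selectedModelId, generator);
    return null;
  });
  return;
}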
Error Conditions:
- If the worker's generator map does not have an entry for `selectedModelId`, the worker throws `"InternalError: Chunk generator in worker should be instantiated by now."` -- this indicates a bug where `chatCompletionStreamInit` was not called before `completionStreamNextChunk`.
- Any engine-level errors during `generator.next()` (e.g., device lost, OOM) are caught by `handleTask` and propagated as rejected promises.
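On the main thread these rejections surface as exceptions thrown out of the `for await` loop, so callers can wrap stream consumption in a try/catch (using a `stream` created as in the examples below):

// Worker-side failures reject the pending chunk promise, which makes the
// for-await iteration below throw; handle it like any other async error.
try {
  for await (const chunk of stream) {
    console.log(chunk.choices[0]?.delta?.content || "");
  }
} catch (err) {
  console.error("Streaming failed:", err);
}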
Import
This method is not directly imported. It is called internally by WebWorkerMLCEngine.chatCompletion() and WebWorkerMLCEngine.completion() when the request has stream: true. Users interact with the generator through the standard AsyncIterable interface:
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateWebWorkerMLCEngine(worker, modelId);
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Hello" }],
stream: true,
});
// This iterates asyncGenerate() internally
for await (const chunk of stream) {
console.log(chunk.choices[0]?.delta?.content || "");
}
Usage Examples
Basic streaming with UI update:
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Explain recursion in simple terms." }],
model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
stream: true,
});
const outputElement = document.getElementById("response");
let fullText = "";
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content || "";
fullText += delta;
outputElement.textContent = fullText;
}
Streaming with usage statistics:
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Hello!" }],
model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
stream: true,
stream_options: { include_usage: true },
});
let text = "";
for await (const chunk of stream) {
  if (chunk.usage) {
    // Final usage chunk (empty choices array)
    console.log("Tokens used:", chunk.usage.total_tokens);
    console.log("Decode speed:", chunk.usage.extra.decode_tokens_per_s, "tok/s");
  } else {
    // Regular content chunk: accumulate the incremental text
    text += chunk.choices[0]?.delta?.content || "";
  }
}
Text completion streaming:
const stream = await engine.completions.create({
prompt: "The meaning of life is",
model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
stream: true,
max_tokens: 100,
});
let completionText = "";
for await (const chunk of stream) {
  completionText += chunk.choices[0]?.text || "";
}
console.log(completionText);
Related Pages
- Principle:Mlc_ai_Web_llm_Cross_Thread_Streaming -- The principle this implements
- Implementation:Mlc_ai_Web_llm_Web_Worker_Chat_Completion -- The chat completion method that calls this generator
- Implementation:Mlc_ai_Web_llm_Web_Worker_MLC_Engine_Handler -- Worker-side handler with the real generator
- Implementation:Mlc_ai_Web_llm_Create_Web_Worker_MLC_Engine -- The proxy class this method belongs to
- Principle:Mlc_ai_Web_llm_Multi_Model_Routing -- How `selectedModelId` is resolved