Implementation:Mlc_ai_Web_llm_Async_Generate
Overview
WebWorkerMLCEngine.asyncGenerate() is the proxy-side async generator method that bridges token streaming across the Web Worker thread boundary. It provides the AsyncGenerator<ChatCompletionChunk | Completion, void, void> returned to callers when they make streaming requests through the WebWorkerMLCEngine proxy.
Description
The asyncGenerate() method is an async * generator function on the WebWorkerMLCEngine class. It does not perform any inference itself. Instead, it acts as a pull-based bridge: each iteration sends a completionStreamNextChunk message to the worker, awaits the response, and yields the result to the caller.
The method is parameterized by selectedModelId, which identifies which model's generator to pull from on the worker side. This is essential for multi-model scenarios where the handler maintains separate generators for each loaded model.
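For illustration, here is a hedged multi-model sketch (the model IDs and worker URL are placeholders; loading multiple models by passing an array of IDs follows the library's multi-model support):

import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

// Sketch only: model IDs and the worker URL are placeholders. With two models loaded,
// the worker keeps one generator per model; the request's `model` field is resolved to
// selectedModelId, which routes each chunk pull to the matching generator.
const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  ["Llama-3.1-8B-Instruct-q4f16_1-MLC", "Phi-3-mini-4k-instruct-q4f16_1-MLC"],
);

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hi" }],
  model: "Phi-3-mini-4k-instruct-q4f16_1-MLC", // picks which worker-side generator to pull from
  stream: true,
});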
The generator loop:
- Constructs a `WorkerRequest` with `kind: "completionStreamNextChunk"` and the `selectedModelId`
- Sends it via `getPromise<ChatCompletionChunk>()` (the uuid-keyed request/response correlation this relies on is sketched after this list)
- Checks the return type: if the result is an object, it is a valid chunk to yield; if not (i.e., `void`), the worker's generator is done
- Breaks when the worker signals completion
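The awaited response comes back through `getPromise()`, which correlates each worker reply with its request by uuid. The following is a minimal sketch of how such correlation can work; it is a simplification, not web-llm's actual implementation, and the class name and response fields (`kind`, `uuid`, `content`) are illustrative:

// Minimal sketch of uuid-keyed request/response correlation over postMessage.
// Not web-llm's actual getPromise(); names and response fields are illustrative.
class RequestBridge {
  private pending = new Map<
    string,
    { resolve: (value: any) => void; reject: (reason: any) => void }
  >();

  constructor(private worker: Worker) {
    worker.onmessage = (ev: MessageEvent) => {
      const { uuid, kind, content } = ev.data;
      const entry = this.pending.get(uuid);
      if (entry === undefined) return; // not a reply to a pending request
      this.pending.delete(uuid);
      if (kind === "throw") entry.reject(content); // worker-side error
      else entry.resolve(content); // a chunk, or undefined when the stream is done
    };
  }

  // Register a pending promise keyed by the request's uuid, then post the request.
  getPromise<T>(msg: { uuid: string; kind: string; content?: unknown }): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.pending.set(msg.uuid, { resolve, reject });
      this.worker.postMessage(msg);
    });
  }
}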
The method is shared by the streaming paths of both `chatCompletion()` and `completion()`, since both use the same `completionStreamNextChunk` protocol. The actual type of each yielded value depends on which API initiated the stream, and the proxy casts the generator's type accordingly when returning it to the caller:
- `chatCompletion()` casts to `AsyncGenerator<ChatCompletionChunk, void, void>`
- `completion()` casts to `AsyncGenerator<Completion, void, void>`
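A hedged sketch of how the streaming branch of `chatCompletion()` can return the shared generator under the narrower type (simplified; request validation and the stream-init message are omitted):

// Simplified sketch of the streaming branch inside the proxy's chatCompletion().
// The real method first asks the worker to instantiate its generator
// (chatCompletionStreamInit) before handing this back to the caller.
if (request.stream) {
  return this.asyncGenerate(
    selectedModelId,
  ) as AsyncGenerator<ChatCompletionChunk, void, void>;
}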
Code Reference
Source: src/web_worker.ts, Lines 647-666
/**
* Every time the generator is called, we post a message to the worker asking it to
* decode one step, and we expect to receive a message of `ChatCompletionChunk` from
* the worker which we yield. The last message is `void`, meaning the generator has nothing
* to yield anymore.
*
* @param selectedModelId: The model of whose async generator to call next() to get next chunk.
* Needed because an engine can load multiple models.
*
* @note ChatCompletion and Completion share the same chunk generator.
*/
async *asyncGenerate(
selectedModelId: string,
): AsyncGenerator<ChatCompletionChunk | Completion, void, void> {
// Every time it gets called, sends message to worker, asking for the next chunk
while (true) {
const msg: WorkerRequest = {
kind: "completionStreamNextChunk",
uuid: crypto.randomUUID(),
content: {
selectedModelId: selectedModelId,
} as CompletionStreamNextChunkParams,
};
const ret = await this.getPromise<ChatCompletionChunk>(msg);
// If the worker's generator reached the end, it would return a `void`
if (typeof ret !== "object") {
break;
}
yield ret;
}
}
I/O Contract
Input (parameter):
| Parameter | Type | Description |
|---|---|---|
| `selectedModelId` | `string` | The model ID identifying which worker-side generator to pull chunks from. Resolved by `getModelIdToUse()` before calling this method. |
Output (yielded values):
Each iteration yields one of:
- `ChatCompletionChunk` -- For chat completion streaming. Contains `choices[].delta.content` (incremental text), `choices[].delta.role`, `choices[].finish_reason`, and optionally `usage`.
- `Completion` -- For text completion streaming. Contains `choices[].text` (incremental text), `choices[].finish_reason`.
- The generator returns `void` when the stream is exhausted (no more chunks).
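Because both shapes flow through the same generator, a consumer working against the union type can distinguish them by their choice fields. A small illustrative helper (not part of web-llm):

import type { ChatCompletionChunk, Completion } from "@mlc-ai/web-llm";

// Illustrative helper, not part of web-llm: extract the incremental text from
// either chunk shape yielded by the shared generator.
function chunkText(chunk: ChatCompletionChunk | Completion): string {
  const choice = chunk.choices[0];
  if (choice === undefined) return ""; // e.g. a usage-only final chunk
  if ("delta" in choice) return choice.delta.content ?? ""; // ChatCompletionChunk
  return choice.text; // Completion
}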
Messages sent to worker:
Each iteration sends a WorkerRequest:
{
kind: "completionStreamNextChunk",
uuid: "<random-uuid>",
content: {
selectedModelId: "Llama-3.1-8B-Instruct-q4f16_1-MLC"
}
}
Worker-side handling (for reference):
The worker handler's completionStreamNextChunk case:
case "completionStreamNextChunk": {
this.handleTask(msg.uuid, async () => {
const params = msg.content as CompletionStreamNextChunkParams;
const curGenerator = this.loadedModelIdToAsyncGenerator.get(
params.selectedModelId,
);
if (curGenerator === undefined) {
throw Error(
"InternalError: Chunk generator in worker should be instantiated by now.",
);
}
const { value } = await curGenerator.next();
return value;
});
return;
}
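For context, a plausible sketch of the init side that populates `loadedModelIdToAsyncGenerator`. This is not verbatim source; the params type name and its fields are assumptions inferred from the error condition below:

// Plausible sketch, not verbatim source: how the handler's "chatCompletionStreamInit"
// case could register a per-model generator for later "completionStreamNextChunk" pulls.
// `ChatCompletionStreamInitParams` and its fields are assumed names.
case "chatCompletionStreamInit": {
  this.handleTask(msg.uuid, async () => {
    const params = msg.content as ChatCompletionStreamInitParams;
    const generator = (await this.engine.chatCompletion(
      params.request, // a ChatCompletionRequest with stream: true
    )) as AsyncGenerator<ChatCompletionChunk, void, void>;
    this.loadedModelIdToAsyncGenerator.set(params.selectedModelId, generator);
    return null;
  });
  return;
}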
Error Conditions:
- If the worker's generator map does not have an entry for `selectedModelId`, the worker throws `"InternalError: Chunk generator in worker should be instantiated by now."` -- this indicates a bug where `chatCompletionStreamInit` was not called before `completionStreamNextChunk`.
- Any engine-level errors during `generator.next()` (e.g., device lost, OOM) are caught by `handleTask` and propagated as rejected promises.
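On the main thread these rejections surface as exceptions thrown out of the `for await` loop, so callers can wrap stream consumption in a try/catch (using a `stream` created as in the examples below):

// Worker-side failures reject the pending chunk promise, which makes the
// for-await iteration below throw; handle it like any other async error.
try {
  for await (const chunk of stream) {
    console.log(chunk.choices[0]?.delta?.content || "");
  }
} catch (err) {
  console.error("Streaming failed:", err);
}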
Import
This method is not directly imported. It is called internally by WebWorkerMLCEngine.chatCompletion() and WebWorkerMLCEngine.completion() when the request has stream: true. Users interact with the generator through the standard AsyncIterable interface:
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateWebWorkerMLCEngine(worker, modelId);
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Hello" }],
stream: true,
});
// This iterates asyncGenerate() internally
for await (const chunk of stream) {
console.log(chunk.choices[0]?.delta?.content || "");
}
Usage Examples
Basic streaming with UI update:
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Explain recursion in simple terms." }],
model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
stream: true,
});
const outputElement = document.getElementById("response");
let fullText = "";
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content || "";
fullText += delta;
outputElement.textContent = fullText;
}
Streaming with usage statistics:
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Hello!" }],
model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
stream: true,
stream_options: { include_usage: true },
});
let text = "";
for await (const chunk of stream) {
  if (chunk.usage) {
    // Final usage chunk (empty choices array)
    console.log("Tokens used:", chunk.usage.total_tokens);
    console.log("Decode speed:", chunk.usage.extra.decode_tokens_per_s, "tok/s");
  } else {
    // Regular content chunk: accumulate the incremental text
    text += chunk.choices[0]?.delta?.content || "";
  }
}
Text completion streaming:
const stream = await engine.completions.create({
prompt: "The meaning of life is",
model: "Llama-3.1-8B-Instruct-q4f16_1-MLC",
stream: true,
max_tokens: 100,
});
let completionText = "";
for await (const chunk of stream) {
  completionText += chunk.choices[0]?.text || "";
}
console.log(completionText);
Related Pages
- Principle:Mlc_ai_Web_llm_Cross_Thread_Streaming -- The principle this implements
- Implementation:Mlc_ai_Web_llm_Web_Worker_Chat_Completion -- The chat completion method that calls this generator
- Implementation:Mlc_ai_Web_llm_Web_Worker_MLC_Engine_Handler -- Worker-side handler with the real generator
- Implementation:Mlc_ai_Web_llm_Create_Web_Worker_MLC_Engine -- The proxy class this method belongs to
- Principle:Mlc_ai_Web_llm_Multi_Model_Routing -- How `selectedModelId` is resolved