Implementation: @mlc-ai/web-llm engine.chat.completions.create()
Overview
engine.chat.completions.create() is the primary inference method provided by @mlc-ai/web-llm. It accepts an OpenAI-compatible ChatCompletionRequest and returns either a ChatCompletion (non-streaming) or an AsyncIterable<ChatCompletionChunk> (streaming). Internally, it delegates to MLCEngine.chatCompletion() which validates the request, formats the conversation, acquires a per-model concurrency lock, and runs the prefill-decode inference loop through LLMChatPipeline.
Description
The inference pipeline proceeds through the following stages:
1. Request Validation and Preprocessing
- Resolves which loaded model to use (required when multiple models are loaded)
- Validates that the correct pipeline type is loaded (LLMChatPipeline, not EmbeddingPipeline)
- Calls postInitAndCheckFieldsChatCompletion() to validate message ordering, field constraints, and tool-calling compatibility
- Extracts generation parameters into a GenerationConfig object
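One of the validation checks can be sketched as follows. This is a simplified illustration, not web-llm's exact code: the real postInitAndCheckFieldsChatCompletion() enforces many more constraints (stream options, tool fields, seed, and so on).

```typescript
// Illustrative sketch of one message-ordering rule: a system message, if
// present, must be the first message in the request. The error text and the
// Msg shape here are assumptions for the example.
interface Msg {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

function checkSystemMessageFirst(messages: Msg[]): void {
  messages.forEach((m, i) => {
    if (m.role === "system" && i !== 0) {
      throw new Error("System prompt should be the first message in a request.");
    }
  });
}
```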
2. Concurrency Lock Acquisition
Each loaded model has a CustomLock instance. The engine acquires this lock before starting inference, ensuring that each model processes only one request at a time. This prevents race conditions in GPU operations.
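A minimal promise-chaining lock in the spirit of CustomLock might look like this. This is an illustrative sketch only; web-llm's actual lock implementation may differ.

```typescript
// Illustrative async mutex: callers queue behind any in-flight work and run
// strictly one at a time, which is the property the per-model lock provides.
class AsyncLock {
  private tail: Promise<void> = Promise.resolve();

  acquire<T>(fn: () => Promise<T>): Promise<T> {
    const run = this.tail.then(fn);
    // Keep the chain alive even if fn rejects, so later callers still run.
    this.tail = run.then(() => undefined, () => undefined);
    return run;
  }
}

// Usage: two calls against the same lock complete in submission order,
// even though the first one is slower.
async function demo(): Promise<number[]> {
  const lock = new AsyncLock();
  const order: number[] = [];
  await Promise.all([
    lock.acquire(async () => {
      await new Promise((r) => setTimeout(r, 10)); // simulate slow inference
      order.push(1);
    }),
    lock.acquire(async () => {
      order.push(2);
    }),
  ]);
  return order;
}
```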
3. Conversation State Management
During prefill, the engine:
- Constructs a new Conversation object from the request's messages (excluding the last message)
- Compares it with the pipeline's existing conversation state via compareConversationObject()
- If they match (multi-round chat), reuses the KV cache and only prefills the new user message
- If they differ, resets the pipeline (clearing the KV cache) and sets the new conversation
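The reuse decision can be sketched as a history comparison. This is a simplification: the real compareConversationObject() in web-llm also compares conversation configuration and function-calling state, and the Msg shape below is illustrative.

```typescript
// Simplified sketch of the KV-cache reuse check: the pipeline's stored
// history must exactly match the request's messages minus the new last one.
interface Msg {
  role: string;
  content: string;
}

function sameHistory(a: Msg[], b: Msg[]): boolean {
  return (
    a.length === b.length &&
    a.every((m, i) => m.role === b[i].role && m.content === b[i].content)
  );
}

function shouldReuseKVCache(pipelineHistory: Msg[], requestMessages: Msg[]): boolean {
  // Compare everything except the last (new) message in the request.
  const priorMessages = requestMessages.slice(0, -1);
  return sameHistory(pipelineHistory, priorMessages);
}
```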
4. Prefill and Decode
- Prefill: pipeline.prefillStep() processes the input tokens through the model
- Decode loop: repeatedly calls pipeline.decodeStep() until pipeline.stopped() returns true or the interrupt signal is set
- For streaming, each decode step yields a ChatCompletionChunk with the incremental delta
- For non-streaming, the full output is collected and returned as a ChatCompletion
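The prefill-decode loop above can be sketched as an async generator. This is hedged pseudocode: the method names mirror LLMChatPipeline, but the Pipeline interface, mock pipeline, and control flow are illustrative simplifications of asyncGenerate().

```typescript
// Simplified shape of the inference loop: one prefill, then decode steps
// until the pipeline reports it has stopped, yielding text deltas.
interface Pipeline {
  prefillStep(prompt: string): Promise<void>;
  decodeStep(): Promise<void>;
  stopped(): boolean;
  getMessage(): string; // full output decoded so far
}

async function* generate(pipeline: Pipeline, prompt: string): AsyncGenerator<string> {
  await pipeline.prefillStep(prompt);
  let emitted = "";
  while (!pipeline.stopped()) {
    await pipeline.decodeStep();
    const full = pipeline.getMessage();
    const delta = full.slice(emitted.length); // incremental delta per chunk
    emitted = full;
    if (delta) yield delta;
  }
}

// Mock pipeline that "decodes" one character per step, for demonstration.
function mockPipeline(text: string): Pipeline {
  let n = 0;
  return {
    async prefillStep() { n = 0; },
    async decodeStep() { n++; },
    stopped: () => n >= text.length,
    getMessage: () => text.slice(0, n),
  };
}

async function collect(): Promise<string> {
  let out = "";
  for await (const delta of generate(mockPipeline("WebGPU"), "hi")) out += delta;
  return out;
}
```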
5. Post-processing
- For function calling requests, parses the output message as JSON tool calls
- Computes usage statistics (token counts, throughput metrics)
- Releases the concurrency lock
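The tool-call parsing step can be sketched as follows. The flat JSON output format and the id scheme here are assumptions for illustration; web-llm's actual parsing of function-calling output is model-family specific.

```typescript
// Hedged sketch: converting a model's raw JSON output into OpenAI-style
// tool_calls entries. Assumes the model emitted an array of
// { name, parameters } objects, which is not guaranteed for every model.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

function parseToolCalls(raw: string): ToolCall[] {
  const parsed = JSON.parse(raw) as Array<{ name: string; parameters: object }>;
  return parsed.map((call, i) => ({
    id: `call_${i}`, // illustrative id scheme
    type: "function",
    function: { name: call.name, arguments: JSON.stringify(call.parameters) },
  }));
}
```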
Code Reference
- Repository: https://github.com/mlc-ai/web-llm
- File: src/openai_api_protocols/chat_completion.ts (Completions proxy class, lines 60-78)
- File: src/engine.ts (chatCompletion(), lines 767-945; asyncGenerate(), lines 480-749; _generate(), lines 437-459; prefill(), lines 1346-1404; decode(), lines 1409-1411)
Type Signatures
// Completions proxy class in src/openai_api_protocols/chat_completion.ts
export class Completions {
create(request: ChatCompletionRequestNonStreaming): Promise<ChatCompletion>;
create(request: ChatCompletionRequestStreaming): Promise<AsyncIterable<ChatCompletionChunk>>;
create(request: ChatCompletionRequestBase): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
create(request: ChatCompletionRequest): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
}
// MLCEngine.chatCompletion() in src/engine.ts
async chatCompletion(request: ChatCompletionRequestNonStreaming): Promise<ChatCompletion>;
async chatCompletion(request: ChatCompletionRequestStreaming): Promise<AsyncIterable<ChatCompletionChunk>>;
async chatCompletion(request: ChatCompletionRequest): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
Import
import { CreateMLCEngine } from "@mlc-ai/web-llm";
// The completions API is accessed via the engine instance:
// engine.chat.completions.create(request)
I/O Contract
| Direction | Name | Type | Required | Description |
|---|---|---|---|---|
| Input | request | ChatCompletionRequest | Yes | OpenAI-compatible request object with messages and generation parameters |
| Output (non-streaming) | response | Promise<ChatCompletion> | -- | Complete response with choices, message content, and usage statistics |
| Output (streaming) | chunks | Promise<AsyncIterable<ChatCompletionChunk>> | -- | Async iterable yielding incremental chunks with delta content |
Usage Example
Non-Streaming
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
initProgressCallback: (p) => console.log(p.text),
});
const response = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is WebGPU?" },
],
temperature: 0.7,
max_tokens: 256,
});
console.log("Response:", response.choices[0].message.content);
console.log("Finish reason:", response.choices[0].finish_reason);
console.log("Tokens used:", response.usage?.total_tokens);
console.log("Decode speed:", response.usage?.extra.decode_tokens_per_s, "tok/s");
Streaming
const stream = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing in simple terms." },
],
temperature: 0.7,
max_tokens: 512,
stream: true,
stream_options: { include_usage: true },
});
let fullResponse = "";
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content;
if (delta) {
fullResponse += delta;
    process.stdout.write(delta); // print token by token (Node.js; in a browser, append to the DOM instead)
}
if (chunk.usage) {
console.log("\nPrefill speed:", chunk.usage.extra.prefill_tokens_per_s, "tok/s");
console.log("Decode speed:", chunk.usage.extra.decode_tokens_per_s, "tok/s");
}
}
Multi-Round Conversation
// First turn
const reply1 = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a math tutor." },
{ role: "user", content: "What is 2 + 2?" },
],
});
// Second turn -- web-llm reuses KV cache from first turn
const reply2 = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a math tutor." },
{ role: "user", content: "What is 2 + 2?" },
{ role: "assistant", content: reply1.choices[0].message.content! },
{ role: "user", content: "Now multiply that by 3." },
],
});