Implementation: @mlc-ai/web-llm engine.chat.completions.create()
Overview
engine.chat.completions.create() is the primary inference method provided by @mlc-ai/web-llm. It accepts an OpenAI-compatible ChatCompletionRequest and returns either a ChatCompletion (non-streaming) or an AsyncIterable<ChatCompletionChunk> (streaming). Internally, it delegates to MLCEngine.chatCompletion() which validates the request, formats the conversation, acquires a per-model concurrency lock, and runs the prefill-decode inference loop through LLMChatPipeline.
Description
The inference pipeline proceeds through the following stages:
1. Request Validation and Preprocessing
- Resolves which loaded model to use (required when multiple models are loaded)
- Validates that the correct pipeline type is loaded (LLMChatPipeline, not EmbeddingPipeline)
- Calls postInitAndCheckFieldsChatCompletion() to validate message ordering, field constraints, and tool-calling compatibility
- Extracts generation parameters into a GenerationConfig object
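One of the validation checks can be sketched as follows. This is a simplified illustration, not web-llm's exact code: the real postInitAndCheckFieldsChatCompletion() enforces many more constraints (stream options, tool fields, seed, and so on).

```typescript
// Illustrative sketch of one message-ordering rule: a system message, if
// present, must be the first message in the request. The error text and the
// Msg shape here are assumptions for the example.
interface Msg {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

function checkSystemMessageFirst(messages: Msg[]): void {
  messages.forEach((m, i) => {
    if (m.role === "system" && i !== 0) {
      throw new Error("System prompt should be the first message in a request.");
    }
  });
}
```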
2. Concurrency Lock Acquisition
Each loaded model has a CustomLock instance. The engine acquires this lock before starting inference, ensuring that each model processes only one request at a time. This prevents race conditions in GPU operations.
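A minimal promise-chaining lock in the spirit of CustomLock might look like this. This is an illustrative sketch only; web-llm's actual lock implementation may differ.

```typescript
// Illustrative async mutex: callers queue behind any in-flight work and run
// strictly one at a time, which is the property the per-model lock provides.
class AsyncLock {
  private tail: Promise<void> = Promise.resolve();

  acquire<T>(fn: () => Promise<T>): Promise<T> {
    const run = this.tail.then(fn);
    // Keep the chain alive even if fn rejects, so later callers still run.
    this.tail = run.then(() => undefined, () => undefined);
    return run;
  }
}

// Usage: two calls against the same lock complete in submission order,
// even though the first one is slower.
async function demo(): Promise<number[]> {
  const lock = new AsyncLock();
  const order: number[] = [];
  await Promise.all([
    lock.acquire(async () => {
      await new Promise((r) => setTimeout(r, 10)); // simulate slow inference
      order.push(1);
    }),
    lock.acquire(async () => {
      order.push(2);
    }),
  ]);
  return order;
}
```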
3. Conversation State Management
During prefill, the engine:
- Constructs a new Conversation object from the request's messages (excluding the last message)
- Compares it with the pipeline's existing conversation state via compareConversationObject()
- If they match (multi-round chat), reuses the KV cache and only prefills the new user message
- If they differ, resets the pipeline (clearing the KV cache) and sets the new conversation
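The reuse decision can be sketched as a history comparison. This is a simplification: the real compareConversationObject() in web-llm also compares conversation configuration and function-calling state, and the Msg shape below is illustrative.

```typescript
// Simplified sketch of the KV-cache reuse check: the pipeline's stored
// history must exactly match the request's messages minus the new last one.
interface Msg {
  role: string;
  content: string;
}

function sameHistory(a: Msg[], b: Msg[]): boolean {
  return (
    a.length === b.length &&
    a.every((m, i) => m.role === b[i].role && m.content === b[i].content)
  );
}

function shouldReuseKVCache(pipelineHistory: Msg[], requestMessages: Msg[]): boolean {
  // Compare everything except the last (new) message in the request.
  const priorMessages = requestMessages.slice(0, -1);
  return sameHistory(pipelineHistory, priorMessages);
}
```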
4. Prefill and Decode
- Prefill: pipeline.prefillStep() processes the input tokens through the model
- Decode loop: repeatedly calls pipeline.decodeStep() until pipeline.stopped() returns true or the interrupt signal is set
- For streaming, each decode step yields a ChatCompletionChunk with the incremental delta
- For non-streaming, the full output is collected and returned as a ChatCompletion
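The prefill-decode loop above can be sketched as an async generator. This is hedged pseudocode: the method names mirror LLMChatPipeline, but the Pipeline interface, mock pipeline, and control flow are illustrative simplifications of asyncGenerate().

```typescript
// Simplified shape of the inference loop: one prefill, then decode steps
// until the pipeline reports it has stopped, yielding text deltas.
interface Pipeline {
  prefillStep(prompt: string): Promise<void>;
  decodeStep(): Promise<void>;
  stopped(): boolean;
  getMessage(): string; // full output decoded so far
}

async function* generate(pipeline: Pipeline, prompt: string): AsyncGenerator<string> {
  await pipeline.prefillStep(prompt);
  let emitted = "";
  while (!pipeline.stopped()) {
    await pipeline.decodeStep();
    const full = pipeline.getMessage();
    const delta = full.slice(emitted.length); // incremental delta per chunk
    emitted = full;
    if (delta) yield delta;
  }
}

// Mock pipeline that "decodes" one character per step, for demonstration.
function mockPipeline(text: string): Pipeline {
  let n = 0;
  return {
    async prefillStep() { n = 0; },
    async decodeStep() { n++; },
    stopped: () => n >= text.length,
    getMessage: () => text.slice(0, n),
  };
}

async function collect(): Promise<string> {
  let out = "";
  for await (const delta of generate(mockPipeline("WebGPU"), "hi")) out += delta;
  return out;
}
```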
5. Post-processing
- For function calling requests, parses the output message as JSON tool calls
- Computes usage statistics (token counts, throughput metrics)
- Releases the concurrency lock
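The tool-call parsing step can be sketched as follows. The flat JSON output format and the id scheme here are assumptions for illustration; web-llm's actual parsing of function-calling output is model-family specific.

```typescript
// Hedged sketch: converting a model's raw JSON output into OpenAI-style
// tool_calls entries. Assumes the model emitted an array of
// { name, parameters } objects, which is not guaranteed for every model.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

function parseToolCalls(raw: string): ToolCall[] {
  const parsed = JSON.parse(raw) as Array<{ name: string; parameters: object }>;
  return parsed.map((call, i) => ({
    id: `call_${i}`, // illustrative id scheme
    type: "function",
    function: { name: call.name, arguments: JSON.stringify(call.parameters) },
  }));
}
```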
Code Reference
- Repository: https://github.com/mlc-ai/web-llm
- File: src/openai_api_protocols/chat_completion.ts (Completions proxy class, lines 60-78)
- File: src/engine.ts (chatCompletion(), lines 767-945; asyncGenerate(), lines 480-749; _generate(), lines 437-459; prefill(), lines 1346-1404; decode(), lines 1409-1411)
Type Signatures
// Completions proxy class in src/openai_api_protocols/chat_completion.ts
export class Completions {
create(request: ChatCompletionRequestNonStreaming): Promise<ChatCompletion>;
create(request: ChatCompletionRequestStreaming): Promise<AsyncIterable<ChatCompletionChunk>>;
create(request: ChatCompletionRequestBase): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
create(request: ChatCompletionRequest): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
}
// MLCEngine.chatCompletion() in src/engine.ts
async chatCompletion(request: ChatCompletionRequestNonStreaming): Promise<ChatCompletion>;
async chatCompletion(request: ChatCompletionRequestStreaming): Promise<AsyncIterable<ChatCompletionChunk>>;
async chatCompletion(request: ChatCompletionRequest): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
Import
import { CreateMLCEngine } from "@mlc-ai/web-llm";
// The completions API is accessed via the engine instance:
// engine.chat.completions.create(request)
I/O Contract
| Direction | Name | Type | Required | Description |
|---|---|---|---|---|
| Input | request | ChatCompletionRequest | Yes | OpenAI-compatible request object with messages and generation parameters |
| Output (non-streaming) | response | Promise<ChatCompletion> | -- | Complete response with choices, message content, and usage statistics |
| Output (streaming) | chunks | Promise<AsyncIterable<ChatCompletionChunk>> | -- | Async iterable yielding incremental chunks with delta content |
Usage Example
Non-Streaming
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
initProgressCallback: (p) => console.log(p.text),
});
const response = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is WebGPU?" },
],
temperature: 0.7,
max_tokens: 256,
});
console.log("Response:", response.choices[0].message.content);
console.log("Finish reason:", response.choices[0].finish_reason);
console.log("Tokens used:", response.usage?.total_tokens);
console.log("Decode speed:", response.usage?.extra.decode_tokens_per_s, "tok/s");
Streaming
const stream = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing in simple terms." },
],
temperature: 0.7,
max_tokens: 512,
stream: true,
stream_options: { include_usage: true },
});
let fullResponse = "";
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content;
if (delta) {
fullResponse += delta;
    process.stdout.write(delta); // print token by token (Node.js; in a browser, append to the DOM instead)
}
if (chunk.usage) {
console.log("\nPrefill speed:", chunk.usage.extra.prefill_tokens_per_s, "tok/s");
console.log("Decode speed:", chunk.usage.extra.decode_tokens_per_s, "tok/s");
}
}
Multi-Round Conversation
// First turn
const reply1 = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a math tutor." },
{ role: "user", content: "What is 2 + 2?" },
],
});
// Second turn -- web-llm reuses KV cache from first turn
const reply2 = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a math tutor." },
{ role: "user", content: "What is 2 + 2?" },
{ role: "assistant", content: reply1.choices[0].message.content! },
{ role: "user", content: "Now multiply that by 3." },
],
});