
Implementation:Mlc ai Web llm Chat Completions Create

From Leeroopedia

Overview

engine.chat.completions.create() is the primary inference method provided by @mlc-ai/web-llm. It accepts an OpenAI-compatible ChatCompletionRequest and returns either a ChatCompletion (non-streaming) or an AsyncIterable<ChatCompletionChunk> (streaming). Internally, it delegates to MLCEngine.chatCompletion(), which validates the request, formats the conversation, acquires a per-model concurrency lock, and runs the prefill-decode inference loop through LLMChatPipeline.

Description

The inference pipeline proceeds through the following stages:

1. Request Validation and Preprocessing

  • Resolves which loaded model to use (required when multiple models are loaded)
  • Validates that the correct pipeline type is loaded (LLMChatPipeline, not EmbeddingPipeline)
  • Calls postInitAndCheckFieldsChatCompletion() to validate message ordering, field constraints, and tool calling compatibility
  • Extracts generation parameters into a GenerationConfig object
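
The parameter-extraction step can be sketched as follows. This is a simplified illustration, not web-llm's actual code: GenerationConfigSketch is a hypothetical reduced type, and only a few of the OpenAI-style sampling fields are shown.

```typescript
// Simplified sketch: pull sampling parameters out of an OpenAI-style
// request into a flat generation-config object.
// GenerationConfigSketch is hypothetical, not web-llm's GenerationConfig.
interface GenerationConfigSketch {
  temperature?: number;
  top_p?: number;
  max_tokens?: number | null;
  frequency_penalty?: number;
  presence_penalty?: number;
}

function extractGenerationConfig(request: GenerationConfigSketch): GenerationConfigSketch {
  // Copy only the recognized sampling fields; unset fields stay undefined
  // so the pipeline can fall back to per-model defaults.
  return {
    temperature: request.temperature,
    top_p: request.top_p,
    max_tokens: request.max_tokens,
    frequency_penalty: request.frequency_penalty,
    presence_penalty: request.presence_penalty,
  };
}
```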

2. Concurrency Lock Acquisition

Each loaded model has a CustomLock instance. The engine acquires this lock before starting inference, ensuring that each model processes only one request at a time. This prevents race conditions in GPU operations.
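
The per-model lock can be illustrated with a minimal promise-chain mutex. This is a sketch of the general technique; web-llm's CustomLock may differ in its details.

```typescript
// Minimal async mutex: callers queue on a promise chain, so at most one
// holder runs at a time (a sketch of the per-model lock idea).
class SimpleLock {
  private tail: Promise<void> = Promise.resolve();

  // Resolves with a release function once all earlier holders have released.
  acquire(): Promise<() => void> {
    let release!: () => void;
    const next = new Promise<void>((resolve) => { release = resolve; });
    const prev = this.tail;
    this.tail = prev.then(() => next);
    return prev.then(() => release);
  }
}

// Helper mirroring how an engine might wrap one inference request.
async function withModelLock<T>(lock: SimpleLock, fn: () => Promise<T>): Promise<T> {
  const release = await lock.acquire();
  try {
    return await fn();
  } finally {
    release(); // always release, even if inference throws
  }
}
```

With one SimpleLock per loaded model, two concurrent create() calls against the same model run back to back rather than interleaving GPU work.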

3. Conversation State Management

During prefill, the engine:

  • Constructs a new Conversation object from the request's messages (excluding the last message)
  • Compares it with the pipeline's existing conversation state via compareConversationObject()
  • If they match (multi-round chat), reuses the KV cache and only prefills the new user message
  • If they differ, resets the pipeline (clearing KV cache) and sets the new conversation
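
The reuse-or-reset decision can be sketched with simplified types. Msg and planPrefill below are hypothetical; the real code compares full Conversation objects via compareConversationObject(), not plain message arrays.

```typescript
// Sketch of the prefill-time decision: reuse the KV cache when the new
// request's history (all but the last message) matches the pipeline's
// current conversation state. Types are simplified for illustration.
type Msg = { role: string; content: string };

function sameHistory(a: Msg[], b: Msg[]): boolean {
  return (
    a.length === b.length &&
    a.every((m, i) => m.role === b[i].role && m.content === b[i].content)
  );
}

function planPrefill(
  requestMessages: Msg[],
  pipelineHistory: Msg[],
): { resetKVCache: boolean; toPrefill: Msg[] } {
  const history = requestMessages.slice(0, -1); // exclude the last message
  const last = requestMessages[requestMessages.length - 1];
  if (sameHistory(history, pipelineHistory)) {
    // Multi-round chat: keep the KV cache, prefill only the new message.
    return { resetKVCache: false, toPrefill: [last] };
  }
  // Different history: reset the pipeline and prefill everything.
  return { resetKVCache: true, toPrefill: requestMessages };
}
```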

4. Prefill and Decode

  • Prefill: pipeline.prefillStep() processes the input tokens through the model
  • Decode loop: Repeatedly calls pipeline.decodeStep() until pipeline.stopped() returns true or the interrupt signal is set
  • For streaming, each decode step yields a ChatCompletionChunk with the incremental delta
  • For non-streaming, the full output is collected and returned as a ChatCompletion
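
The prefill-decode loop above can be sketched as an async generator over a pipeline-like interface. PipelineLike and MockPipeline are hypothetical stand-ins for illustration, not web-llm's actual LLMChatPipeline API.

```typescript
// Sketch of the prefill-decode loop. PipelineLike is a hypothetical
// interface loosely mirroring the steps described above.
interface PipelineLike {
  prefillStep(messages: string[]): Promise<void>;
  decodeStep(): Promise<string>; // resolves with the newly decoded text delta
  stopped(): boolean;
}

async function* generateDeltas(
  pipeline: PipelineLike,
  messages: string[],
  interrupted: () => boolean,
): AsyncGenerator<string> {
  await pipeline.prefillStep(messages); // process input tokens once
  while (!pipeline.stopped() && !interrupted()) {
    yield await pipeline.decodeStep(); // one token per iteration
  }
}

// Mock pipeline that "decodes" a fixed token sequence, for demonstration.
class MockPipeline implements PipelineLike {
  private tokens = ["Web", "GPU", "!"];
  private i = 0;
  async prefillStep(_messages: string[]): Promise<void> {}
  async decodeStep(): Promise<string> {
    return this.tokens[this.i++];
  }
  stopped(): boolean {
    return this.i >= this.tokens.length;
  }
}
```

Streaming consumes the generator directly; the non-streaming path simply concatenates every yielded delta into one ChatCompletion message.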

5. Post-processing

  • For function calling requests, parses the output message as JSON tool calls
  • Computes usage statistics (token counts, throughput metrics)
  • Releases the concurrency lock
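
The tool-call parsing step can be sketched as follows. This is a simplified illustration under the assumption that the model emits a JSON array of calls; web-llm's actual parsing depends on the model's tool-call output format, and ToolCallSketch is a hypothetical type.

```typescript
// Sketch of function-calling post-processing: try to interpret the raw
// model output as one or more JSON tool calls; fall back to null (plain
// text) when it does not parse. ToolCallSketch is hypothetical.
interface ToolCallSketch {
  name: string;
  arguments: Record<string, unknown>;
}

function parseToolCalls(raw: string): ToolCallSketch[] | null {
  try {
    const parsed = JSON.parse(raw);
    const calls = Array.isArray(parsed) ? parsed : [parsed];
    // Accept only objects that at least carry a function name.
    if (calls.every((c) => typeof c?.name === "string")) {
      return calls as ToolCallSketch[];
    }
    return null;
  } catch {
    return null; // not valid JSON: treat as ordinary text output
  }
}
```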

Code Reference

  • Repository: https://github.com/mlc-ai/web-llm
  • File: src/openai_api_protocols/chat_completion.ts (Completions proxy class, lines 60-78)
  • File: src/engine.ts (chatCompletion(), lines 767-945; asyncGenerate(), lines 480-749; _generate(), lines 437-459; prefill(), lines 1346-1404; decode(), lines 1409-1411)

Type Signatures

// Completions proxy class in src/openai_api_protocols/chat_completion.ts
export class Completions {
  create(request: ChatCompletionRequestNonStreaming): Promise<ChatCompletion>;
  create(request: ChatCompletionRequestStreaming): Promise<AsyncIterable<ChatCompletionChunk>>;
  create(request: ChatCompletionRequestBase): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
  create(request: ChatCompletionRequest): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;
}

// MLCEngine.chatCompletion() in src/engine.ts
async chatCompletion(request: ChatCompletionRequestNonStreaming): Promise<ChatCompletion>;
async chatCompletion(request: ChatCompletionRequestStreaming): Promise<AsyncIterable<ChatCompletionChunk>>;
async chatCompletion(request: ChatCompletionRequest): Promise<AsyncIterable<ChatCompletionChunk> | ChatCompletion>;

Import

import { CreateMLCEngine } from "@mlc-ai/web-llm";

// The completions API is accessed via the engine instance:
// engine.chat.completions.create(request)

I/O Contract

Direction              | Name     | Type                                          | Required | Description
Input                  | request  | ChatCompletionRequest                         | Yes      | OpenAI-compatible request object with messages and generation parameters
Output (non-streaming) | response | Promise<ChatCompletion>                       | --       | Complete response with choices, message content, and usage statistics
Output (streaming)     | chunks   | Promise<AsyncIterable<ChatCompletionChunk>>   | --       | Async iterable yielding incremental chunks with delta content

Usage Example

Non-Streaming

import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (p) => console.log(p.text),
});

const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is WebGPU?" },
  ],
  temperature: 0.7,
  max_tokens: 256,
});

console.log("Response:", response.choices[0].message.content);
console.log("Finish reason:", response.choices[0].finish_reason);
console.log("Tokens used:", response.usage?.total_tokens);
console.log("Decode speed:", response.usage?.extra.decode_tokens_per_s, "tok/s");

Streaming

const stream = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain quantum computing in simple terms." },
  ],
  temperature: 0.7,
  max_tokens: 512,
  stream: true,
  stream_options: { include_usage: true },
});

let fullResponse = "";
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) {
    fullResponse += delta;
    process.stdout.write(delta);  // Node.js only; in a browser, append delta to a DOM element instead
  }
  if (chunk.usage) {
    console.log("\nPrefill speed:", chunk.usage.extra.prefill_tokens_per_s, "tok/s");
    console.log("Decode speed:", chunk.usage.extra.decode_tokens_per_s, "tok/s");
  }
}

Multi-Round Conversation

// First turn
const reply1 = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a math tutor." },
    { role: "user", content: "What is 2 + 2?" },
  ],
});

// Second turn -- web-llm reuses KV cache from first turn
const reply2 = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a math tutor." },
    { role: "user", content: "What is 2 + 2?" },
    { role: "assistant", content: reply1.choices[0].message.content! },
    { role: "user", content: "Now multiply that by 3." },
  ],
});
