Implementation: @mlc-ai/web-llm Chat Completion Response
Overview
ChatCompletion and ChatCompletionChunk are the TypeScript interfaces provided by @mlc-ai/web-llm for representing inference results. ChatCompletion (non-streaming) contains the full generated text in choices[].message.content plus usage statistics. ChatCompletionChunk (streaming) contains incremental text in choices[].delta.content. Both include CompletionUsage with standard token counts and WebLLM-specific performance metrics (prefill_tokens_per_s, decode_tokens_per_s, time_to_first_token_s, time_per_output_token_s).
Description
ChatCompletion (Non-Streaming Response)
Returned when stream is false or unset. Contains:
- `id` -- A unique UUID for this completion
- `object` -- Always `"chat.completion"`
- `created` -- Unix timestamp (milliseconds) of when the completion was created
- `model` -- The `model_id` of the model that generated the response
- `choices` -- Array of `ChatCompletion.Choice` objects (one per `n` value)
- `usage` -- `CompletionUsage` with token counts and performance metrics
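To make the field list concrete, here is a hand-written object in the `ChatCompletion` shape. Every value is an illustrative placeholder, not real engine output; a real response comes from `engine.chat.completions.create()`.

```typescript
// Illustrative ChatCompletion-shaped object. All values below are made up
// to show the structure; only the field names and types match the interface.
const sampleCompletion = {
  id: "123e4567-e89b-12d3-a456-426614174000", // unique per generation
  object: "chat.completion" as const,
  created: 1700000000000, // Date.now()-style millisecond timestamp
  model: "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  choices: [
    {
      index: 0,
      message: { role: "assistant" as const, content: "Hello!" },
      finish_reason: "stop" as const,
      logprobs: null, // populated only when logprobs were requested
    },
  ],
  usage: {
    completion_tokens: 3,
    prompt_tokens: 12,
    total_tokens: 15,
    extra: {
      e2e_latency_s: 0.42,
      prefill_tokens_per_s: 120.0,
      decode_tokens_per_s: 30.5,
      time_to_first_token_s: 0.1,
      time_per_output_token_s: 0.033,
    },
  },
};
```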
Each ChatCompletion.Choice contains:
- index -- Choice index (0-based)
- `message` -- `ChatCompletionMessage` with `role: "assistant"` and `content` (or `tool_calls` for function calling)
- `finish_reason` -- `"stop"`, `"length"`, `"tool_calls"`, or `"abort"`
- `logprobs` -- Log probability array if requested
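When `finish_reason` is `"tool_calls"`, the generated text lives in `message.tool_calls` rather than `message.content`. The sketch below assumes the OpenAI-style tool-call schema that WebLLM mirrors (`id`, `type`, and a `function` object whose `arguments` is a JSON string); the structural types are declared locally so the snippet stands alone, and field names should be checked against the library's own types before relying on them.

```typescript
// Assumed OpenAI-style tool call shape (check the library's types).
interface ToolCallLike {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is JSON-encoded
}

// Extract parsed tool calls from a choice, or [] if none were requested.
function extractToolCalls(choice: {
  finish_reason: string | null;
  message: { content: string | null; tool_calls?: ToolCallLike[] };
}): Array<{ name: string; args: unknown }> {
  if (choice.finish_reason !== "tool_calls" || !choice.message.tool_calls) {
    return [];
  }
  return choice.message.tool_calls.map((tc) => ({
    name: tc.function.name,
    args: JSON.parse(tc.function.arguments), // arguments arrive as a JSON string
  }));
}
```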
ChatCompletionChunk (Streaming Response)
Yielded incrementally when stream: true. Contains:
- id -- Same UUID across all chunks in a single generation
- `object` -- Always `"chat.completion.chunk"`
- `created` -- Same timestamp across all chunks
- `model` -- The model that generated the response
- `choices` -- Array of `ChatCompletionChunk.Choice` objects
- `usage` -- Present only in the final usage chunk when `stream_options: { include_usage: true }` is set
Each ChatCompletionChunk.Choice contains:
- index -- Choice index
- `delta` -- `Delta` object with incremental `content`, optional `role`, and optional `tool_calls`
- `finish_reason` -- `null` for intermediate chunks; set on the final chunk
CompletionUsage
Token usage and performance statistics:
- `completion_tokens` -- Number of tokens generated
- `prompt_tokens` -- Number of input tokens processed (for multi-round chats, only the new portion)
- `total_tokens` -- Sum of `completion_tokens` and `prompt_tokens`
- `extra` -- WebLLM-specific performance metrics:
  - `e2e_latency_s` -- Total end-to-end latency in seconds
  - `prefill_tokens_per_s` -- Prefill throughput
  - `decode_tokens_per_s` -- Decode throughput
  - `time_to_first_token_s` -- Seconds until the first token is generated
  - `time_per_output_token_s` -- Average seconds per generated token
  - `grammar_init_s` -- (Optional) Grammar matcher initialization time
  - `grammar_per_token_s` -- (Optional) Per-token grammar processing time
  - `latencyBreakdown` -- (Optional) Detailed per-stage timing if `enable_latency_breakdown` was set
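The metrics above are plain numbers, so they compose into whatever reporting a host app needs. As a sketch, here is a small helper (not part of WebLLM) that formats the required `CompletionUsage` fields into a one-line summary; the interface is restated locally so the snippet is self-contained.

```typescript
// Local restatement of the required CompletionUsage fields.
interface UsageLike {
  completion_tokens: number;
  prompt_tokens: number;
  total_tokens: number;
  extra: {
    e2e_latency_s: number;
    prefill_tokens_per_s: number;
    decode_tokens_per_s: number;
    time_to_first_token_s: number;
    time_per_output_token_s: number;
  };
}

// Render token counts and throughput as a compact one-line summary.
function summarizeUsage(u: UsageLike): string {
  return (
    `${u.prompt_tokens} in / ${u.completion_tokens} out ` +
    `(prefill ${u.extra.prefill_tokens_per_s.toFixed(1)} tok/s, ` +
    `decode ${u.extra.decode_tokens_per_s.toFixed(1)} tok/s, ` +
    `TTFT ${u.extra.time_to_first_token_s.toFixed(3)} s)`
  );
}
```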
Code Reference
- Repository: https://github.com/mlc-ai/web-llm
- File: `src/openai_api_protocols/chat_completion.ts`
  - ChatCompletion: Lines 312-356
  - ChatCompletionChunk: Lines 362-407
  - ChatCompletion.Choice: Lines 1038-1075
  - ChatCompletionChunk.Choice: Lines 1077-1168
  - CompletionUsage: Lines 955-1023
  - ChatCompletionFinishReason: Lines 1032-1036
Type Signatures
export interface ChatCompletion {
id: string;
choices: Array<ChatCompletion.Choice>;
model: string;
object: "chat.completion";
created: number;
usage?: CompletionUsage;
system_fingerprint?: string;
}
export interface ChatCompletionChunk {
id: string;
choices: Array<ChatCompletionChunk.Choice>;
created: number;
model: string;
object: "chat.completion.chunk";
system_fingerprint?: string;
usage?: CompletionUsage;
}
export interface CompletionUsage {
completion_tokens: number;
prompt_tokens: number;
total_tokens: number;
extra: {
e2e_latency_s: number;
prefill_tokens_per_s: number;
decode_tokens_per_s: number;
time_to_first_token_s: number;
time_per_output_token_s: number;
grammar_init_s?: number;
grammar_per_token_s?: number;
latencyBreakdown?: LatencyBreakdown;
};
}
export type ChatCompletionFinishReason = "stop" | "length" | "tool_calls" | "abort";
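Because `ChatCompletionFinishReason` is a closed string union, an exhaustive `switch` with a `never` check lets the compiler flag any reason an app forgets to handle. A sketch (the type is reproduced locally so the snippet compiles without the package; the descriptions are paraphrases, not library strings):

```typescript
// Reproduced locally from the signature above for a self-contained example.
type ChatCompletionFinishReason = "stop" | "length" | "tool_calls" | "abort";

// Map each finish reason to a human-readable explanation. The default
// branch assigns to `never`, so adding a new union member becomes a
// compile-time error until this switch is updated.
function describeFinishReason(reason: ChatCompletionFinishReason): string {
  switch (reason) {
    case "stop":
      return "Model stopped naturally (EOS or a stop string).";
    case "length":
      return "Truncated: max_tokens or the context window was reached.";
    case "tool_calls":
      return "Model requested one or more tool calls.";
    case "abort":
      return "Generation was aborted by the caller.";
    default: {
      const exhaustive: never = reason;
      return exhaustive;
    }
  }
}
```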
Import
import {
ChatCompletion,
ChatCompletionChunk,
CompletionUsage,
} from "@mlc-ai/web-llm";
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | Response from `engine.chat.completions.create()` | `ChatCompletion` or `AsyncIterable<ChatCompletionChunk>` | The raw response object returned by the inference engine |
| Output | text | `string` | Extracted generated text from `choices[0].message.content` or concatenated `delta.content` values |
| Output | finish_reason | `ChatCompletionFinishReason` | Why generation stopped: `"stop"`, `"length"`, `"tool_calls"`, or `"abort"` |
| Output | usage | `CompletionUsage` | Token counts and performance metrics |
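The contract can be realized as a single helper that accepts either response form and returns the three outputs. This is a sketch, not library code: minimal structural types stand in for the real interfaces so it runs without `@mlc-ai/web-llm` installed, and `collectResponse` is a hypothetical name.

```typescript
// Minimal structural stand-ins for the WebLLM response types.
interface UsageStub { total_tokens: number }
interface CompletionLike {
  object: "chat.completion";
  choices: Array<{ message: { content: string | null }; finish_reason: string }>;
  usage?: UsageStub;
}
interface ChunkLike {
  object: "chat.completion.chunk";
  choices: Array<{ delta: { content?: string | null }; finish_reason: string | null }>;
  usage?: UsageStub;
}

// Normalize a streaming or non-streaming response into { text, finish_reason, usage }.
async function collectResponse(
  response: CompletionLike | AsyncIterable<ChunkLike>,
): Promise<{ text: string; finish_reason: string | null; usage?: UsageStub }> {
  if (Symbol.asyncIterator in response) {
    let text = "";
    let finish: string | null = null;
    let usage: UsageStub | undefined;
    for await (const chunk of response as AsyncIterable<ChunkLike>) {
      text += chunk.choices[0]?.delta?.content ?? ""; // usage-only chunk has no choices
      finish = chunk.choices[0]?.finish_reason ?? finish;
      usage = chunk.usage ?? usage; // final chunk when include_usage is set
    }
    return { text, finish_reason: finish, usage };
  }
  const c = response as CompletionLike;
  return {
    text: c.choices[0]?.message.content ?? "",
    finish_reason: c.choices[0]?.finish_reason ?? null,
    usage: c.usage,
  };
}
```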
Usage Example
Non-Streaming Response Processing
import { CreateMLCEngine, ChatCompletion } from "@mlc-ai/web-llm";
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
initProgressCallback: (p) => console.log(p.text),
});
const response: ChatCompletion = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain what a transformer is." },
],
temperature: 0.7,
max_tokens: 256,
});
// Extract the generated text
const generatedText = response.choices[0].message.content;
console.log("Generated text:", generatedText);
// Check why generation stopped
const finishReason = response.choices[0].finish_reason;
if (finishReason === "length") {
console.log("Warning: response was truncated due to max_tokens limit.");
}
// Access usage statistics
if (response.usage) {
console.log("Prompt tokens:", response.usage.prompt_tokens);
console.log("Completion tokens:", response.usage.completion_tokens);
console.log("Total tokens:", response.usage.total_tokens);
console.log("Prefill speed:", response.usage.extra.prefill_tokens_per_s.toFixed(1), "tok/s");
console.log("Decode speed:", response.usage.extra.decode_tokens_per_s.toFixed(1), "tok/s");
console.log("Time to first token:", response.usage.extra.time_to_first_token_s.toFixed(3), "s");
console.log("E2E latency:", response.usage.extra.e2e_latency_s.toFixed(3), "s");
}
// Access log probabilities (present only when logprobs: true was set in the request)
if (response.choices[0].logprobs) {
const tokenLogprobs = response.choices[0].logprobs.content;
tokenLogprobs?.forEach((entry) => {
console.log(`Token: "${entry.token}", logprob: ${entry.logprob}`);
});
}
Streaming Response Processing
const stream = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Write a short poem about the ocean." },
],
temperature: 0.8,
max_tokens: 256,
stream: true,
stream_options: { include_usage: true },
});
let fullResponse = "";
let finishReason = "";
for await (const chunk of stream) {
// Process content deltas
const deltaContent = chunk.choices[0]?.delta?.content;
if (deltaContent) {
fullResponse += deltaContent;
// Update UI in real-time
document.getElementById("output")!.textContent = fullResponse;
}
// Check for finish reason in the final content chunk
if (chunk.choices[0]?.finish_reason) {
finishReason = chunk.choices[0].finish_reason;
console.log("Finished with reason:", finishReason);
}
// Process usage statistics (final chunk with stream_options.include_usage)
if (chunk.usage) {
console.log("Prefill:", chunk.usage.extra.prefill_tokens_per_s.toFixed(1), "tok/s");
console.log("Decode:", chunk.usage.extra.decode_tokens_per_s.toFixed(1), "tok/s");
console.log("Total tokens:", chunk.usage.total_tokens);
}
}
console.log("Full response:", fullResponse);