Implementation:Mlc ai Web llm Chat Completion Response

From Leeroopedia

Overview

ChatCompletion and ChatCompletionChunk are the TypeScript interfaces provided by @mlc-ai/web-llm for representing inference results. ChatCompletion (non-streaming) contains the full generated text in choices[].message.content plus usage statistics. ChatCompletionChunk (streaming) contains incremental text in choices[].delta.content. Both include CompletionUsage with standard token counts and WebLLM-specific performance metrics (prefill_tokens_per_s, decode_tokens_per_s, time_to_first_token_s, time_per_output_token_s).

Description

ChatCompletion (Non-Streaming Response)

Returned when stream is false or unset. Contains:

  • id -- A unique UUID for this completion
  • object -- Always "chat.completion"
  • created -- Unix timestamp (milliseconds) of when the completion was created
  • model -- The model_id of the model that generated the response
  • choices -- Array of ChatCompletion.Choice objects (one per requested completion; length equals the n parameter)
  • usage -- CompletionUsage with token counts and performance metrics

Each ChatCompletion.Choice contains:

  • index -- Choice index (0-based)
  • message -- ChatCompletionMessage with role: "assistant" and content (or tool_calls for function calling)
  • finish_reason -- "stop", "length", "tool_calls", or "abort"
  • logprobs -- Log probability array if requested
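The fields above can be read off with a small helper. The sketch below uses a minimal local mirror of the Choice shape rather than the real interface exported by @mlc-ai/web-llm, and `describeChoice` is a hypothetical name, not a library function:

```typescript
// Minimal local mirror of ChatCompletion.Choice for illustration;
// the real interface is exported by @mlc-ai/web-llm.
type FinishReason = "stop" | "length" | "tool_calls" | "abort";

interface ChoiceLike {
  index: number;
  message: { role: "assistant"; content: string | null };
  finish_reason: FinishReason;
}

// Hypothetical helper: return the choice's text, flagging truncation
// when generation stopped because max_tokens was reached.
function describeChoice(choice: ChoiceLike): string {
  const text = choice.message.content ?? "";
  return choice.finish_reason === "length"
    ? `${text} [truncated at max_tokens]`
    : text;
}
```

Checking finish_reason === "length" this way is a cheap guard against silently showing a cut-off answer to the user.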

ChatCompletionChunk (Streaming Response)

Yielded incrementally when stream: true. Contains:

  • id -- Same UUID across all chunks in a single generation
  • object -- Always "chat.completion.chunk"
  • created -- Same timestamp across all chunks
  • model -- The model that generated the response
  • choices -- Array of ChatCompletionChunk.Choice objects
  • usage -- Present only in the final usage chunk when stream_options: { include_usage: true }

Each ChatCompletionChunk.Choice contains:

  • index -- Choice index
  • delta -- Delta object with incremental content, optional role, and optional tool_calls
  • finish_reason -- null for intermediate chunks; set on the final chunk
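Because content arrives as deltas and finish_reason only appears on the final chunk, consumers typically fold the chunk sequence into one message. A minimal sketch, using a local mirror of the chunk shape (the `accumulate` helper is hypothetical, not part of the library):

```typescript
// Minimal local shape of a streaming chunk's relevant fields.
interface ChunkLike {
  choices: Array<{
    index: number;
    delta: { content?: string | null };
    finish_reason: string | null;
  }>;
}

// Hypothetical helper: fold a sequence of chunks into the full text
// plus the finish_reason reported by the last chunk that set one.
function accumulate(chunks: ChunkLike[]): {
  text: string;
  finishReason: string | null;
} {
  let text = "";
  let finishReason: string | null = null;
  for (const chunk of chunks) {
    const choice = chunk.choices[0];
    if (choice?.delta?.content) text += choice.delta.content;
    if (choice?.finish_reason) finishReason = choice.finish_reason;
  }
  return { text, finishReason };
}
```

In a real application the same fold runs inside a `for await` loop over the async iterable, as shown in the streaming example below.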

CompletionUsage

Token usage and performance statistics:

  • completion_tokens -- Number of tokens generated
  • prompt_tokens -- Number of input tokens processed (for multi-round chats, only the new portion)
  • total_tokens -- Sum of completion_tokens and prompt_tokens
  • extra -- WebLLM-specific performance metrics:
    • e2e_latency_s -- Total end-to-end latency in seconds
    • prefill_tokens_per_s -- Prefill throughput
    • decode_tokens_per_s -- Decode throughput
    • time_to_first_token_s -- Seconds until the first token is generated
    • time_per_output_token_s -- Average seconds per generated token
    • grammar_init_s -- (Optional) Grammar matcher initialization time
    • grammar_per_token_s -- (Optional) Per-token grammar processing time
    • latencyBreakdown -- (Optional) Detailed per-stage timing if enable_latency_breakdown was set
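The field definitions above imply two invariants worth sanity-checking: total_tokens is the sum of the two counts, and the average time per output token should be roughly the reciprocal of the decode throughput. A sketch, with a local mirror of the usage shape and a hypothetical `usageIsConsistent` helper:

```typescript
// Minimal local mirror of the CompletionUsage fields used here.
interface UsageLike {
  completion_tokens: number;
  prompt_tokens: number;
  total_tokens: number;
  extra: {
    decode_tokens_per_s: number;
    time_per_output_token_s: number;
  };
}

// Hypothetical sanity check derived from the field definitions:
// counts must add up, and time_per_output_token_s should sit within
// `tolerance` (relative) of 1 / decode_tokens_per_s.
function usageIsConsistent(u: UsageLike, tolerance = 0.05): boolean {
  const sumOk = u.total_tokens === u.prompt_tokens + u.completion_tokens;
  const reciprocal = 1 / u.extra.decode_tokens_per_s;
  const timingOk =
    Math.abs(u.extra.time_per_output_token_s - reciprocal) <=
    tolerance * reciprocal;
  return sumOk && timingOk;
}
```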

Code Reference

  • Repository: https://github.com/mlc-ai/web-llm
  • File: src/openai_api_protocols/chat_completion.ts
  • ChatCompletion: Lines 312-356
  • ChatCompletionChunk: Lines 362-407
  • ChatCompletion.Choice: Lines 1038-1075
  • ChatCompletionChunk.Choice: Lines 1077-1168
  • CompletionUsage: Lines 955-1023
  • ChatCompletionFinishReason: Lines 1032-1036

Type Signatures

export interface ChatCompletion {
  id: string;
  choices: Array<ChatCompletion.Choice>;
  model: string;
  object: "chat.completion";
  created: number;
  usage?: CompletionUsage;
  system_fingerprint?: string;
}

export interface ChatCompletionChunk {
  id: string;
  choices: Array<ChatCompletionChunk.Choice>;
  created: number;
  model: string;
  object: "chat.completion.chunk";
  system_fingerprint?: string;
  usage?: CompletionUsage;
}

export interface CompletionUsage {
  completion_tokens: number;
  prompt_tokens: number;
  total_tokens: number;
  extra: {
    e2e_latency_s: number;
    prefill_tokens_per_s: number;
    decode_tokens_per_s: number;
    time_to_first_token_s: number;
    time_per_output_token_s: number;
    grammar_init_s?: number;
    grammar_per_token_s?: number;
    latencyBreakdown?: LatencyBreakdown;
  };
}

export type ChatCompletionFinishReason = "stop" | "length" | "tool_calls" | "abort";
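Note that the `object` literal ("chat.completion" vs "chat.completion.chunk") acts as a discriminant, so TypeScript can narrow a union of the two response types with a type guard. A sketch using minimal local mirrors (the real interfaces come from @mlc-ai/web-llm; `isChunk` is a hypothetical helper):

```typescript
// Minimal local mirrors of the two response shapes.
interface CompletionLike {
  object: "chat.completion";
  id: string;
}
interface ChunkObjectLike {
  object: "chat.completion.chunk";
  id: string;
}

// The `object` literal discriminates the union, so this guard lets
// TypeScript narrow the type on either branch.
function isChunk(
  r: CompletionLike | ChunkObjectLike,
): r is ChunkObjectLike {
  return r.object === "chat.completion.chunk";
}
```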

Import

import {
  ChatCompletion,
  ChatCompletionChunk,
  CompletionUsage,
} from "@mlc-ai/web-llm";

I/O Contract

  • Input: response from engine.chat.completions.create() -- ChatCompletion or AsyncIterable<ChatCompletionChunk> -- The raw response object returned by the inference engine
  • Output: text -- string -- Generated text from choices[0].message.content (non-streaming) or concatenated delta.content values (streaming)
  • Output: finish_reason -- ChatCompletionFinishReason -- Why generation stopped: "stop", "length", "tool_calls", or "abort"
  • Output: usage -- CompletionUsage -- Token counts and performance metrics
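For the non-streaming case, the contract can be realized as one small mapping function. The sketch below uses a local mirror of the response shape, and `extractOutputs` is a hypothetical name:

```typescript
// Minimal local mirror of the non-streaming response fields used here.
interface NonStreamingLike {
  choices: Array<{
    message: { content: string | null };
    finish_reason: string;
  }>;
  usage?: { total_tokens: number };
}

// Hypothetical helper realizing the non-streaming side of the
// contract: raw response in, the three listed outputs out.
function extractOutputs(r: NonStreamingLike) {
  return {
    text: r.choices[0]?.message.content ?? "",
    finishReason: r.choices[0]?.finish_reason ?? null,
    totalTokens: r.usage?.total_tokens ?? null,
  };
}
```

The streaming side follows the same contract but accumulates delta.content across chunks before producing `text`.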

Usage Example

Non-Streaming Response Processing

import { CreateMLCEngine, ChatCompletion } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (p) => console.log(p.text),
});

const response: ChatCompletion = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain what a transformer is." },
  ],
  temperature: 0.7,
  max_tokens: 256,
});

// Extract the generated text
const generatedText = response.choices[0].message.content;
console.log("Generated text:", generatedText);

// Check why generation stopped
const finishReason = response.choices[0].finish_reason;
if (finishReason === "length") {
  console.log("Warning: response was truncated due to max_tokens limit.");
}

// Access usage statistics
if (response.usage) {
  console.log("Prompt tokens:", response.usage.prompt_tokens);
  console.log("Completion tokens:", response.usage.completion_tokens);
  console.log("Total tokens:", response.usage.total_tokens);
  console.log("Prefill speed:", response.usage.extra.prefill_tokens_per_s.toFixed(1), "tok/s");
  console.log("Decode speed:", response.usage.extra.decode_tokens_per_s.toFixed(1), "tok/s");
  console.log("Time to first token:", response.usage.extra.time_to_first_token_s.toFixed(3), "s");
  console.log("E2E latency:", response.usage.extra.e2e_latency_s.toFixed(3), "s");
}

// Access log probabilities (if requested)
if (response.choices[0].logprobs) {
  const tokenLogprobs = response.choices[0].logprobs.content;
  tokenLogprobs?.forEach((entry) => {
    console.log(`Token: "${entry.token}", logprob: ${entry.logprob}`);
  });
}

Streaming Response Processing

const stream = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Write a short poem about the ocean." },
  ],
  temperature: 0.8,
  max_tokens: 256,
  stream: true,
  stream_options: { include_usage: true },
});

let fullResponse = "";
let finishReason = "";

for await (const chunk of stream) {
  // Process content deltas
  const deltaContent = chunk.choices[0]?.delta?.content;
  if (deltaContent) {
    fullResponse += deltaContent;
    // Update the UI in real time (guard against a missing element)
    const outputEl = document.getElementById("output");
    if (outputEl) outputEl.textContent = fullResponse;
  }

  // Check for finish reason in the final content chunk
  if (chunk.choices[0]?.finish_reason) {
    finishReason = chunk.choices[0].finish_reason;
    console.log("Finished with reason:", finishReason);
  }

  // Process usage statistics (final chunk with stream_options.include_usage)
  if (chunk.usage) {
    console.log("Prefill:", chunk.usage.extra.prefill_tokens_per_s.toFixed(1), "tok/s");
    console.log("Decode:", chunk.usage.extra.decode_tokens_per_s.toFixed(1), "tok/s");
    console.log("Total tokens:", chunk.usage.total_tokens);
  }
}

console.log("Full response:", fullResponse);
