Implementation: @mlc-ai/web-llm Chat Completion Response
Overview
ChatCompletion and ChatCompletionChunk are the TypeScript interfaces provided by @mlc-ai/web-llm for representing inference results. ChatCompletion (non-streaming) contains the full generated text in choices[].message.content plus usage statistics. ChatCompletionChunk (streaming) contains incremental text in choices[].delta.content. Both include CompletionUsage with standard token counts and WebLLM-specific performance metrics (prefill_tokens_per_s, decode_tokens_per_s, time_to_first_token_s, time_per_output_token_s).
Description
ChatCompletion (Non-Streaming Response)
Returned when stream is false or unset. Contains:
- `id` -- A unique UUID for this completion
- `object` -- Always `"chat.completion"`
- `created` -- Unix timestamp (milliseconds) of when the completion was created
- `model` -- The `model_id` of the model that generated the response
- `choices` -- Array of `ChatCompletion.Choice` objects (one per `n` value)
- `usage` -- `CompletionUsage` with token counts and performance metrics
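To make the field list concrete, here is a hand-written object in the `ChatCompletion` shape. Every value is an illustrative placeholder, not real engine output; a real response comes from `engine.chat.completions.create()`.

```typescript
// Illustrative ChatCompletion-shaped object. All values below are made up
// to show the structure; only the field names and types match the interface.
const sampleCompletion = {
  id: "123e4567-e89b-12d3-a456-426614174000", // unique per generation
  object: "chat.completion" as const,
  created: 1700000000000, // Date.now()-style millisecond timestamp
  model: "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  choices: [
    {
      index: 0,
      message: { role: "assistant" as const, content: "Hello!" },
      finish_reason: "stop" as const,
      logprobs: null, // populated only when logprobs were requested
    },
  ],
  usage: {
    completion_tokens: 3,
    prompt_tokens: 12,
    total_tokens: 15,
    extra: {
      e2e_latency_s: 0.42,
      prefill_tokens_per_s: 120.0,
      decode_tokens_per_s: 30.5,
      time_to_first_token_s: 0.1,
      time_per_output_token_s: 0.033,
    },
  },
};
```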
Each ChatCompletion.Choice contains:
- index -- Choice index (0-based)
- `message` -- `ChatCompletionMessage` with `role: "assistant"` and `content` (or `tool_calls` for function calling)
- `finish_reason` -- `"stop"`, `"length"`, `"tool_calls"`, or `"abort"`
- `logprobs` -- Log probability array if requested
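When `finish_reason` is `"tool_calls"`, the generated text lives in `message.tool_calls` rather than `message.content`. The sketch below assumes the OpenAI-style tool-call schema that WebLLM mirrors (`id`, `type`, and a `function` object whose `arguments` is a JSON string); the structural types are declared locally so the snippet stands alone, and field names should be checked against the library's own types before relying on them.

```typescript
// Assumed OpenAI-style tool call shape (check the library's types).
interface ToolCallLike {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is JSON-encoded
}

// Extract parsed tool calls from a choice, or [] if none were requested.
function extractToolCalls(choice: {
  finish_reason: string | null;
  message: { content: string | null; tool_calls?: ToolCallLike[] };
}): Array<{ name: string; args: unknown }> {
  if (choice.finish_reason !== "tool_calls" || !choice.message.tool_calls) {
    return [];
  }
  return choice.message.tool_calls.map((tc) => ({
    name: tc.function.name,
    args: JSON.parse(tc.function.arguments), // arguments arrive as a JSON string
  }));
}
```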
ChatCompletionChunk (Streaming Response)
Yielded incrementally when stream: true. Contains:
- id -- Same UUID across all chunks in a single generation
- `object` -- Always `"chat.completion.chunk"`
- `created` -- Same timestamp across all chunks
- `model` -- The model that generated the response
- `choices` -- Array of `ChatCompletionChunk.Choice` objects
- `usage` -- Present only in the final usage chunk when `stream_options: { include_usage: true }` is set
Each ChatCompletionChunk.Choice contains:
- index -- Choice index
- `delta` -- `Delta` object with incremental `content`, optional `role`, and optional `tool_calls`
- `finish_reason` -- `null` for intermediate chunks; set on the final chunk
CompletionUsage
Token usage and performance statistics:
- `completion_tokens` -- Number of tokens generated
- `prompt_tokens` -- Number of input tokens processed (for multi-round chats, only the new portion)
- `total_tokens` -- Sum of `completion_tokens` and `prompt_tokens`
- `extra` -- WebLLM-specific performance metrics:
  - `e2e_latency_s` -- Total end-to-end latency in seconds
  - `prefill_tokens_per_s` -- Prefill throughput
  - `decode_tokens_per_s` -- Decode throughput
  - `time_to_first_token_s` -- Seconds until the first token is generated
  - `time_per_output_token_s` -- Average seconds per generated token
  - `grammar_init_s` -- (Optional) Grammar matcher initialization time
  - `grammar_per_token_s` -- (Optional) Per-token grammar processing time
  - `latencyBreakdown` -- (Optional) Detailed per-stage timing if `enable_latency_breakdown` was set
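The metrics above are plain numbers, so they compose into whatever reporting a host app needs. As a sketch, here is a small helper (not part of WebLLM) that formats the required `CompletionUsage` fields into a one-line summary; the interface is restated locally so the snippet is self-contained.

```typescript
// Local restatement of the required CompletionUsage fields.
interface UsageLike {
  completion_tokens: number;
  prompt_tokens: number;
  total_tokens: number;
  extra: {
    e2e_latency_s: number;
    prefill_tokens_per_s: number;
    decode_tokens_per_s: number;
    time_to_first_token_s: number;
    time_per_output_token_s: number;
  };
}

// Render token counts and throughput as a compact one-line summary.
function summarizeUsage(u: UsageLike): string {
  return (
    `${u.prompt_tokens} in / ${u.completion_tokens} out ` +
    `(prefill ${u.extra.prefill_tokens_per_s.toFixed(1)} tok/s, ` +
    `decode ${u.extra.decode_tokens_per_s.toFixed(1)} tok/s, ` +
    `TTFT ${u.extra.time_to_first_token_s.toFixed(3)} s)`
  );
}
```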
Code Reference
- Repository: https://github.com/mlc-ai/web-llm
- File: `src/openai_api_protocols/chat_completion.ts`
  - ChatCompletion: Lines 312-356
  - ChatCompletionChunk: Lines 362-407
  - ChatCompletion.Choice: Lines 1038-1075
  - ChatCompletionChunk.Choice: Lines 1077-1168
  - CompletionUsage: Lines 955-1023
  - ChatCompletionFinishReason: Lines 1032-1036
Type Signatures
export interface ChatCompletion {
id: string;
choices: Array<ChatCompletion.Choice>;
model: string;
object: "chat.completion";
created: number;
usage?: CompletionUsage;
system_fingerprint?: string;
}
export interface ChatCompletionChunk {
id: string;
choices: Array<ChatCompletionChunk.Choice>;
created: number;
model: string;
object: "chat.completion.chunk";
system_fingerprint?: string;
usage?: CompletionUsage;
}
export interface CompletionUsage {
completion_tokens: number;
prompt_tokens: number;
total_tokens: number;
extra: {
e2e_latency_s: number;
prefill_tokens_per_s: number;
decode_tokens_per_s: number;
time_to_first_token_s: number;
time_per_output_token_s: number;
grammar_init_s?: number;
grammar_per_token_s?: number;
latencyBreakdown?: LatencyBreakdown;
};
}
export type ChatCompletionFinishReason = "stop" | "length" | "tool_calls" | "abort";
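Because `ChatCompletionFinishReason` is a closed string union, an exhaustive `switch` with a `never` check lets the compiler flag any reason an app forgets to handle. A sketch (the type is reproduced locally so the snippet compiles without the package; the descriptions are paraphrases, not library strings):

```typescript
// Reproduced locally from the signature above for a self-contained example.
type ChatCompletionFinishReason = "stop" | "length" | "tool_calls" | "abort";

// Map each finish reason to a human-readable explanation. The default
// branch assigns to `never`, so adding a new union member becomes a
// compile-time error until this switch is updated.
function describeFinishReason(reason: ChatCompletionFinishReason): string {
  switch (reason) {
    case "stop":
      return "Model stopped naturally (EOS or a stop string).";
    case "length":
      return "Truncated: max_tokens or the context window was reached.";
    case "tool_calls":
      return "Model requested one or more tool calls.";
    case "abort":
      return "Generation was aborted by the caller.";
    default: {
      const exhaustive: never = reason;
      return exhaustive;
    }
  }
}
```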
Import
import {
ChatCompletion,
ChatCompletionChunk,
CompletionUsage,
} from "@mlc-ai/web-llm";
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | Response from `engine.chat.completions.create()` | `ChatCompletion` or `AsyncIterable<ChatCompletionChunk>` | The raw response object returned by the inference engine |
| Output | text | `string` | Extracted generated text from `choices[0].message.content` or concatenated `delta.content` values |
| Output | finish_reason | `ChatCompletionFinishReason` | Why generation stopped: `"stop"`, `"length"`, `"tool_calls"`, or `"abort"` |
| Output | usage | `CompletionUsage` | Token counts and performance metrics |
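The contract can be realized as a single helper that accepts either response form and returns the three outputs. This is a sketch, not library code: minimal structural types stand in for the real interfaces so it runs without `@mlc-ai/web-llm` installed, and `collectResponse` is a hypothetical name.

```typescript
// Minimal structural stand-ins for the WebLLM response types.
interface UsageStub { total_tokens: number }
interface CompletionLike {
  object: "chat.completion";
  choices: Array<{ message: { content: string | null }; finish_reason: string }>;
  usage?: UsageStub;
}
interface ChunkLike {
  object: "chat.completion.chunk";
  choices: Array<{ delta: { content?: string | null }; finish_reason: string | null }>;
  usage?: UsageStub;
}

// Normalize a streaming or non-streaming response into { text, finish_reason, usage }.
async function collectResponse(
  response: CompletionLike | AsyncIterable<ChunkLike>,
): Promise<{ text: string; finish_reason: string | null; usage?: UsageStub }> {
  if (Symbol.asyncIterator in response) {
    let text = "";
    let finish: string | null = null;
    let usage: UsageStub | undefined;
    for await (const chunk of response as AsyncIterable<ChunkLike>) {
      text += chunk.choices[0]?.delta?.content ?? ""; // usage-only chunk has no choices
      finish = chunk.choices[0]?.finish_reason ?? finish;
      usage = chunk.usage ?? usage; // final chunk when include_usage is set
    }
    return { text, finish_reason: finish, usage };
  }
  const c = response as CompletionLike;
  return {
    text: c.choices[0]?.message.content ?? "",
    finish_reason: c.choices[0]?.finish_reason ?? null,
    usage: c.usage,
  };
}
```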
Usage Example
Non-Streaming Response Processing
import { CreateMLCEngine, ChatCompletion } from "@mlc-ai/web-llm";
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
initProgressCallback: (p) => console.log(p.text),
});
const response: ChatCompletion = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain what a transformer is." },
],
temperature: 0.7,
max_tokens: 256,
});
// Extract the generated text
const generatedText = response.choices[0].message.content;
console.log("Generated text:", generatedText);
// Check why generation stopped
const finishReason = response.choices[0].finish_reason;
if (finishReason === "length") {
console.log("Warning: response was truncated due to max_tokens limit.");
}
// Access usage statistics
if (response.usage) {
console.log("Prompt tokens:", response.usage.prompt_tokens);
console.log("Completion tokens:", response.usage.completion_tokens);
console.log("Total tokens:", response.usage.total_tokens);
console.log("Prefill speed:", response.usage.extra.prefill_tokens_per_s.toFixed(1), "tok/s");
console.log("Decode speed:", response.usage.extra.decode_tokens_per_s.toFixed(1), "tok/s");
console.log("Time to first token:", response.usage.extra.time_to_first_token_s.toFixed(3), "s");
console.log("E2E latency:", response.usage.extra.e2e_latency_s.toFixed(3), "s");
}
// Access log probabilities (present only when logprobs: true was set in the request)
if (response.choices[0].logprobs) {
const tokenLogprobs = response.choices[0].logprobs.content;
tokenLogprobs?.forEach((entry) => {
console.log(`Token: "${entry.token}", logprob: ${entry.logprob}`);
});
}
Streaming Response Processing
const stream = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Write a short poem about the ocean." },
],
temperature: 0.8,
max_tokens: 256,
stream: true,
stream_options: { include_usage: true },
});
let fullResponse = "";
let finishReason = "";
for await (const chunk of stream) {
// Process content deltas
const deltaContent = chunk.choices[0]?.delta?.content;
if (deltaContent) {
fullResponse += deltaContent;
// Update UI in real-time
document.getElementById("output")!.textContent = fullResponse;
}
// Check for finish reason in the final content chunk
if (chunk.choices[0]?.finish_reason) {
finishReason = chunk.choices[0].finish_reason;
console.log("Finished with reason:", finishReason);
}
// Process usage statistics (final chunk with stream_options.include_usage)
if (chunk.usage) {
console.log("Prefill:", chunk.usage.extra.prefill_tokens_per_s.toFixed(1), "tok/s");
console.log("Decode:", chunk.usage.extra.decode_tokens_per_s.toFixed(1), "tok/s");
console.log("Total tokens:", chunk.usage.total_tokens);
}
}
console.log("Full response:", fullResponse);