Principle: MLC-AI WebLLM Chat Inference
Overview
Chat Inference is the process of generating text completions from a language model by iteratively predicting the next token given conversation context. In browser-based LLMs powered by WebGPU, this follows a prefill-decode architecture that splits the computation into two distinct phases optimized for different hardware characteristics.
Description
Chat inference in browser-based LLMs follows a two-phase autoregressive generation loop:
Prefill Phase
The prefill phase processes all input tokens in parallel to build the key-value (KV) cache. This phase:
- Takes the full conversation context (system prompt + conversation history + current user message) as input
- Processes all tokens through the transformer in a single batched forward pass
- Populates the KV cache with attention key and value vectors for every input token
- Produces the first output token
- Is compute-bound -- dominated by batch matrix multiplication operations
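The prefill pass can be sketched as a toy single-head layer that projects every input position to K/V in one batched sweep. All names here (`KVCache`, `prefill`, the stand-in projections) are illustrative, not WebLLM internals:

```typescript
// Toy prefill: project every input embedding to K/V in one pass and store
// them, returning the cache plus the last hidden state (which is what the
// model uses to predict the first output token).
type Vec = number[];
interface KVCache { keys: Vec[]; values: Vec[] }

// Stand-in "projections": real models use learned weight matrices.
const projectK = (x: Vec): Vec => x.map(v => v * 0.5);
const projectV = (x: Vec): Vec => x.map(v => v + 1);

function prefill(embeddings: Vec[]): { cache: KVCache; last: Vec } {
  const cache: KVCache = { keys: [], values: [] };
  for (const x of embeddings) {        // conceptually one batched matmul
    cache.keys.push(projectK(x));
    cache.values.push(projectV(x));
  }
  return { cache, last: embeddings[embeddings.length - 1] };
}

const { cache } = prefill([[1, 0], [0, 1], [1, 1]]);
console.log(cache.keys.length);  // 3 -- one K/V entry per input token
```

The key property is that the cache ends up with exactly one K/V entry per input token, so the decode phase never has to revisit the prompt.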
Decode Phase
The decode phase generates tokens one at a time, autoregressively, using the KV cache for efficient attention computation. This phase:
- Takes the most recently generated token as input
- Computes attention over the existing KV cache entries (no need to reprocess earlier tokens)
- Appends the new token's key and value vectors to the KV cache
- Samples the next token from the model's output logit distribution according to configured sampling parameters
- Repeats until a stop condition is met
- Is memory-bound -- dominated by sequential attention lookups against the growing KV cache
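A single decode step boils down to one query attending over every cached K/V entry. The sketch below is a toy single-head version with illustrative names, not the WebLLM implementation:

```typescript
// Toy decode step: one query attends over all cached K/V entries, and the
// new token's K/V are appended to the cache first so it can attend to itself.
type Vec = number[];
interface KVCache { keys: Vec[]; values: Vec[] }

const dot = (a: Vec, b: Vec) => a.reduce((s, v, i) => s + v * b[i], 0);

function softmax(xs: number[]): number[] {
  const m = Math.max(...xs);
  const e = xs.map(x => Math.exp(x - m));
  const z = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / z);
}

// output = softmax(q . K^T / sqrt(d_k)) . V
function decodeStep(q: Vec, k: Vec, v: Vec, cache: KVCache): Vec {
  cache.keys.push(k);
  cache.values.push(v);
  const scale = Math.sqrt(q.length);
  const weights = softmax(cache.keys.map(key => dot(q, key) / scale));
  return cache.values.reduce(
    (out, val, i) => out.map((o, j) => o + weights[i] * val[j]),
    q.map(() => 0),
  );
}

const cache: KVCache = { keys: [[1, 0]], values: [[1, 0]] };
decodeStep([1, 0], [0, 1], [0, 1], cache);
console.log(cache.keys.length); // 2 -- the new token's K/V were appended
```

Note that the per-step work grows with the cache length, which is why decode becomes memory-bandwidth-bound as the conversation gets longer.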
Stop Conditions
Generation terminates when any of the following conditions is met:
- stop -- The model generates a natural stop token or a user-specified stop sequence
- length -- The generated output reaches `max_tokens` or the context window is exhausted
- abort -- The user manually interrupts generation via `engine.interruptGenerate()`
- tool_calls -- When function calling is active and the model finishes generating tool call output
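The four conditions above can be sketched as a single check run after each decoded token, returning an OpenAI-style finish reason. The names, field layout, and precedence order here are illustrative assumptions, not the WebLLM source:

```typescript
// Sketch of a per-token stop-condition check; returns a finish_reason,
// or null to keep generating. Precedence order is an assumption.
type FinishReason = "stop" | "length" | "abort" | "tool_calls" | null;

interface GenState {
  lastToken: number;
  generated: number;      // tokens generated so far
  contextUsed: number;    // prompt + generated tokens
  interrupted: boolean;   // set by a manual interrupt
  toolCallDone: boolean;  // function-calling output complete
}

function checkStop(
  s: GenState,
  stopTokens: Set<number>,
  maxTokens: number,
  contextWindow: number,
): FinishReason {
  if (s.interrupted) return "abort";
  if (s.toolCallDone) return "tool_calls";
  if (stopTokens.has(s.lastToken)) return "stop";
  if (s.generated >= maxTokens || s.contextUsed >= contextWindow) return "length";
  return null;            // no condition met: decode another token
}
```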
Usage
Use chat inference as the core step after creating an engine and constructing a request. It supports two consumption modes:
- Non-streaming -- The engine buffers all generated tokens internally and returns the complete response as a single `ChatCompletion` object. Use this when the complete response is needed before processing (e.g., JSON parsing, tool call extraction).
- Streaming -- The engine yields `ChatCompletionChunk` objects as each token is generated, enabling real-time UI updates. Use this for interactive chat interfaces where perceived latency matters.
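The streaming consumption pattern can be sketched with a mock async generator standing in for the engine; the chunk shape is simplified from the OpenAI-style API, and in real code the iterable comes from the engine's chat completion call with streaming enabled:

```typescript
// Mock of streaming consumption: an async generator stands in for the
// engine and yields simplified ChatCompletionChunk-shaped objects.
interface Chunk { choices: { delta: { content?: string } }[] }

async function* mockStream(text: string): AsyncGenerator<Chunk> {
  for (const word of text.split(" ")) {
    yield { choices: [{ delta: { content: word + " " } }] };
  }
}

// Streaming mode: process each chunk as it arrives (real-time UI updates).
// Non-streaming mode would instead await one fully buffered response object.
async function consumeStreaming(): Promise<string> {
  let out = "";
  for await (const chunk of mockStream("hello from the decode loop")) {
    out += chunk.choices[0]?.delta.content ?? "";
  }
  return out;
}

consumeStreaming().then(full => console.log(full.trim()));
```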
Multi-round conversation support is built into the inference flow. When the engine detects that the current request's conversation history matches the previous request (via compareConversationObject()), it reuses the existing KV cache and only processes the new user message in the prefill step. This optimization avoids redundant reprocessing of earlier conversation turns.
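The prefix comparison behind this reuse can be sketched as follows; `reusableSuffix` is a hypothetical analog of compareConversationObject(), not the actual function:

```typescript
// Hypothetical analog of compareConversationObject(): if the new request's
// messages extend the previous request's messages, only the new suffix
// needs prefilling; otherwise the KV cache must be rebuilt from scratch.
interface Msg { role: string; content: string }

function reusableSuffix(prev: Msg[], next: Msg[]): Msg[] | null {
  if (next.length < prev.length) return null;   // history changed: reset cache
  for (let i = 0; i < prev.length; i++) {
    if (prev[i].role !== next[i].role || prev[i].content !== next[i].content) {
      return null;                              // mismatch: reset cache
    }
  }
  return next.slice(prev.length);               // only these get prefilled
}
```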
Theoretical Basis
Autoregressive language model inference operates as follows:
- Prefill: Process input tokens [x_1, ..., x_n] through the transformer to build KV cache entries for each layer's attention mechanism. The KV cache stores the projected key (K) and value (V) matrices for all processed positions, so each subsequent decode step computes K and V only for its single new token instead of recomputing them for the entire sequence.
- Decode: For each new token position n+1:
  - Compute the query vector Q_{n+1} from the most recently generated token's embedding
  - Compute attention scores: attention = softmax(Q_{n+1} * K^T / sqrt(d_k))
  - Compute the weighted value: output = attention * V
  - Pass through the feed-forward network and layer norm
  - Project to vocabulary logits
  - Sample token x_{n+1} from the logits using the configured sampling strategy
- Sampling: The sampling process applies, in order:
  - Logit bias -- Additive bias from the `logit_bias` parameter
  - Custom logit processors -- User-registered `LogitProcessor` functions
  - Repetition/frequency/presence penalties -- Multiplicative and additive adjustments based on token occurrence history
  - Temperature scaling -- Division of logits by temperature
  - Top-p (nucleus) filtering -- Zeroing out tokens outside the top-p probability mass
  - Categorical sampling -- Drawing a token from the resulting distribution
- Termination check: Compare the generated token against stop tokens, check `max_tokens`, check context window overflow, and check the interrupt signal.
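The sampling stages above can be sketched as one function. This is a simplified, illustrative pipeline under assumed parameter names; real penalty bookkeeping and the user-registered logit-processor hook (which would run after the bias step) are richer than shown:

```typescript
// Sketch of the sampling pipeline, in the order listed above. Illustrative
// assumptions: `opts` field names and the simplified penalty formula.
function softmax(xs: number[]): number[] {
  const m = Math.max(...xs);
  const e = xs.map(x => Math.exp(x - m));
  const z = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / z);
}

function sampleToken(
  logits: number[],
  opts: {
    logitBias?: Record<number, number>;
    seen?: Map<number, number>;   // token id -> occurrence count
    presencePenalty?: number;
    frequencyPenalty?: number;
    temperature?: number;
    topP?: number;
    rand?: () => number;          // injectable for deterministic tests
  } = {},
): number {
  const l = logits.slice();
  // 1. additive logit bias (custom LogitProcessor hooks would run next)
  for (const [id, b] of Object.entries(opts.logitBias ?? {})) l[+id] += b;
  // 2. presence/frequency penalties for tokens already generated
  for (const [id, count] of opts.seen ?? []) {
    l[id] -= (opts.presencePenalty ?? 0) + count * (opts.frequencyPenalty ?? 0);
  }
  // 3. temperature scaling, then normalize to probabilities
  const t = opts.temperature ?? 1;
  const probs = softmax(l.map(x => x / t));
  // 4. top-p: keep the smallest high-probability set whose mass reaches topP
  const order: [number, number][] = probs.map((p, i) => [p, i]);
  order.sort((a, b) => b[0] - a[0]);
  const kept: [number, number][] = [];
  let mass = 0;
  for (const [p, i] of order) {
    kept.push([p, i]);
    mass += p;
    if (mass >= (opts.topP ?? 1)) break;
  }
  // 5. categorical draw from the renormalized kept set
  let r = (opts.rand ?? Math.random)() * mass;
  for (const [p, i] of kept) {
    r -= p;
    if (r <= 0) return i;
  }
  return kept[kept.length - 1][1];
}
```

With `rand` fixed at 0 the draw degenerates to picking the highest-probability surviving token, which makes the pipeline easy to test deterministically.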
Performance Characteristics
- Prefill throughput is measured in tokens/second for the input processing phase and is typically much higher than decode throughput due to parallelism
- Decode throughput is measured in tokens/second for the generation phase and is bottlenecked by memory bandwidth
- Time to first token is dominated by the prefill phase and grows with input length
- Time per output token corresponds to a single decode step
These metrics are reported in `CompletionUsage.extra` as `prefill_tokens_per_s`, `decode_tokens_per_s`, `time_to_first_token_s`, and `time_per_output_token_s`.
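How the throughput and latency metrics relate can be shown with back-of-envelope arithmetic: time to first token is roughly prompt length divided by prefill throughput, and each further token costs roughly the reciprocal of decode throughput. The function below is illustrative arithmetic, not WebLLM code:

```typescript
// Back-of-envelope latency model from the two throughput metrics:
//   time_to_first_token_s  ~ prompt_tokens / prefill_tokens_per_s
//   time_per_output_token_s ~ 1 / decode_tokens_per_s
function estimateLatency(
  promptTokens: number,
  outputTokens: number,
  prefillTps: number,
  decodeTps: number,
): { timeToFirstTokenS: number; totalS: number } {
  const timeToFirstTokenS = promptTokens / prefillTps;
  const timePerOutputTokenS = 1 / decodeTps;
  return {
    timeToFirstTokenS,
    // the first output token arrives with prefill; the rest are decode steps
    totalS: timeToFirstTokenS + (outputTokens - 1) * timePerOutputTokenS,
  };
}

// e.g. a 1000-token prompt at 500 tok/s prefill, 100 tokens out at 20 tok/s
console.log(estimateLatency(1000, 100, 500, 20));
```

This also makes the qualitative claims above concrete: growing the prompt inflates only the first term, while decode throughput dominates total time for long outputs.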