Principle: MLC-AI WebLLM Chat Inference
Overview
Chat Inference is the process of generating text completions from a language model by iteratively predicting the next token given conversation context. In browser-based LLMs powered by WebGPU, this follows a prefill-decode architecture that splits the computation into two distinct phases optimized for different hardware characteristics.
Description
Chat inference in browser-based LLMs follows a two-phase autoregressive generation loop:
Prefill Phase
The prefill phase processes all input tokens in parallel to build the key-value (KV) cache. This phase:
- Takes the full conversation context (system prompt + conversation history + current user message) as input
- Processes all tokens through the transformer in a single batched forward pass
- Populates the KV cache with attention key and value vectors for every input token
- Produces the first output token
- Is compute-bound -- dominated by batch matrix multiplication operations
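The prefill pass can be sketched as a toy single-head layer that projects every input position to K/V in one batched sweep. All names here (`KVCache`, `prefill`, the stand-in projections) are illustrative, not WebLLM internals:

```typescript
// Toy prefill: project every input embedding to K/V in one pass and store
// them, returning the cache plus the last hidden state (which is what the
// model uses to predict the first output token).
type Vec = number[];
interface KVCache { keys: Vec[]; values: Vec[] }

// Stand-in "projections": real models use learned weight matrices.
const projectK = (x: Vec): Vec => x.map(v => v * 0.5);
const projectV = (x: Vec): Vec => x.map(v => v + 1);

function prefill(embeddings: Vec[]): { cache: KVCache; last: Vec } {
  const cache: KVCache = { keys: [], values: [] };
  for (const x of embeddings) {        // conceptually one batched matmul
    cache.keys.push(projectK(x));
    cache.values.push(projectV(x));
  }
  return { cache, last: embeddings[embeddings.length - 1] };
}

const { cache } = prefill([[1, 0], [0, 1], [1, 1]]);
console.log(cache.keys.length);  // 3 -- one K/V entry per input token
```

The key property is that the cache ends up with exactly one K/V entry per input token, so the decode phase never has to revisit the prompt.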
Decode Phase
The decode phase generates tokens one at a time, autoregressively, using the KV cache for efficient attention computation. This phase:
- Takes the most recently generated token as input
- Computes attention over the existing KV cache entries (no need to reprocess earlier tokens)
- Appends the new token's key and value vectors to the KV cache
- Samples the next token from the model's output logit distribution according to configured sampling parameters
- Repeats until a stop condition is met
- Is memory-bound -- dominated by sequential attention lookups against the growing KV cache
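A single decode step boils down to one query attending over every cached K/V entry. The sketch below is a toy single-head version with illustrative names, not the WebLLM implementation:

```typescript
// Toy decode step: one query attends over all cached K/V entries, and the
// new token's K/V are appended to the cache first so it can attend to itself.
type Vec = number[];
interface KVCache { keys: Vec[]; values: Vec[] }

const dot = (a: Vec, b: Vec) => a.reduce((s, v, i) => s + v * b[i], 0);

function softmax(xs: number[]): number[] {
  const m = Math.max(...xs);
  const e = xs.map(x => Math.exp(x - m));
  const z = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / z);
}

// output = softmax(q . K^T / sqrt(d_k)) . V
function decodeStep(q: Vec, k: Vec, v: Vec, cache: KVCache): Vec {
  cache.keys.push(k);
  cache.values.push(v);
  const scale = Math.sqrt(q.length);
  const weights = softmax(cache.keys.map(key => dot(q, key) / scale));
  return cache.values.reduce(
    (out, val, i) => out.map((o, j) => o + weights[i] * val[j]),
    q.map(() => 0),
  );
}

const cache: KVCache = { keys: [[1, 0]], values: [[1, 0]] };
decodeStep([1, 0], [0, 1], [0, 1], cache);
console.log(cache.keys.length); // 2 -- the new token's K/V were appended
```

Note that the per-step work grows with the cache length, which is why decode becomes memory-bandwidth-bound as the conversation gets longer.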
Stop Conditions
Generation terminates when any of the following conditions is met:
- stop -- The model generates a natural stop token or a user-specified stop sequence
- length -- The generated output reaches `max_tokens` or the context window is exhausted
- abort -- The user manually interrupts generation via `engine.interruptGenerate()`
- tool_calls -- When function calling is active and the model finishes generating tool call output
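The four conditions above can be sketched as a single check run after each decoded token, returning an OpenAI-style finish reason. The names, field layout, and precedence order here are illustrative assumptions, not the WebLLM source:

```typescript
// Sketch of a per-token stop-condition check; returns a finish_reason,
// or null to keep generating. Precedence order is an assumption.
type FinishReason = "stop" | "length" | "abort" | "tool_calls" | null;

interface GenState {
  lastToken: number;
  generated: number;      // tokens generated so far
  contextUsed: number;    // prompt + generated tokens
  interrupted: boolean;   // set by a manual interrupt
  toolCallDone: boolean;  // function-calling output complete
}

function checkStop(
  s: GenState,
  stopTokens: Set<number>,
  maxTokens: number,
  contextWindow: number,
): FinishReason {
  if (s.interrupted) return "abort";
  if (s.toolCallDone) return "tool_calls";
  if (stopTokens.has(s.lastToken)) return "stop";
  if (s.generated >= maxTokens || s.contextUsed >= contextWindow) return "length";
  return null;            // no condition met: decode another token
}
```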
Usage
Use chat inference as the core step after creating an engine and constructing a request. It supports two consumption modes:
- Non-streaming -- The engine buffers all generated tokens internally and returns the complete response as a single `ChatCompletion` object. Use this when the complete response is needed before processing (e.g., JSON parsing, tool call extraction).
- Streaming -- The engine yields `ChatCompletionChunk` objects as each token is generated, enabling real-time UI updates. Use this for interactive chat interfaces where perceived latency matters.
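The streaming consumption pattern can be sketched with a mock async generator standing in for the engine; the chunk shape is simplified from the OpenAI-style API, and in real code the iterable comes from the engine's chat completion call with streaming enabled:

```typescript
// Mock of streaming consumption: an async generator stands in for the
// engine and yields simplified ChatCompletionChunk-shaped objects.
interface Chunk { choices: { delta: { content?: string } }[] }

async function* mockStream(text: string): AsyncGenerator<Chunk> {
  for (const word of text.split(" ")) {
    yield { choices: [{ delta: { content: word + " " } }] };
  }
}

// Streaming mode: process each chunk as it arrives (real-time UI updates).
// Non-streaming mode would instead await one fully buffered response object.
async function consumeStreaming(): Promise<string> {
  let out = "";
  for await (const chunk of mockStream("hello from the decode loop")) {
    out += chunk.choices[0]?.delta.content ?? "";
  }
  return out;
}

consumeStreaming().then(full => console.log(full.trim()));
```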
Multi-round conversation support is built into the inference flow. When the engine detects that the current request's conversation history matches the previous request (via compareConversationObject()), it reuses the existing KV cache and only processes the new user message in the prefill step. This optimization avoids redundant reprocessing of earlier conversation turns.
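The prefix comparison behind this reuse can be sketched as follows; `reusableSuffix` is a hypothetical analog of compareConversationObject(), not the actual function:

```typescript
// Hypothetical analog of compareConversationObject(): if the new request's
// messages extend the previous request's messages, only the new suffix
// needs prefilling; otherwise the KV cache must be rebuilt from scratch.
interface Msg { role: string; content: string }

function reusableSuffix(prev: Msg[], next: Msg[]): Msg[] | null {
  if (next.length < prev.length) return null;   // history changed: reset cache
  for (let i = 0; i < prev.length; i++) {
    if (prev[i].role !== next[i].role || prev[i].content !== next[i].content) {
      return null;                              // mismatch: reset cache
    }
  }
  return next.slice(prev.length);               // only these get prefilled
}
```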
Theoretical Basis
Autoregressive language model inference operates as follows:
- Prefill: Process input tokens [x_1, ..., x_n] through the transformer to build KV cache entries for each layer's attention mechanism. The KV cache stores the projected key (K) and value (V) matrices for all processed positions, so each subsequent decode step computes K and V only for its single new token instead of recomputing them for the entire sequence.
- Decode: For each new token position n+1:
  - Compute the query vector Q_{n+1} from the most recently generated token's embedding
  - Compute attention scores: attention = softmax(Q_{n+1} * K^T / sqrt(d_k))
  - Compute the weighted value: output = attention * V
  - Pass through the feed-forward network and layer norm
  - Project to vocabulary logits
  - Sample token x_{n+1} from the logits using the configured sampling strategy
- Sampling: The sampling process applies, in order:
  - Logit bias -- Additive bias from the `logit_bias` parameter
  - Custom logit processors -- User-registered `LogitProcessor` functions
  - Repetition/frequency/presence penalties -- Multiplicative and additive adjustments based on token occurrence history
  - Temperature scaling -- Division of logits by temperature
  - Top-p (nucleus) filtering -- Zeroing out tokens outside the top-p probability mass
  - Categorical sampling -- Drawing a token from the resulting distribution
- Termination check: Compare the generated token against stop tokens, check `max_tokens`, check context window overflow, and check the interrupt signal.
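The sampling stages above can be sketched as one function. This is a simplified, illustrative pipeline under assumed parameter names; real penalty bookkeeping and the user-registered logit-processor hook (which would run after the bias step) are richer than shown:

```typescript
// Sketch of the sampling pipeline, in the order listed above. Illustrative
// assumptions: `opts` field names and the simplified penalty formula.
function softmax(xs: number[]): number[] {
  const m = Math.max(...xs);
  const e = xs.map(x => Math.exp(x - m));
  const z = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / z);
}

function sampleToken(
  logits: number[],
  opts: {
    logitBias?: Record<number, number>;
    seen?: Map<number, number>;   // token id -> occurrence count
    presencePenalty?: number;
    frequencyPenalty?: number;
    temperature?: number;
    topP?: number;
    rand?: () => number;          // injectable for deterministic tests
  } = {},
): number {
  const l = logits.slice();
  // 1. additive logit bias (custom LogitProcessor hooks would run next)
  for (const [id, b] of Object.entries(opts.logitBias ?? {})) l[+id] += b;
  // 2. presence/frequency penalties for tokens already generated
  for (const [id, count] of opts.seen ?? []) {
    l[id] -= (opts.presencePenalty ?? 0) + count * (opts.frequencyPenalty ?? 0);
  }
  // 3. temperature scaling, then normalize to probabilities
  const t = opts.temperature ?? 1;
  const probs = softmax(l.map(x => x / t));
  // 4. top-p: keep the smallest high-probability set whose mass reaches topP
  const order: [number, number][] = probs.map((p, i) => [p, i]);
  order.sort((a, b) => b[0] - a[0]);
  const kept: [number, number][] = [];
  let mass = 0;
  for (const [p, i] of order) {
    kept.push([p, i]);
    mass += p;
    if (mass >= (opts.topP ?? 1)) break;
  }
  // 5. categorical draw from the renormalized kept set
  let r = (opts.rand ?? Math.random)() * mass;
  for (const [p, i] of kept) {
    r -= p;
    if (r <= 0) return i;
  }
  return kept[kept.length - 1][1];
}
```

With `rand` fixed at 0 the draw degenerates to picking the highest-probability surviving token, which makes the pipeline easy to test deterministically.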
Performance Characteristics
- Prefill throughput is measured in tokens/second for the input processing phase and is typically much higher than decode throughput due to parallelism
- Decode throughput is measured in tokens/second for the generation phase and is bottlenecked by memory bandwidth
- Time to first token is dominated by the prefill phase and grows with input length
- Time per output token corresponds to a single decode step
These metrics are reported in `CompletionUsage.extra` as `prefill_tokens_per_s`, `decode_tokens_per_s`, `time_to_first_token_s`, and `time_per_output_token_s`.
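How the throughput and latency metrics relate can be shown with back-of-envelope arithmetic: time to first token is roughly prompt length divided by prefill throughput, and each further token costs roughly the reciprocal of decode throughput. The function below is illustrative arithmetic, not WebLLM code:

```typescript
// Back-of-envelope latency model from the two throughput metrics:
//   time_to_first_token_s  ~ prompt_tokens / prefill_tokens_per_s
//   time_per_output_token_s ~ 1 / decode_tokens_per_s
function estimateLatency(
  promptTokens: number,
  outputTokens: number,
  prefillTps: number,
  decodeTps: number,
): { timeToFirstTokenS: number; totalS: number } {
  const timeToFirstTokenS = promptTokens / prefillTps;
  const timePerOutputTokenS = 1 / decodeTps;
  return {
    timeToFirstTokenS,
    // the first output token arrives with prefill; the rest are decode steps
    totalS: timeToFirstTokenS + (outputTokens - 1) * timePerOutputTokenS,
  };
}

// e.g. a 1000-token prompt at 500 tok/s prefill, 100 tokens out at 20 tok/s
console.log(estimateLatency(1000, 100, 500, 20));
```

This also makes the qualitative claims above concrete: growing the prompt inflates only the first term, while decode throughput dominates total time for long outputs.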