Principle:Mlc ai Web llm Chat Inference

From Leeroopedia

Overview

Chat Inference is the process of generating text completions from a language model by iteratively predicting the next token given conversation context. In browser-based LLMs powered by WebGPU, this follows a prefill-decode architecture that splits the computation into two distinct phases optimized for different hardware characteristics.

Description

Chat inference in browser-based LLMs follows a two-phase autoregressive generation loop:

Prefill Phase

The prefill phase processes all input tokens in parallel to build the key-value (KV) cache. This phase:

  • Takes the full conversation context (system prompt + conversation history + current user message) as input
  • Processes all tokens through the transformer in a single batched forward pass
  • Populates the KV cache with attention key and value vectors for every input token
  • Produces the first output token
  • Is compute-bound -- dominated by batch matrix multiplication operations

Decode Phase

The decode phase generates tokens one at a time, autoregressively, using the KV cache for efficient attention computation. This phase:

  • Takes the most recently generated token as input
  • Computes attention over the existing KV cache entries (no need to reprocess earlier tokens)
  • Appends the new token's key and value vectors to the KV cache
  • Samples the next token from the model's output logit distribution according to configured sampling parameters
  • Repeats until a stop condition is met
  • Is memory-bound -- dominated by sequential attention lookups against the growing KV cache
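The data flow of a single decode step can be sketched with a toy attention computation. This is illustrative TypeScript, not WebLLM's WebGPU kernels; the names (`decodeStep`, `Vec`) are local to the sketch, and the shapes are tiny:

```typescript
// Toy single-query attention over a KV cache, illustrating why decode is
// memory-bound: each step reads every cached key/value entry exactly once.

type Vec = number[];

function dot(a: Vec, b: Vec): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

function softmax(xs: Vec): Vec {
  const m = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - m));
  const z = exps.reduce((s, x) => s + x, 0);
  return exps.map((x) => x / z);
}

// One decode step: append the new token's key/value to the cache, then
// attend the new query against every cached key.
function decodeStep(
  q: Vec,
  k: Vec,
  v: Vec,
  kCache: Vec[],
  vCache: Vec[],
): Vec {
  kCache.push(k);
  vCache.push(v);
  const dk = q.length;
  const scores = kCache.map((key) => dot(q, key) / Math.sqrt(dk));
  const weights = softmax(scores);
  // Weighted sum of cached values -- O(cacheLen) memory traffic per token.
  return vCache[0].map((_, j) =>
    weights.reduce((s, w, t) => s + w * vCache[t][j], 0),
  );
}
```

The cost of each call grows linearly with the cache length, which is why decode throughput degrades as the conversation gets longer.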

Stop Conditions

Generation terminates when any of the following conditions is met:

  • stop -- The model generates a natural stop token or a user-specified stop sequence
  • length -- The generated output reaches max_tokens or the context window is exhausted
  • abort -- The user manually interrupts generation via engine.interruptGenerate()
  • tool_calls -- Function calling is active and the model has finished generating its tool call output

Usage

Use chat inference as the core step after creating an engine and constructing a request. It supports two consumption modes:

  • Non-streaming -- The engine buffers all generated tokens internally and returns the complete response as a single ChatCompletion object. Use this when the complete response is needed before processing (e.g., JSON parsing, tool call extraction).
  • Streaming -- The engine yields ChatCompletionChunk objects as each token is generated, enabling real-time UI updates. Use this for interactive chat interfaces where perceived latency matters.
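The difference between the two modes can be sketched against an OpenAI-style interface. `FakeEngine` below stands in for a real MLCEngine so the example is self-contained; the chunk/choice shapes follow the OpenAI chat-completions convention that WebLLM mirrors, but the class itself is hypothetical:

```typescript
// Non-streaming vs. streaming consumption of the same token sequence.

interface Chunk {
  choices: { delta: { content?: string } }[];
}

class FakeEngine {
  constructor(private tokens: string[]) {}

  // Non-streaming: buffer every token internally, return one full message.
  async complete(): Promise<string> {
    let text = "";
    for await (const chunk of this.stream()) {
      text += chunk.choices[0]?.delta.content ?? "";
    }
    return text;
  }

  // Streaming: yield one chunk per generated token as it is produced.
  async *stream(): AsyncGenerator<Chunk> {
    for (const tok of this.tokens) {
      yield { choices: [{ delta: { content: tok } }] };
    }
  }
}
```

With the real engine, the mode is selected by the request itself: passing `stream: true` to `engine.chat.completions.create` returns the chunk iterator instead of a buffered ChatCompletion.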

Multi-round conversation support is built into the inference flow. When the engine detects that the current request's conversation history matches the previous request (via compareConversationObject()), it reuses the existing KV cache and only processes the new user message in the prefill step. This optimization avoids redundant reprocessing of earlier conversation turns.
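The reuse check can be sketched as a structural comparison of message histories. WebLLM's `compareConversationObject()` performs a deeper comparison of the conversation objects; this toy version (`canReuseKVCache` is a name local to the sketch) matches roles and contents only:

```typescript
// KV cache is reusable when the new request is exactly the previous
// conversation plus one appended user message -- then only that message
// needs to go through prefill.

interface Msg {
  role: "system" | "user" | "assistant";
  content: string;
}

function canReuseKVCache(prev: Msg[], next: Msg[]): boolean {
  if (next.length !== prev.length + 1) return false;
  if (next[next.length - 1].role !== "user") return false;
  return prev.every(
    (m, i) => m.role === next[i].role && m.content === next[i].content,
  );
}
```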

Theoretical Basis

Autoregressive language model inference operates as follows:

  1. Prefill: Process input tokens [x_1, ..., x_n] through the transformer to build KV cache entries for each layer's attention mechanism. The KV cache stores the projected key (K) and value (V) matrices for all processed positions, enabling O(1) attention lookups per new token.
  2. Decode: For each new token position n+1:
    1. Compute the query vector Q_{n+1} from the embedding of the most recently generated token
    2. Compute attention scores: attention = softmax(Q_{n+1} * K^T / sqrt(d_k))
    3. Compute weighted value: output = attention * V
    4. Pass through the feed-forward network and layer norm
    5. Project to vocabulary logits
    6. Sample token x_{n+1} from logits using configured sampling strategy
  3. Sampling: The sampling process applies, in order:
    1. Logit bias -- Additive bias from logit_bias parameter
    2. Custom logit processors -- User-registered LogitProcessor functions
    3. Repetition/frequency/presence penalties -- Multiplicative and additive adjustments based on token occurrence history
    4. Temperature scaling -- Division of logits by temperature
    5. Top-p (nucleus) filtering -- Zeroing out tokens outside the top-p probability mass
    6. Categorical sampling -- Drawing a token from the resulting distribution
  4. Termination check: Compare generated token against stop tokens, check max_tokens, check context window overflow, and check interrupt signal.
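The sampling stages above can be sketched as pure functions over a small logit array. This is a toy version: real implementations operate on full vocabulary-sized tensors on the GPU, and this sketch omits logit bias, custom logit processors, and the multiplicative repetition penalty, keeping only the frequency/presence penalties, temperature, top-p, and categorical draw:

```typescript
// Frequency/presence penalties: subtract per-occurrence and flat offsets
// from the logits of tokens already generated.
function applyPenalties(
  logits: number[],
  counts: Map<number, number>, // token id -> occurrences so far
  freqPenalty: number,
  presencePenalty: number,
): number[] {
  return logits.map((l, id) => {
    const c = counts.get(id) ?? 0;
    return c > 0 ? l - c * freqPenalty - presencePenalty : l;
  });
}

// Temperature scaling followed by softmax into a probability distribution.
function softmaxTemp(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const m = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - m));
  const z = exps.reduce((s, x) => s + x, 0);
  return exps.map((x) => x / z);
}

// Top-p: keep the smallest set of tokens whose mass reaches topP,
// zero out the rest, renormalize.
function topPFilter(probs: number[], topP: number): number[] {
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const keep = new Set<number>();
  let mass = 0;
  for (const i of order) {
    keep.add(i);
    mass += probs[i];
    if (mass >= topP) break;
  }
  const kept = probs.map((p, i) => (keep.has(i) ? p : 0));
  const z = kept.reduce((s, x) => s + x, 0);
  return kept.map((p) => p / z);
}

// Categorical draw from the final distribution.
function sampleCategorical(probs: number[], rand: () => number): number {
  let r = rand();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}
```

Note the ordering matters: penalties act on raw logits before temperature, and top-p acts on probabilities after softmax.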

Performance Characteristics

  • Prefill throughput is measured in tokens/second for the input processing phase and is typically much higher than decode throughput due to parallelism
  • Decode throughput is measured in tokens/second for the generation phase and is bottlenecked by memory bandwidth
  • Time to first token is dominated by the prefill phase and grows with input length
  • Time per output token corresponds to a single decode step

These metrics are reported in CompletionUsage.extra as prefill_tokens_per_s, decode_tokens_per_s, time_to_first_token_s, and time_per_output_token_s.
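The relationship between these four metrics and the raw token counts and phase wall times can be sketched as follows. The field names match CompletionUsage.extra, but the function and its inputs are hypothetical, and the simplification that the first token appears exactly when prefill finishes ignores any overhead outside the two phases:

```typescript
// Derive the four reported metrics from token counts and phase durations.

interface UsageExtra {
  prefill_tokens_per_s: number;
  decode_tokens_per_s: number;
  time_to_first_token_s: number;
  time_per_output_token_s: number;
}

function computeUsageExtra(
  promptTokens: number,      // tokens processed during prefill
  completionTokens: number,  // tokens generated during decode
  prefillSeconds: number,    // wall time of the prefill phase
  decodeSeconds: number,     // wall time of the decode phase
): UsageExtra {
  return {
    prefill_tokens_per_s: promptTokens / prefillSeconds,
    decode_tokens_per_s: completionTokens / decodeSeconds,
    // First output token is produced at the end of prefill.
    time_to_first_token_s: prefillSeconds,
    time_per_output_token_s: decodeSeconds / completionTokens,
  };
}
```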
