# Heuristic: MLC-AI WebLLM Multi-Round KV Cache Reuse
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Performance |
| Last Updated | 2026-02-14 22:00 GMT |
## Overview
Automatic optimization that detects multi-round chat patterns and preserves the KV cache between turns, avoiding redundant prefill of previously processed tokens.
## Description
WebLLM's chat completion API is completely functional in behavior: each request is independent, and a previous request does not affect the current request's result. Users must maintain and pass the full conversation history with each call. However, as an implicit internal optimization, WebLLM detects when the user is performing multi-round chatting (i.e., the new `messages` array starts with the previous call's messages as a prefix) and preserves the KV cache, prefilling only the new tokens. This can save seconds of processing time per turn in long conversations.
## Usage
Use this heuristic when building multi-round chat applications. Simply pass the full growing conversation history in each `chat.completions.create()` call. The optimization is automatic and requires no configuration.
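A minimal sketch of this pattern in TypeScript. The `chatTurn` helper and the structural `ChatEngine` interface are illustrative, not part of WebLLM; the `chat.completions.create({ messages })` call and the `choices[0].message` response shape follow the real API.

```typescript
// Message shape used by the chat completions API (OpenAI-style).
type Message = { role: "system" | "user" | "assistant"; content: string };

// Minimal structural view of the engine; in a browser this would be an
// MLCEngine from @mlc-ai/web-llm (this interface is a local assumption).
interface ChatEngine {
  chat: {
    completions: {
      create(req: { messages: Message[] }): Promise<{ choices: { message: Message }[] }>;
    };
  };
}

// One conversation turn: append the user message, send the FULL history,
// append the assistant reply. Because `history` only ever grows, each
// request shares a prefix with the previous one, so WebLLM can reuse the
// KV cache and prefill only the new tokens.
async function chatTurn(
  engine: ChatEngine,
  history: Message[],
  userText: string
): Promise<string> {
  history.push({ role: "user", content: userText });
  const reply = await engine.chat.completions.create({ messages: history });
  const assistantMsg = reply.choices[0].message;
  history.push(assistantMsg); // keep the reply so the next turn's prefix matches
  return assistantMsg.content;
}
```

In a real page you would typically create the engine with `CreateMLCEngine(modelId)` and call a helper like this once per user input, never resetting the history mid-conversation.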
## The Insight (Rule of Thumb)
- Action: Always pass the complete conversation history (all previous messages + new user message) in each request. Do not try to manually manage which tokens have been processed.
- Value: Automatic KV cache reuse means only new tokens are prefilled. For a 10-turn conversation with 500 tokens per turn, turn 10 only prefills ~500 tokens instead of ~5000.
- Trade-off: Essentially none; this is a pure optimization as long as the conversation history grows monotonically. If you reset the conversation or edit earlier messages, the cache is invalidated and a full prefill occurs.
- Anti-pattern: Do NOT call `engine.resetChat()` between turns of the same conversation — this clears the KV cache and forces full re-prefill.
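The prefix check behind these rules can be pictured as a token-level comparison. The following is a hypothetical sketch of the idea, not WebLLM's actual implementation: it computes how many tokens genuinely need prefilling given what was prefilled last turn.

```typescript
// Hypothetical sketch: given the token ids prefilled on the previous
// request and the token ids of the new request, count how many new
// tokens must be prefilled. Tokens in the shared prefix are already in
// the KV cache and can be skipped.
function tokensToPrefill(prevTokens: number[], newTokens: number[]): number {
  let shared = 0;
  while (
    shared < prevTokens.length &&
    shared < newTokens.length &&
    prevTokens[shared] === newTokens[shared]
  ) {
    shared++;
  }
  // If the new request diverges from the old one (edited or reset
  // history), everything after the divergence point must be prefilled.
  return newTokens.length - shared;
}
```

A growing history pays only for its new suffix; a reset or edited history pays for everything after the divergence point, which is why `resetChat()` between turns is the anti-pattern.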
## Reasoning
In autoregressive LLM inference, the prefill phase processes all input tokens through the model to populate the KV cache. For multi-round chat, the first N-1 turns have already been prefilled in previous requests. By detecting that the current input starts with the same token prefix, WebLLM can skip re-processing those tokens and only prefill the delta (new user message + any new system context). This is particularly impactful for long conversations where prefill time dominates.
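To see why this matters for long conversations, compare cumulative prefilled tokens with and without reuse. This is a rough hypothetical cost model (ignoring system prompts, assistant-reply lengths, and decode cost), not a measurement of WebLLM:

```typescript
// Cumulative tokens prefilled across a conversation. Without reuse,
// turn t re-prefills the entire history of t * tokensPerTurn tokens;
// with KV cache reuse, only the new turn's tokens are prefilled.
function cumulativePrefill(turns: number, tokensPerTurn: number, reuse: boolean): number {
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    total += reuse ? tokensPerTurn : t * tokensPerTurn;
  }
  return total;
}
```

Under these assumptions, a 10-turn conversation at 500 tokens per turn prefills 5,000 tokens in total with reuse versus 27,500 without: linear versus quadratic growth in conversation length.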
API documentation from `src/types.ts:123-128`:

```ts
/**
 * @note The API is completely functional in behavior. That is, a previous
 * request would not affect the current request's result. Thus, for
 * multi-round chatting, users are responsible for maintaining the chat
 * history. With that being said, as an implicit internal optimization, if
 * we detect that the user is performing multi-round chatting, we will
 * preserve the KV cache and only prefill the new tokens.
 */
```