
Heuristic: mlc-ai/web-llm KV Cache Window Configuration

From Leeroopedia
Domains: Optimization, Memory_Management, LLMs
Last Updated: 2026-02-14 22:00 GMT

Overview

Configuration guide for choosing between context window and sliding window KV cache strategies, with attention sink sizing for memory-efficient long-context inference.

Description

WebLLM supports two mutually exclusive KV cache strategies: a context window (a standard fixed-size cache) and a sliding window (a memory-efficient rolling cache with attention sinks). Only one can be active at a time; configuring both throws a `WindowSizeConfigurationError`. A sliding window trades some long-range attention for dramatically reduced VRAM usage, enabling longer conversations on constrained devices. Additionally, `-1k` model variants (a 1024-token context window instead of the default 4096) can reduce VRAM by 4-7x.

Usage

Use this heuristic when you are VRAM constrained (getting device lost errors), need to support long conversations, or are deploying to mobile devices. Apply via `ModelRecord.overrides` when calling `engine.reload()`.
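A sketch of applying the override at reload time, assuming the standard `@mlc-ai/web-llm` entry points (`CreateMLCEngine`, `engine.reload`); the model ID is illustrative, and the snake_case fields are the `ChatConfig` names used throughout this page:

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Illustrative model ID; pick one from the prebuilt model registry.
const modelId = "Llama-3.2-1B-Instruct-q4f16_1-MLC";
const engine = await CreateMLCEngine(modelId);

// Sliding-window strategy: disable the fixed context window (-1) and cap
// the rolling cache at 1024 tokens, with no attention-sink tokens.
await engine.reload(modelId, {
  context_window_size: -1,
  sliding_window_size: 1024,
  attention_sink_size: 0,
});
```

The same three fields can equivalently be set on `ModelRecord.overrides` in a custom `appConfig`, which applies them every time that model is loaded.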

The Insight (Rule of Thumb)

  • Action 1: Choose one of `context_window_size` or `sliding_window_size`; set the other to `-1`. Never set both to positive values.
  • Action 2: If using a sliding window, you must specify `attention_sink_size >= 0` (use `0` as the default).
  • Action 3: For low-VRAM devices, use model variants with `-1k` suffix (1024 context window) instead of the default 4096.
  • Value: Reducing context from 4096 to 1024 cuts VRAM from ~2.9 GB to ~0.88 GB for Llama-3.2-1B (4-7x reduction).
  • Trade-off: Smaller context window limits maximum prompt + response length. Sliding window loses attention to early tokens beyond the window.

Reasoning

The KV cache stores the key and value tensors for every attention layer across all tokens in the sequence. Its memory scales linearly with the window size, roughly `2 * window_size * num_layers * num_kv_heads * head_dim * bytes_per_element` (the factor of 2 covers keys and values). Reducing the window size (via `-1k` variants) or using a sliding window (which caps the stored token count) shrinks the KV cache allocation proportionally. The hardcoded page size of 16 tokens in PagedKVCache means memory is allocated in 16-token pages, so a smaller window translates directly into fewer allocated pages.
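The scaling above can be checked with a back-of-envelope estimator (a hypothetical helper, not a WebLLM API; the model shape below assumes Llama-3.2-1B's architecture: 16 layers, 8 KV heads, head dimension 64, f16 elements):

```typescript
// Rough KV cache footprint: keys and values (factor of 2) for every layer,
// token, KV head, and head dimension, at the element width of the dtype.
function kvCacheBytes(
  numLayers: number,
  windowSize: number,
  numKvHeads: number,
  headDim: number,
  bytesPerElem: number,
): number {
  return 2 * numLayers * windowSize * numKvHeads * headDim * bytesPerElem;
}

// Llama-3.2-1B-like shape: 16 layers, 8 KV heads, head dim 64, f16 (2 bytes).
const full = kvCacheBytes(16, 4096, 8, 64, 2); // 4k context window
const small = kvCacheBytes(16, 1024, 8, 64, 2); // 1k (-1k variant) window
console.log(full / 1024 / 1024, small / 1024 / 1024); // 128 vs 32 MiB
```

The 4x drop in cache size mirrors the 4x drop in window size; the total VRAM figures in the registry shrink by less than 4x because model weights dominate and are unaffected by the window.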

VRAM evidence from model registry:

  • Llama-3.2-1B (4k context, q4f32): 1128.82 MB
  • Llama-3.2-1B (4k context, q4f16): 879.04 MB
  • Llama-3.1-8B (4k context, q4f32): 6101.01 MB
  • Llama-3.1-8B (4k context, q4f16): 5001.0 MB

KV cache configuration validation from `src/llm_chat.ts:256-278`:

this.slidingWindowSize = config.sliding_window_size;
this.contextWindowSize = config.context_window_size;
this.attentionSinkSize = config.attention_sink_size;
if (this.contextWindowSize !== -1 && this.slidingWindowSize !== -1) {
  throw new WindowSizeConfigurationError(
    this.contextWindowSize,
    this.slidingWindowSize,
  );
} else if (this.slidingWindowSize != -1) {
  log.info("Using slidingWindowSize: ", this.slidingWindowSize);
  if (this.attentionSinkSize >= 0) {
    log.info("Using attentionSinkSize: ", this.attentionSinkSize);
  } else {
    throw new AttentionSinkSizeError();
  }
} else if (this.contextWindowSize != -1) {
  log.info("Using contextWindowSize: ", this.contextWindowSize);
} else {
  throw new WindowSizeSpecificationError();
}
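The branching above reduces to a small decision table. A standalone sketch of the same rule (hypothetical helper and error messages, not the WebLLM implementation):

```typescript
// Decision table for the two mutually exclusive KV cache strategies.
// A value of -1 means "disabled" for either window size.
type WindowConfig = {
  contextWindowSize: number; // -1 = disabled
  slidingWindowSize: number; // -1 = disabled
  attentionSinkSize: number; // must be >= 0 when sliding window is active
};

function resolveStrategy(cfg: WindowConfig): "sliding" | "context" {
  if (cfg.contextWindowSize !== -1 && cfg.slidingWindowSize !== -1) {
    // Both set: invalid (WindowSizeConfigurationError in WebLLM).
    throw new Error("Set only one of contextWindowSize / slidingWindowSize");
  }
  if (cfg.slidingWindowSize !== -1) {
    if (cfg.attentionSinkSize < 0) {
      // Sliding window requires a non-negative sink size.
      throw new Error("attentionSinkSize must be >= 0 with a sliding window");
    }
    return "sliding";
  }
  if (cfg.contextWindowSize !== -1) {
    return "context";
  }
  // Neither set: invalid (WindowSizeSpecificationError in WebLLM).
  throw new Error("Specify one of contextWindowSize / slidingWindowSize");
}
```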

PagedKVCache creation with hardcoded page size from `src/llm_chat.ts:303-319`:

const defaultPageSize = 16;
const defaultMaxNumSequence = 1;
const maxTotalSeqLen =
  this.slidingWindowSize != -1
    ? this.slidingWindowSize
    : this.contextWindowSize;
this.kvCache = this.tvm.detachFromCurrentScope(
  fcreateCache(
    this.tvm.makeShapeTuple([defaultMaxNumSequence]),
    this.tvm.makeShapeTuple([maxTotalSeqLen]),
    this.tvm.makeShapeTuple([this.prefillChunkSize]),
    this.tvm.makeShapeTuple([defaultPageSize]),
    this.tvm.makeShapeTuple([this.slidingWindowSize != -1 ? 1 : 0]),
  ),
);
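Because `defaultPageSize` is fixed at 16, cache capacity is granted in whole 16-token pages. A quick sketch of the implied rounding (assumed behavior based on the hardcoded page size above):

```typescript
// Pages needed to cover maxTotalSeqLen tokens at a fixed 16-token page size.
const PAGE_SIZE = 16;

function pagesFor(maxTotalSeqLen: number): number {
  // A partial page is still allocated in full, hence the ceil.
  return Math.ceil(maxTotalSeqLen / PAGE_SIZE);
}

console.log(pagesFor(4096)); // 256 pages for a 4k context window
console.log(pagesFor(1024)); // 64 pages for a -1k variant
```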
