
Heuristic:Turboderp org Exllamav2 Paged Cache Configuration

From Leeroopedia
Knowledge Sources
Domains Memory_Management, Inference_Optimization
Last Updated 2026-02-15 00:00 GMT

Overview

Configuration rules for the paged KV cache system: the page size is fixed at 256 tokens, the cache must be created with batch_size=1, and the Q4 cache is preferred over FP8 (which is unsupported) for the dynamic generator.

Description

The ExLlamaV2 dynamic generator uses a paged attention system where the KV cache is divided into fixed-size pages of 256 tokens. This design enables virtual memory-style cache management with prefix deduplication and dynamic allocation. Several non-obvious constraints govern how the cache must be configured for correct operation.

Usage

Apply these rules when configuring ExLlamaV2DynamicGenerator with any cache variant. Violating these constraints causes assertion failures or degraded performance.

The Insight (Rule of Thumb)

  • Action: Set cache `batch_size=1` when using the dynamic generator, even for concurrent multi-sequence inference.
  • Value: Page size is fixed at 256 tokens. Both `max_seq_len` and `max_chunk_size` must be multiples of 256.
  • Action: Use `ExLlamaV2Cache_Q4` (4-bit) instead of `ExLlamaV2Cache_8bit` (FP8) for the dynamic generator.
  • Trade-off: Q4 provides 4x memory reduction over FP16 with minimal quality loss. FP8 is explicitly unsupported and raises an assertion error.
  • Note: Cache defragmentation is automatic, but it only triggers when more than 10% of pages are fragmented and all pages have been accessed since the last defrag.
  • Trade-off: Defragmentation involves GPU memory copies; the 10% threshold avoids wasteful defrags for minor fragmentation.
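Taken together, the rules above suggest a setup along these lines. This is a sketch assuming the public exllamav2 API (class names as in the project's examples); the model path and sequence length are placeholders, not a definitive configuration:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/model")   # placeholder path
config.max_seq_len = 8192                    # multiple of the 256-token page size
model = ExLlamaV2(config)

# The cache stays at its default batch_size=1: the dynamic generator
# batches through page allocation, not the cache's batch dimension.
# Q4 is preferred; ExLlamaV2Cache_8bit would raise an assertion error.
cache = ExLlamaV2Cache_Q4(model, max_seq_len=config.max_seq_len, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
```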

Reasoning

The paged attention system manages batching internally through page allocation rather than through the cache's batch dimension. This is why `batch_size=1` is required despite supporting many concurrent sequences. The 256-token page size is a balance between fragmentation overhead (smaller pages = more metadata) and memory waste (larger pages = more unused tokens per page). Q4 cache support is implemented through the Flash Attention paged API, but FP8 quantization has different block alignment requirements that are incompatible with the paging system.
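As a minimal illustration of the alignment constraint (pure Python, not the library's own validation code), a cache length can be checked against the fixed page size like this:

```python
PAGE_SIZE = 256  # fixed page size used by the dynamic generator

def validate_cache_len(max_seq_len: int) -> int:
    """Return the page count for a cache length, rejecting unaligned sizes."""
    if max_seq_len % PAGE_SIZE != 0:
        raise ValueError(f"max_seq_len must be a multiple of {PAGE_SIZE}, got {max_seq_len}")
    return max_seq_len // PAGE_SIZE
```

The same check applies to `max_chunk_size`, since both quantities are carved into whole pages.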

Page deduplication uses blake2b hashing with 16-byte digests, where each page's hash chains to the previous page's hash. This enables efficient prefix matching across jobs sharing common system prompts.
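A sketch of the chained-hash idea follows; the token serialization used here is an illustrative assumption, not the library's exact encoding:

```python
import hashlib

DIGEST_SIZE = 16  # 16-byte blake2b digests, as described above

def page_hash(prev_hash: bytes, token_ids: list[int]) -> bytes:
    # Chaining the previous page's hash in means two pages can only match
    # when their entire prefixes match, which is what makes shared system
    # prompts deduplicate across jobs.
    h = hashlib.blake2b(digest_size=DIGEST_SIZE)
    h.update(prev_hash)
    # Illustrative fixed-width serialization of the page's token IDs (an assumption)
    h.update(b"".join(t.to_bytes(4, "little") for t in token_ids))
    return h.digest()
```

Two jobs hashing identical first pages from the same root value produce identical digests, so their pages can be shared; the first divergent token changes every digest downstream.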

From `exllamav2/generator/dynamic.py:33`:

PAGED_PAGE_SIZE = 256

From `exllamav2/generator/dynamic.py:386-387`:

assert cache.batch_size == 1, "DynamicGenerator requires cache to have batch_size = 1"

From `exllamav2/generator/dynamic.py:359-360`:

assert not isinstance(cache, ExLlamaV2Cache_8bit), \
    "Dynamic generator does not currently work with 8-bit cache. Use either FP16 or Q4."

From `exllamav2/generator/dynamic.py:1442-1445`:

# Don't bother if less than 10% of cache is fragmented
if len(defrag_map) <= self.max_pages // 10:
    return
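The quoted check reduces to a simple predicate; a sketch, mirroring the 10% threshold:

```python
def should_defrag(fragmented_pages: int, max_pages: int) -> bool:
    # Skip defragmentation unless strictly more than 10% of the cache's
    # pages are fragmented (the quoted code returns early at <= max_pages // 10).
    return fragmented_pages > max_pages // 10
```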
