
Principle:Turboderp org Exllamav2 KV Cache Allocation

From Leeroopedia
Domains Memory_Management, Inference_Optimization, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

Transformer inference uses a key-value (KV) cache to store previously computed attention projections, avoiding redundant recomputation during autoregressive token generation.

Description

In autoregressive text generation, each new token attends to all previous tokens in the sequence. Without caching, generating the N-th token would require recomputing the key and value projections for all N-1 preceding tokens at every layer, resulting in O(N^2) total projection computation over a full sequence, versus O(N) with a cache.

The KV cache stores the key (K) and value (V) projection tensors for each transformer layer at each sequence position. When generating a new token, only the new token's K and V vectors are computed and appended to the cache, while all previous K and V vectors are read directly from memory.
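This append-on-decode pattern can be shown in a minimal, framework-free sketch (hypothetical class and function names for illustration, not ExLlamaV2's API; real implementations use preallocated GPU tensors rather than Python lists):

```python
class LayerKVCache:
    """Per-layer cache holding one K and one V vector per generated position."""
    def __init__(self):
        self.keys = []    # K vectors for all cached positions
        self.values = []  # V vectors for all cached positions

    def append(self, k_vec, v_vec):
        self.keys.append(k_vec)
        self.values.append(v_vec)

    def __len__(self):
        return len(self.keys)

def decode_step(cache, new_token_k, new_token_v):
    # Only the new token's K/V projections are computed each step;
    # all earlier positions are read back from the cache.
    cache.append(new_token_k, new_token_v)
    return cache.keys, cache.values  # attention consumes the full history
```

Each decode step does O(1) projection work, while attention still reads the entire cached history.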

ExLlamaV2 supports multiple cache precision levels:

  • FP16 (ExLlamaV2Cache): Full 16-bit floating-point precision. Maximum quality with no quantization artifacts, but highest memory consumption.
  • Q8 (ExLlamaV2Cache_Q8): 8-bit quantized cache. Approximately half the memory of FP16 with minimal quality degradation.
  • Q6 (ExLlamaV2Cache_Q6): 6-bit quantized cache. Further memory savings with slightly more quality loss.
  • Q4 (ExLlamaV2Cache_Q4): 4-bit quantized cache. Maximum memory savings (quarter of FP16) but with noticeable quality trade-offs, particularly for long sequences.

The cache can be allocated lazily: tensor shapes are recorded up front, but no memory is allocated until the model is loaded. This is essential for the auto-split feature, which distributes model layers across multiple GPUs and must place each cache tensor on the same device as its corresponding layer.

Usage

KV cache allocation is required for any inference operation. Choose the cache precision based on your memory constraints:

  • Use FP16 when VRAM is sufficient and maximum quality is desired
  • Use Q8 as a good balance of memory savings and quality
  • Use Q4/Q6 when fitting larger models or longer contexts into limited VRAM
  • Use lazy=True when using load_autosplit() for multi-GPU deployment
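Putting these guidelines together, a typical loading sequence looks roughly like the following (a sketch based on the class names above; the model directory is a placeholder and exact constructor arguments may vary between ExLlamaV2 versions):

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_Q8,  # swap in ExLlamaV2Cache / _Q6 / _Q4 per the guidelines above
)

config = ExLlamaV2Config("/path/to/quantized-model")  # placeholder path
model = ExLlamaV2(config)

# lazy=True defers allocation so auto-split can co-locate each cache
# tensor with its layer across the available GPUs
cache = ExLlamaV2Cache_Q8(model, lazy=True)
model.load_autosplit(cache)
```

Running this requires an ExLlamaV2 installation and a quantized model on disk.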

Theoretical Basis

For a transformer with L layers, h key-value heads, head dimension d, and sequence length S, the KV cache memory requirement is:

# FP16 cache memory per sequence:
memory = 2 * L * h * d * S * 2 bytes
       = 4 * L * h * d * S bytes

# Where:
#   2 = key + value tensors
#   L = number of layers
#   h = number of key-value heads
#   d = head dimension
#   S = sequence length
#   2 bytes = FP16 storage per element

# Example: Llama 70B (80 layers, 8 KV heads, 128 head dim, 4096 seq len)
# memory = 4 * 80 * 8 * 128 * 4096 = ~1.34 GB per sequence

# Quantized cache reduces by compression ratio:
# Q8:  memory_fp16 / 2
# Q6:  memory_fp16 / 2.67
# Q4:  memory_fp16 / 4
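The formula and the Llama 70B example above can be checked with a few lines of Python (a hypothetical helper for illustration, not part of ExLlamaV2):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2.0):
    """KV cache size in bytes: 2 (K + V) * layers * heads * dim * seq * element size."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 70B: 80 layers, 8 KV heads, head dim 128, 4096-token context
fp16 = kv_cache_bytes(80, 8, 128, 4096)        # 1,342,177,280 bytes (~1.34 GB)
q8   = kv_cache_bytes(80, 8, 128, 4096, 1.0)   # half of FP16
q4   = kv_cache_bytes(80, 8, 128, 4096, 0.5)   # quarter of FP16
```

Passing the effective bytes per element (1.0 for Q8, 0.75 for Q6, 0.5 for Q4) reproduces the compression ratios listed above.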

Lazy Allocation for Multi-GPU

function allocate_cache(model, lazy):
    if lazy:
        # Record cache shape requirements but defer allocation
        for layer in model.layers:
            layer.cache_shape = compute_cache_shape(layer)
        # Actual allocation happens during model.load_autosplit(),
        # which places each cache tensor on the same GPU as its layer
    else:
        # Immediately allocate all cache tensors on each layer's device
        for layer in model.layers:
            shape = compute_cache_shape(layer)
            layer.key_cache = allocate_tensor(shape, layer.device)
            layer.value_cache = allocate_tensor(shape, layer.device)
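The two paths above can be mimicked in plain Python (illustrative stand-ins for the layer, shape, and tensor objects, not ExLlamaV2 internals):

```python
class FakeLayer:
    """Stand-in for a transformer layer with an assigned device."""
    def __init__(self, device):
        self.device = device
        self.cache_shape = None
        self.key_cache = None
        self.value_cache = None

def compute_cache_shape(layer, seq_len=4096, kv_heads=8, head_dim=128):
    # Per-layer K (or V) cache shape: one vector per position per KV head
    return (seq_len, kv_heads, head_dim)

def allocate_tensor(shape, device):
    # Stand-in for a real GPU tensor allocation
    return {"shape": shape, "device": device}

def allocate_cache(layers, lazy):
    for layer in layers:
        shape = compute_cache_shape(layer)
        if lazy:
            # Record the requirement only; allocation happens at load
            # time, once the layer's final device is known
            layer.cache_shape = shape
        else:
            layer.key_cache = allocate_tensor(shape, layer.device)
            layer.value_cache = allocate_tensor(shape, layer.device)
```

In the lazy path no memory is touched until load time; in the eager path each layer's K and V tensors land on that layer's device immediately.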
