Implementation:Turboderp org Exllamav2 ExLlamaV2Cache

From Leeroopedia
Knowledge Sources
Domains Memory_Management, Inference_Optimization, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tool for allocating and managing key-value cache tensors for transformer inference, provided by exllamav2.

Description

ExLlamaV2Cache (and its quantized variants) allocates GPU memory to store key and value projection tensors for all transformer layers across the sequence length. The cache is indexed by layer and sequence position, and grows as tokens are generated during autoregressive decoding.
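The size of that allocation can be estimated from the model dimensions alone. The sketch below is a back-of-envelope calculation, not exllamav2 code, and the parameter names are illustrative rather than actual config fields:

```python
def kv_cache_bytes(num_layers, batch_size, max_seq_len,
                   num_kv_heads, head_dim, bytes_per_element=2):
    """Estimate KV cache size: keys plus values for every layer,
    one element per (batch, position, kv_head, head_dim) slot."""
    per_layer = 2 * batch_size * max_seq_len * num_kv_heads * head_dim * bytes_per_element
    return num_layers * per_layer

# Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes)
print(kv_cache_bytes(32, 1, 4096, 32, 128) / 2**30)  # → 2.0 (GiB)
```

Models with grouped-query attention shrink this proportionally, since `num_kv_heads` is smaller than the number of query heads.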

The class hierarchy provides four precision levels:

  • ExLlamaV2Cache (FP16): Full precision, at exllamav2/cache.py:L235-256
  • ExLlamaV2Cache_Q4: 4-bit quantized at exllamav2/cache.py:L586-606
  • ExLlamaV2Cache_Q6: 6-bit quantized at exllamav2/cache.py:L611-631
  • ExLlamaV2Cache_Q8: 8-bit quantized at exllamav2/cache.py:L636-656

All variants share a common base class (ExLlamaV2CacheBase) that manages cache metadata, sequence position tracking, and the current_seq_len counter.

When lazy=True is specified, the cache records its shape requirements without allocating GPU memory. The actual allocation is deferred until model.load_autosplit() places each cache tensor on the same GPU as its corresponding model layer.
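The deferral amounts to a record-now, allocate-later pattern. This standalone sketch mimics the idea without exllamav2; the class and attribute names are illustrative, not the library's internals:

```python
import math

class LazyTensorSlot:
    """Record a cache tensor's shape up front; materialize storage only
    once a device is chosen (the idea behind lazy=True + load_autosplit)."""
    def __init__(self, shape):
        self.shape = tuple(shape)   # shape requirement recorded immediately
        self.storage = None         # no memory allocated yet

    def materialize(self, device):
        # exllamav2 performs the real equivalent of this step when
        # load_autosplit() places the cache tensor on the same GPU
        # as its transformer layer.
        self.storage = {"device": device, "numel": math.prod(self.shape)}
        return self.storage

slot = LazyTensorSlot((1, 4096, 8, 128))  # batch, seq, kv_heads, head_dim
assert slot.storage is None               # shape known, nothing allocated
slot.materialize("cuda:0")                # allocation happens here
```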

Usage

Construct the cache after creating the model object; for auto-split loading, create it with lazy=True before loading weights so model.load_autosplit() can place each cache tensor. Choose the variant based on available VRAM:

  • ExLlamaV2Cache for maximum quality when memory allows
  • ExLlamaV2Cache_Q8 for a quality-preserving 2x memory reduction
  • ExLlamaV2Cache_Q4 for maximum memory savings (4x reduction)
  • Always use lazy=True when calling model.load_autosplit()
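The relative sizes behind those trade-offs fall out of the bit width alone. This is a back-of-envelope sketch; real quantized layouts also store scale factors, so actual savings are slightly smaller than the pure ratios:

```python
FP16_BITS = 16

def relative_size(bits):
    """Cache size of a quantized variant relative to FP16."""
    return bits / FP16_BITS

for name, bits in [("FP16", 16), ("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    print(f"{name}: {relative_size(bits):.0%} of FP16")
# Q8 halves the cache and Q4 quarters it, matching the
# 2x and 4x reduction figures above.
```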

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/cache.py
  • Lines: L235-256 (ExLlamaV2Cache FP16), L586-606 (Q4), L611-631 (Q6), L636-656 (Q8)

Signature

class ExLlamaV2Cache(ExLlamaV2CacheBase):

    def __init__(
        self,
        model,
        batch_size: int = 1,
        max_seq_len: int = -1,
        copy_from: ExLlamaV2CacheBase | None = None,
        lazy: bool = False,
        num_key_value_heads: int | None = None,
        fixed_device: torch.device | None = None,
    ):
        ...

Import

from exllamav2 import ExLlamaV2Cache

# Quantized variants:
from exllamav2 import ExLlamaV2Cache_Q4
from exllamav2 import ExLlamaV2Cache_Q6
from exllamav2 import ExLlamaV2Cache_Q8

I/O Contract

Inputs

Name | Type | Required | Description
model | ExLlamaV2 | Yes | Initialized ExLlamaV2 model instance (config must be prepared)
batch_size | int | No (default 1) | Number of sequences to cache simultaneously
max_seq_len | int | No (default -1) | Maximum sequence length; -1 uses the model's default from config
copy_from | ExLlamaV2CacheBase | No (default None) | Copy cache contents from another cache instance
lazy | bool | No (default False) | Defer memory allocation for use with load_autosplit()
num_key_value_heads | int | No (default None) | Override number of KV heads; None uses the model config
fixed_device | torch.device | No (default None) | Force all cache tensors onto a specific device

Outputs

Name | Type | Description
cache (instance) | ExLlamaV2Cache | Cache object with KV storage for all layers, ready for inference
cache.current_seq_len | int | Current number of cached tokens (starts at 0)
cache.max_seq_len | int | Maximum cacheable sequence length
cache.batch_size | int | Number of concurrent sequences supported
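The contract on current_seq_len can be pictured with a schematic stand-in (not the real class): the counter starts at 0, advances as tokens are cached, and may never exceed max_seq_len:

```python
class CacheCounter:
    """Minimal mock of the cache's sequence-position bookkeeping."""
    def __init__(self, max_seq_len):
        self.max_seq_len = max_seq_len
        self.current_seq_len = 0   # starts at 0, per the contract above

    def append_tokens(self, n):
        if self.current_seq_len + n > self.max_seq_len:
            raise ValueError("cache full: cannot exceed max_seq_len")
        self.current_seq_len += n

c = CacheCounter(max_seq_len=8)
c.append_tokens(5)         # e.g. prompt ingestion
c.append_tokens(1)         # one autoregressively decoded token
print(c.current_seq_len)   # → 6
```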

Usage Examples

Basic FP16 Cache

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config("/path/to/model")
config.prepare()

model = ExLlamaV2(config)
model.load()  # weights must be loaded before a non-lazy cache allocation

# Allocate FP16 cache with default sequence length
cache = ExLlamaV2Cache(model, batch_size=1)

Lazy Cache for Auto-Split

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config("/path/to/model")
config.prepare()

model = ExLlamaV2(config)

# Lazy allocation - shapes recorded, no GPU memory allocated yet
cache = ExLlamaV2Cache(model, lazy=True)

# Auto-split loads model and allocates cache across GPUs
model.load_autosplit(cache)

Quantized Cache for Memory Savings

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

config = ExLlamaV2Config("/path/to/model")
config.prepare()

model = ExLlamaV2(config)

# Q4 cache uses ~25% of FP16 memory
cache = ExLlamaV2Cache_Q4(model, lazy=True, max_seq_len=8192)
model.load_autosplit(cache)
