Implementation: turboderp-org/exllamav2 ExLlamaV2Cache
| Knowledge Sources | |
|---|---|
| Domains | Memory_Management, Inference_Optimization, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A concrete tool, provided by exllamav2, for allocating and managing the key-value cache tensors used during transformer inference.
Description
ExLlamaV2Cache (and its quantized variants) allocates GPU memory to store key and value projection tensors for all transformer layers across the sequence length. The cache is indexed by layer and sequence position, and grows as tokens are generated during autoregressive decoding.
The class hierarchy provides four precision levels:
- ExLlamaV2Cache (FP16): Full precision, base class at exllamav2/cache.py:L235-256
- ExLlamaV2Cache_Q4: 4-bit quantized at exllamav2/cache.py:L586-606
- ExLlamaV2Cache_Q6: 6-bit quantized at exllamav2/cache.py:L611-631
- ExLlamaV2Cache_Q8: 8-bit quantized at exllamav2/cache.py:L636-656
All variants share a common base class (ExLlamaV2CacheBase) that manages cache metadata, sequence position tracking, and the current_seq_len counter.
When lazy=True is specified, the cache records its shape requirements without allocating GPU memory. The actual allocation is deferred until model.load_autosplit() places each cache tensor on the same GPU as its corresponding model layer.
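The layer-by-position indexing and the shared current_seq_len counter described above can be sketched in plain Python. This is an illustrative mock, not the library's internals: the class name, list-based storage, and method names below are simplified stand-ins for the real tensor-backed implementation.

```python
# Toy model of a per-layer KV cache: entries are indexed by layer and
# sequence position, and a shared counter tracks how many positions are
# filled (mirroring ExLlamaV2CacheBase's current_seq_len).

class ToyKVCache:
    def __init__(self, num_layers: int, max_seq_len: int):
        self.max_seq_len = max_seq_len
        self.current_seq_len = 0
        # one (keys, values) list per layer, indexed by sequence position
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer: int, k, v):
        # store this layer's key/value projections for the next position
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def advance(self, num_tokens: int = 1):
        # once every layer has written a position, the shared counter moves
        assert self.current_seq_len + num_tokens <= self.max_seq_len
        self.current_seq_len += num_tokens

cache = ToyKVCache(num_layers=2, max_seq_len=4)
for _ in range(3):                    # "generate" three tokens
    for layer in range(2):
        cache.append(layer, k=0.0, v=0.0)
    cache.advance()
print(cache.current_seq_len)          # 3
```

In the real cache the per-layer storage is a preallocated GPU tensor rather than a growing list, which is why max_seq_len must be fixed up front.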
Usage
Use ExLlamaV2Cache immediately after creating the model object and before loading weights. Choose the variant based on available VRAM:
- ExLlamaV2Cache for maximum quality when memory allows
- ExLlamaV2Cache_Q8 for a quality-preserving 2x memory reduction
- ExLlamaV2Cache_Q6 for a middle ground between Q8 and Q4 (roughly 2.7x reduction)
- ExLlamaV2Cache_Q4 for maximum memory savings (4x reduction)
- Always construct the cache with lazy=True when the model will be loaded via model.load_autosplit()
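To guide the choice between variants, the cache footprint can be estimated from the model dimensions: 2 (keys and values) x layers x KV heads x head dim x sequence length x batch size x bytes per element. The helper below is a back-of-envelope sketch; the example dimensions (a Llama-2-7B-like config) are illustrative assumptions, not values read from exllamav2.

```python
# Rough KV-cache sizing. Quantized caches also store per-group scales,
# so real usage is slightly higher than this estimate.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_seq_len, batch_size, bits_per_element):
    # 2x accounts for storing both keys and values
    return (2 * num_layers * num_kv_heads * head_dim
            * max_seq_len * batch_size * bits_per_element) // 8

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head dim 128
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128,
           max_seq_len=4096, batch_size=1)
for name, bits in [("FP16", 16), ("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    gib = kv_cache_bytes(**cfg, bits_per_element=bits) / 2**30
    print(f"{name}: {gib:.2f} GiB")
# FP16: 2.00 GiB, Q8: 1.00 GiB, Q6: 0.75 GiB, Q4: 0.50 GiB
```

Note that models using grouped-query attention have far fewer KV heads than attention heads, which shrinks the cache accordingly.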
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/cache.py
- Lines: L235-256 (ExLlamaV2Cache FP16), L586-606 (Q4), L611-631 (Q6), L636-656 (Q8)
Signature
class ExLlamaV2Cache(ExLlamaV2CacheBase):

    def __init__(
        self,
        model,
        batch_size: int = 1,
        max_seq_len: int = -1,
        copy_from: ExLlamaV2CacheBase | None = None,
        lazy: bool = False,
        num_key_value_heads: int | None = None,
        fixed_device: torch.device | None = None,
    ):
        ...
Import
from exllamav2 import ExLlamaV2Cache
# Quantized variants:
from exllamav2 import ExLlamaV2Cache_Q4
from exllamav2 import ExLlamaV2Cache_Q6
from exllamav2 import ExLlamaV2Cache_Q8
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | Initialized ExLlamaV2 model instance (config must be prepared) |
| batch_size | int | No (default 1) | Number of sequences to cache simultaneously |
| max_seq_len | int | No (default -1) | Maximum sequence length; -1 uses model's default from config |
| copy_from | ExLlamaV2CacheBase | No (default None) | Copy cache contents from another cache instance |
| lazy | bool | No (default False) | Defer memory allocation for use with load_autosplit() |
| num_key_value_heads | int | No (default None) | Override number of KV heads; None uses model config |
| fixed_device | torch.device | No (default None) | Force all cache tensors onto a specific device |
Outputs
| Name | Type | Description |
|---|---|---|
| cache instance | ExLlamaV2Cache | Cache object with KV storage for all layers, ready for inference |
| cache.current_seq_len | int | Current number of cached tokens (starts at 0) |
| cache.max_seq_len | int | Maximum cacheable sequence length |
| cache.batch_size | int | Number of concurrent sequences supported |
Usage Examples
Basic FP16 Cache
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
# Allocate FP16 cache with default sequence length
cache = ExLlamaV2Cache(model, batch_size=1)
Lazy Cache for Auto-Split
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
# Lazy cache: shapes are recorded now, GPU allocation is deferred
cache = ExLlamaV2Cache(model, lazy=True)
# Auto-split loads model and allocates cache across GPUs
model.load_autosplit(cache)
Quantized Cache for Memory Savings
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
# Q4 cache uses ~25% of FP16 memory
cache = ExLlamaV2Cache_Q4(model, lazy=True, max_seq_len=8192)
model.load_autosplit(cache)