Implementation: turboderp-org/exllamav2 ExLlamaV2Cache
| Knowledge Sources | |
|---|---|
| Domains | Memory_Management, Inference_Optimization, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A concrete tool, provided by exllamav2, for allocating and managing the key-value cache tensors used during transformer inference.
Description
ExLlamaV2Cache (and its quantized variants) allocates GPU memory to store key and value projection tensors for all transformer layers across the sequence length. The cache is indexed by layer and sequence position, and grows as tokens are generated during autoregressive decoding.
The class hierarchy provides four precision levels:
- ExLlamaV2Cache (FP16): Full precision, base class at exllamav2/cache.py:L235-256
- ExLlamaV2Cache_Q4: 4-bit quantized at exllamav2/cache.py:L586-606
- ExLlamaV2Cache_Q6: 6-bit quantized at exllamav2/cache.py:L611-631
- ExLlamaV2Cache_Q8: 8-bit quantized at exllamav2/cache.py:L636-656
All variants share a common base class (ExLlamaV2CacheBase) that manages cache metadata, sequence position tracking, and the current_seq_len counter.
When lazy=True is specified, the cache records its shape requirements without allocating GPU memory. The actual allocation is deferred until model.load_autosplit() places each cache tensor on the same GPU as its corresponding model layer.
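The layer-by-position indexing and the shared current_seq_len counter described above can be sketched in plain Python. This is an illustrative mock, not the library's internals: the class name, list-based storage, and method names below are simplified stand-ins for the real tensor-backed implementation.

```python
# Toy model of a per-layer KV cache: entries are indexed by layer and
# sequence position, and a shared counter tracks how many positions are
# filled (mirroring ExLlamaV2CacheBase's current_seq_len).

class ToyKVCache:
    def __init__(self, num_layers: int, max_seq_len: int):
        self.max_seq_len = max_seq_len
        self.current_seq_len = 0
        # one (keys, values) list per layer, indexed by sequence position
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer: int, k, v):
        # store this layer's key/value projections for the next position
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def advance(self, num_tokens: int = 1):
        # once every layer has written a position, the shared counter moves
        assert self.current_seq_len + num_tokens <= self.max_seq_len
        self.current_seq_len += num_tokens

cache = ToyKVCache(num_layers=2, max_seq_len=4)
for _ in range(3):                    # "generate" three tokens
    for layer in range(2):
        cache.append(layer, k=0.0, v=0.0)
    cache.advance()
print(cache.current_seq_len)          # 3
```

In the real cache the per-layer storage is a preallocated GPU tensor rather than a growing list, which is why max_seq_len must be fixed up front.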
Usage
Use ExLlamaV2Cache immediately after creating the model object and before loading weights. Choose the variant based on available VRAM:
- ExLlamaV2Cache for maximum quality when memory allows
- ExLlamaV2Cache_Q8 for a quality-preserving 2x memory reduction
- ExLlamaV2Cache_Q6 for a middle ground between Q8 and Q4 (roughly 2.7x reduction)
- ExLlamaV2Cache_Q4 for maximum memory savings (4x reduction)
- Always construct the cache with lazy=True when the model will be loaded via model.load_autosplit()
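To guide the choice between variants, the cache footprint can be estimated from the model dimensions: 2 (keys and values) x layers x KV heads x head dim x sequence length x batch size x bytes per element. The helper below is a back-of-envelope sketch; the example dimensions (a Llama-2-7B-like config) are illustrative assumptions, not values read from exllamav2.

```python
# Rough KV-cache sizing. Quantized caches also store per-group scales,
# so real usage is slightly higher than this estimate.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_seq_len, batch_size, bits_per_element):
    # 2x accounts for storing both keys and values
    return (2 * num_layers * num_kv_heads * head_dim
            * max_seq_len * batch_size * bits_per_element) // 8

# Hypothetical 7B-class config: 32 layers, 32 KV heads, head dim 128
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128,
           max_seq_len=4096, batch_size=1)
for name, bits in [("FP16", 16), ("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    gib = kv_cache_bytes(**cfg, bits_per_element=bits) / 2**30
    print(f"{name}: {gib:.2f} GiB")
# FP16: 2.00 GiB, Q8: 1.00 GiB, Q6: 0.75 GiB, Q4: 0.50 GiB
```

Note that models using grouped-query attention have far fewer KV heads than attention heads, which shrinks the cache accordingly.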
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/cache.py
- Lines: L235-256 (ExLlamaV2Cache FP16), L586-606 (Q4), L611-631 (Q6), L636-656 (Q8)
Signature
class ExLlamaV2Cache(ExLlamaV2CacheBase):

    def __init__(
        self,
        model,
        batch_size: int = 1,
        max_seq_len: int = -1,
        copy_from: ExLlamaV2CacheBase | None = None,
        lazy: bool = False,
        num_key_value_heads: int | None = None,
        fixed_device: torch.device | None = None,
    ):
        ...
Import
from exllamav2 import ExLlamaV2Cache
# Quantized variants:
from exllamav2 import ExLlamaV2Cache_Q4
from exllamav2 import ExLlamaV2Cache_Q6
from exllamav2 import ExLlamaV2Cache_Q8
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | Initialized ExLlamaV2 model instance (config must be prepared) |
| batch_size | int | No (default 1) | Number of sequences to cache simultaneously |
| max_seq_len | int | No (default -1) | Maximum sequence length; -1 uses model's default from config |
| copy_from | ExLlamaV2CacheBase | No (default None) | Copy cache contents from another cache instance |
| lazy | bool | No (default False) | Defer memory allocation for use with load_autosplit() |
| num_key_value_heads | int | No (default None) | Override number of KV heads; None uses model config |
| fixed_device | torch.device | No (default None) | Force all cache tensors onto a specific device |
Outputs
| Name | Type | Description |
|---|---|---|
| cache instance | ExLlamaV2Cache | Cache object with KV storage for all layers, ready for inference |
| cache.current_seq_len | int | Current number of cached tokens (starts at 0) |
| cache.max_seq_len | int | Maximum cacheable sequence length |
| cache.batch_size | int | Number of concurrent sequences supported |
Usage Examples
Basic FP16 Cache
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
# Allocate FP16 cache with default sequence length
cache = ExLlamaV2Cache(model, batch_size=1)
Lazy Cache for Auto-Split
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
# Lazy cache: shapes are recorded now, GPU allocation is deferred
cache = ExLlamaV2Cache(model, lazy=True)
# Auto-split loads model and allocates cache across GPUs
model.load_autosplit(cache)
Quantized Cache for Memory Savings
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
# Q4 cache uses ~25% of FP16 memory
cache = ExLlamaV2Cache_Q4(model, lazy=True, max_seq_len=8192)
model.load_autosplit(cache)