Implementation: turboderp/exllamav2 load_autosplit()
| Knowledge Sources | |
|---|---|
| Domains | Model_Loading, Multi_GPU, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete method for loading model weights with automatic distribution across available GPUs, provided by exllamav2.
Description
load_autosplit() is a method on the ExLlamaV2 model class that loads all model weights from disk and automatically distributes transformer layers across available CUDA devices. It works in conjunction with a lazily-allocated cache to ensure KV cache tensors are co-located with their corresponding model layers.
The method iterates through the model's layers, loading each onto the current GPU. When the free VRAM on the current device, less the configured reserve, is insufficient for the next layer, loading advances to the next GPU. Lazy cache tensors are materialized on the same device as their corresponding layer.
This approach eliminates the need for manual device maps and handles the common case where a model is slightly too large for one GPU but fits comfortably across two.
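The greedy placement strategy described above can be sketched in plain Python. This is an illustrative toy, not exllamav2's actual implementation; the function name, layer sizes, and VRAM figures are invented for the example.

```python
# Illustrative sketch (not exllamav2 source): greedy layer placement in the
# spirit of load_autosplit(). All sizes below are made up for demonstration.

def greedy_autosplit(layer_sizes, free_vram, reserve=None):
    """Assign each layer to the first GPU with room, advancing when full.

    layer_sizes: bytes needed per transformer layer, in load order
    free_vram:   free bytes per GPU, indexed by device ordinal
    reserve:     bytes to keep free on each GPU (like reserve_vram)
    """
    reserve = reserve or [0] * len(free_vram)
    budget = [f - r for f, r in zip(free_vram, reserve)]
    device = 0
    placement = []
    for size in layer_sizes:
        # Advance to the next GPU when the current one cannot fit this layer;
        # the loop never moves backward, mirroring the greedy description.
        while device < len(budget) and budget[device] < size:
            device += 1
        if device == len(budget):
            raise MemoryError("model does not fit in available VRAM")
        budget[device] -= size
        placement.append(device)
    return placement

# Two 8 GB GPUs, 512 MB reserved on each, ten 1.5 GB layers:
GB = 1024**3
plan = greedy_autosplit([int(1.5 * GB)] * 10, [8 * GB, 8 * GB],
                        reserve=[GB // 2, GB // 2])
print(plan)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

With 7.5 GB usable per device, exactly five 1.5 GB layers land on GPU 0 before loading spills onto GPU 1. The same logic covers the single-GPU case: with one device and enough room, every layer is placed on GPU 0.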
Usage
Use load_autosplit() whenever loading a model for inference, especially when:
- The model may exceed single-GPU VRAM capacity
- You want automatic GPU distribution without hand-tuning a split
- You prefer not to manually compute device maps
For single-GPU scenarios, load_autosplit() still works correctly, placing everything on GPU 0.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/model.py
- Lines: L476-524
Signature
def load_autosplit(
    self,
    cache: ExLlamaV2CacheBase,
    reserve_vram: list[int] | None = None,
    last_id_only: bool = False,
    callback: callable | None = None,
    callback_gen: callable | None = None,
    progress: bool = False,
):
    ...
Import
from exllamav2 import ExLlamaV2
# load_autosplit is a method on the ExLlamaV2 model instance
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cache | ExLlamaV2CacheBase | Yes | Lazily-allocated cache instance (created with lazy=True) |
| reserve_vram | list[int] or None | No (default None) | Bytes to reserve on each GPU; None uses a sensible default. List is indexed by GPU ordinal. |
| last_id_only | bool | No (default False) | If True, model only outputs logits for the last token position (saves memory for generation-only use) |
| callback | callable or None | No (default None) | Called with progress info during loading (legacy interface) |
| callback_gen | callable or None | No (default None) | Generator-based callback for async progress updates |
| progress | bool | No (default False) | If True, display a tqdm progress bar during loading |
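Since reserve_vram takes raw byte counts indexed by GPU ordinal, a small conversion helper on the calling side can avoid arithmetic slips. This helper is a suggestion for user code, not part of the exllamav2 API.

```python
# Hypothetical helper (not provided by exllamav2): build a reserve_vram list
# from per-GPU megabyte figures, since the parameter expects bytes per ordinal.

def reserve_from_mb(mb_per_gpu):
    """Convert a list of per-GPU reservations in MB to bytes."""
    return [mb * 1024**2 for mb in mb_per_gpu]

print(reserve_from_mb([256, 512]))  # [268435456, 536870912]
```

The resulting list can be passed directly as `reserve_vram=reserve_from_mb([256, 512])`.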
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | Model weights are loaded into GPU memory across available devices |
| (side effect) | None | Lazy cache tensors are allocated on the same devices as their corresponding layers |
| model.loaded | bool | Set to True upon successful loading |
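The cache co-location side effect can be pictured with a small toy: once each layer has been assigned a device, its lazy KV cache entry is materialized on that same device. This illustrates the invariant only; the function below is hypothetical and does not reflect exllamav2 internals.

```python
# Toy illustration (assumption, not exllamav2 code): mirror a layer-to-device
# placement onto the lazily-declared cache entries, so each cache tensor
# would land on the same device as its layer.

def materialize_cache(layer_devices):
    """Return a per-layer cache placement matching the layer placement."""
    return [{"layer": i, "device": dev} for i, dev in enumerate(layer_devices)]

cache_plan = materialize_cache([0, 0, 1, 1])
print([entry["device"] for entry in cache_plan])  # [0, 0, 1, 1]
```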
Usage Examples
Basic Auto-Split Loading
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
With VRAM Reservation and Progress
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
# Reserve 256 MB on GPU 0, 512 MB on GPU 1
reserve = [256 * 1024**2, 512 * 1024**2]
model.load_autosplit(cache, reserve_vram=reserve, progress=True)
Full Pipeline Example
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)
# 1. Configure
config = ExLlamaV2Config("/path/to/model")
config.prepare()
# 2. Create model and lazy cache
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
# 3. Load with auto-split
model.load_autosplit(cache, progress=True)
# 4. Initialize tokenizer
tokenizer = ExLlamaV2Tokenizer(config)
# Model is now ready for inference