Implementation: turboderp/exllamav2 load_autosplit()
| Knowledge Sources | |
|---|---|
| Domains | Model_Loading, Multi_GPU, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete method for loading model weights with automatic distribution across available GPUs, provided by exllamav2.
Description
load_autosplit() is a method on the ExLlamaV2 model class that loads all model weights from disk and automatically distributes transformer layers across available CUDA devices. It works in conjunction with a lazily-allocated cache to ensure KV cache tensors are co-located with their corresponding model layers.
The method iterates through the model's layers, loading each onto the current GPU. When the free VRAM on the current device, less the configured reserve, is insufficient for the next layer, loading advances to the next GPU. Lazy cache tensors are materialized on the same device as their corresponding layer.
This approach eliminates the need for manual device maps and handles the common case where a model is slightly too large for one GPU but fits comfortably across two.
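The greedy placement strategy described above can be sketched in plain Python. This is an illustrative toy, not exllamav2's actual implementation; the function name, layer sizes, and VRAM figures are invented for the example.

```python
# Illustrative sketch (not exllamav2 source): greedy layer placement in the
# spirit of load_autosplit(). All sizes below are made up for demonstration.

def greedy_autosplit(layer_sizes, free_vram, reserve=None):
    """Assign each layer to the first GPU with room, advancing when full.

    layer_sizes: bytes needed per transformer layer, in load order
    free_vram:   free bytes per GPU, indexed by device ordinal
    reserve:     bytes to keep free on each GPU (like reserve_vram)
    """
    reserve = reserve or [0] * len(free_vram)
    budget = [f - r for f, r in zip(free_vram, reserve)]
    device = 0
    placement = []
    for size in layer_sizes:
        # Advance to the next GPU when the current one cannot fit this layer;
        # the loop never moves backward, mirroring the greedy description.
        while device < len(budget) and budget[device] < size:
            device += 1
        if device == len(budget):
            raise MemoryError("model does not fit in available VRAM")
        budget[device] -= size
        placement.append(device)
    return placement

# Two 8 GB GPUs, 512 MB reserved on each, ten 1.5 GB layers:
GB = 1024**3
plan = greedy_autosplit([int(1.5 * GB)] * 10, [8 * GB, 8 * GB],
                        reserve=[GB // 2, GB // 2])
print(plan)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

With 7.5 GB usable per device, exactly five 1.5 GB layers land on GPU 0 before loading spills onto GPU 1. The same logic covers the single-GPU case: with one device and enough room, every layer is placed on GPU 0.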
Usage
Use load_autosplit() whenever loading a model for inference, especially when:
- The model may exceed single-GPU VRAM capacity
- You want automatic GPU distribution without hand-tuning a split
- You prefer not to manually compute device maps
For single-GPU scenarios, load_autosplit() still works correctly, placing everything on GPU 0.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/model.py
- Lines: L476-524
Signature
def load_autosplit(
    self,
    cache: ExLlamaV2CacheBase,
    reserve_vram: list[int] | None = None,
    last_id_only: bool = False,
    callback: callable | None = None,
    callback_gen: callable | None = None,
    progress: bool = False,
):
    ...
Import
from exllamav2 import ExLlamaV2
# load_autosplit is a method on the ExLlamaV2 model instance
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cache | ExLlamaV2CacheBase | Yes | Lazily-allocated cache instance (created with lazy=True) |
| reserve_vram | list[int] or None | No (default None) | Bytes to reserve on each GPU; None uses a sensible default. List is indexed by GPU ordinal. |
| last_id_only | bool | No (default False) | If True, model only outputs logits for the last token position (saves memory for generation-only use) |
| callback | callable or None | No (default None) | Called with progress info during loading (legacy interface) |
| callback_gen | callable or None | No (default None) | Generator-based callback for async progress updates |
| progress | bool | No (default False) | If True, display a tqdm progress bar during loading |
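Since reserve_vram takes raw byte counts indexed by GPU ordinal, a small conversion helper on the calling side can avoid arithmetic slips. This helper is a suggestion for user code, not part of the exllamav2 API.

```python
# Hypothetical helper (not provided by exllamav2): build a reserve_vram list
# from per-GPU megabyte figures, since the parameter expects bytes per ordinal.

def reserve_from_mb(mb_per_gpu):
    """Convert a list of per-GPU reservations in MB to bytes."""
    return [mb * 1024**2 for mb in mb_per_gpu]

print(reserve_from_mb([256, 512]))  # [268435456, 536870912]
```

The resulting list can be passed directly as `reserve_vram=reserve_from_mb([256, 512])`.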
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | Model weights are loaded into GPU memory across available devices |
| (side effect) | None | Lazy cache tensors are allocated on the same devices as their corresponding layers |
| model.loaded | bool | Set to True upon successful loading |
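The cache co-location side effect can be pictured with a small toy: once each layer has been assigned a device, its lazy KV cache entry is materialized on that same device. This illustrates the invariant only; the function below is hypothetical and does not reflect exllamav2 internals.

```python
# Toy illustration (assumption, not exllamav2 code): mirror a layer-to-device
# placement onto the lazily-declared cache entries, so each cache tensor
# would land on the same device as its layer.

def materialize_cache(layer_devices):
    """Return a per-layer cache placement matching the layer placement."""
    return [{"layer": i, "device": dev} for i, dev in enumerate(layer_devices)]

cache_plan = materialize_cache([0, 0, 1, 1])
print([entry["device"] for entry in cache_plan])  # [0, 0, 1, 1]
```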
Usage Examples
Basic Auto-Split Loading
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
With VRAM Reservation and Progress
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
# Reserve 256 MB on GPU 0, 512 MB on GPU 1
reserve = [256 * 1024**2, 512 * 1024**2]
model.load_autosplit(cache, reserve_vram=reserve, progress=True)
Full Pipeline Example
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)
# 1. Configure
config = ExLlamaV2Config("/path/to/model")
config.prepare()
# 2. Create model and lazy cache
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
# 3. Load with auto-split
model.load_autosplit(cache, progress=True)
# 4. Initialize tokenizer
tokenizer = ExLlamaV2Tokenizer(config)
# Model is now ready for inference