
Implementation:Turboderp org Exllamav2 Load Autosplit

From Leeroopedia
Knowledge Sources
Domains Model_Loading, Multi_GPU, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tool for loading model weights with automatic distribution across available GPUs, provided by exllamav2.

Description

load_autosplit() is a method on the ExLlamaV2 model class that loads all model weights from disk and automatically distributes transformer layers across available CUDA devices. It works in conjunction with a lazily-allocated cache to ensure KV cache tensors are co-located with their corresponding model layers.

The method iterates through model layers, loading each onto the current GPU. When the remaining VRAM on the current device (after subtracting the reserve) is insufficient for the next layer, loading advances to the next GPU. The lazy cache tensors are materialized on the same device as their layer.

This approach eliminates the need for manual device maps and handles the common case where a model is slightly too large for one GPU but fits comfortably across two.
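The greedy placement described above can be sketched in plain Python. This is purely illustrative, not exllamav2's actual implementation; the names (plan_autosplit, layer_sizes, free_vram) are hypothetical, and the real method works with live CUDA allocations rather than a precomputed size list.

```python
# Illustrative sketch of the greedy auto-split strategy (not exllamav2 API).

def plan_autosplit(layer_sizes, free_vram, reserve=None):
    """Assign each layer to the first GPU with room, advancing when full.

    layer_sizes: bytes needed per layer, in load order
    free_vram:   free bytes per GPU, indexed by device ordinal
    reserve:     bytes to keep unused on each GPU (default: none)
    """
    reserve = reserve or [0] * len(free_vram)
    budgets = [f - r for f, r in zip(free_vram, reserve)]
    placement = []
    gpu = 0
    for size in layer_sizes:
        # Advance to the next device when the current one can't fit this layer
        while gpu < len(budgets) and budgets[gpu] < size:
            gpu += 1
        if gpu == len(budgets):
            raise MemoryError("model does not fit in available VRAM")
        budgets[gpu] -= size
        placement.append(gpu)
    return placement
```

Note that the split is sequential and greedy: once loading moves past a device, it never backtracks, so layers always occupy a contiguous run of GPU ordinals.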

Usage

Use load_autosplit() whenever loading a model for inference, especially when:

  • The model may exceed single-GPU VRAM capacity
  • You want automatic distribution across GPUs without hand-tuning a split
  • You prefer not to manually compute device maps

For single-GPU scenarios, load_autosplit() still works correctly, placing everything on GPU 0.

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/model.py
  • Lines: L476-524

Signature

def load_autosplit(
    self,
    cache: ExLlamaV2CacheBase,
    reserve_vram: list[int] | None = None,
    last_id_only: bool = False,
    callback: callable | None = None,
    callback_gen: callable | None = None,
    progress: bool = False,
):
    ...

Import

from exllamav2 import ExLlamaV2

# load_autosplit is a method on the ExLlamaV2 model instance

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| cache | ExLlamaV2CacheBase | Yes | Lazily-allocated cache instance (created with lazy=True) |
| reserve_vram | list[int] or None | No (default None) | Bytes to reserve on each GPU; None uses a sensible default. List is indexed by GPU ordinal. |
| last_id_only | bool | No (default False) | If True, model only outputs logits for the last token position (saves memory for generation-only use) |
| callback | callable or None | No (default None) | Called with progress info during loading (legacy interface) |
| callback_gen | callable or None | No (default None) | Generator-based callback for async progress updates |
| progress | bool | No (default False) | If True, display a tqdm progress bar during loading |
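Since reserve_vram takes raw byte counts indexed by GPU ordinal, a small conversion helper keeps call sites readable. This helper is not part of exllamav2; it is a hypothetical convenience shown here for illustration.

```python
def reserve_mb(*megabytes):
    """Build a reserve_vram-style list from per-GPU reservations in MiB.

    Illustrative helper, not part of exllamav2's API.
    reserve_mb(256, 512) -> [268435456, 536870912]
    """
    return [int(mb) * 1024**2 for mb in megabytes]
```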

Outputs

| Name | Type | Description |
|------|------|-------------|
| (side effect) | None | Model weights are loaded into GPU memory across available devices |
| (side effect) | None | Lazy cache tensors are allocated on the same devices as their corresponding layers |
| model.loaded | bool | Set to True upon successful loading |

Usage Examples

Basic Auto-Split Loading

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config("/path/to/model")
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)

model.load_autosplit(cache)

With VRAM Reservation and Progress

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config("/path/to/model")
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)

# Reserve 256 MB on GPU 0, 512 MB on GPU 1
reserve = [256 * 1024**2, 512 * 1024**2]
model.load_autosplit(cache, reserve_vram=reserve, progress=True)

Full Pipeline Example

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Tokenizer,
)

# 1. Configure
config = ExLlamaV2Config("/path/to/model")
config.prepare()

# 2. Create model and lazy cache
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)

# 3. Load with auto-split
model.load_autosplit(cache, progress=True)

# 4. Initialize tokenizer
tokenizer = ExLlamaV2Tokenizer(config)

# Model is now ready for inference

Related Pages

Implements Principle

Requires Environment
