Principle: turboderp-org/exllamav2 Model Weight Loading
| Knowledge Sources | |
|---|---|
| Domains | Model_Loading, Multi_GPU, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Large language models must be loaded from disk into GPU memory for inference, with automatic device splitting to distribute layers across multiple GPUs when a model exceeds single-GPU capacity.
Description
Modern LLMs can range from a few gigabytes to hundreds of gigabytes in weight data. Loading these weights efficiently requires:
- Sequential layer loading: Weights are loaded layer by layer from safetensors files on disk into GPU memory. Each layer's tensors (attention projections, feed-forward weights, layer norms) are loaded as a unit.
- Automatic device splitting: When a model is too large for a single GPU, the auto-split algorithm distributes layers across available GPUs. It iteratively loads layers onto the current GPU, monitoring VRAM consumption. When the current GPU runs out of space (accounting for a configurable reserve), it moves to the next available GPU.
- Lazy cache co-location: When using auto-split with a lazy cache, the cache tensors for each layer are allocated on the same GPU as the layer's model weights. This avoids expensive cross-GPU data transfers during attention computation.
- VRAM reservation: Users can specify how much VRAM to reserve on each GPU for other purposes (operating system, other applications, intermediate computation buffers). The auto-split algorithm respects these reservations.
- Progress reporting: For large models that take significant time to load, callback functions and progress bars provide visibility into the loading process.
The loading process also handles weight format conversions, dequantization setup for EXL2/GPTQ weights, and validation that all expected tensors are present.
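As a concrete sketch of the sequential, per-layer loading idea, the snippet below groups safetensors-style tensor names by layer index so that each layer's tensors can be fetched as a unit. The `model.layers.N.` naming scheme follows common Hugging Face checkpoint conventions; the helper name itself is illustrative, not part of exllamav2's API:

```python
import re
from collections import defaultdict

# Illustrative helper: group checkpoint tensor names by transformer layer
# index, so each layer's tensors can be loaded to a device as one unit.
LAYER_PATTERN = re.compile(r"model\.layers\.(\d+)\.")

def group_by_layer(tensor_names):
    layers = defaultdict(list)
    for name in tensor_names:
        m = LAYER_PATTERN.match(name)
        if m:
            layers[int(m.group(1))].append(name)
    return dict(layers)
```

A loader would iterate the resulting dict in layer order, pulling each group from disk and placing it on the layer's assigned GPU before moving on.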
Usage
Model weight loading is performed after configuration and cache allocation, but before any inference:
- Use load_autosplit() for automatic multi-GPU distribution (recommended for most use cases)
- Use load() with explicit device maps for manual control over layer placement
- Always pass a lazy cache to load_autosplit() for proper cache co-location
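The three steps above can be sketched as follows. Class and method names follow exllamav2's public API, but signatures vary between versions, so treat this as a hedged sketch and check the version you have installed; the import is guarded so the sketch degrades gracefully when exllamav2 is absent:

```python
# Hedged sketch of the typical exllamav2 loading sequence (assumes a recent
# exllamav2 release; verify names against your installed version).
try:
    from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config
    HAVE_EXLLAMAV2 = True
except ImportError:
    HAVE_EXLLAMAV2 = False

def load_with_autosplit(model_dir):
    """Prepare config, model, and lazy cache, then auto-split across GPUs."""
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)  # lazy: allocated during load,
                                              # co-located with each layer
    model.load_autosplit(cache)
    return model, cache
```

Passing a non-lazy cache here would allocate all cache tensors on one device up front, defeating the per-layer co-location that auto-split provides.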
Theoretical Basis
# Auto-split algorithm pseudocode:
function load_autosplit(model, cache, reserve_vram):
    current_gpu = 0
    for layer in model.layers:
        layer_size = estimate_layer_memory(layer)
        cache_size = estimate_cache_memory(layer, cache)
        available = gpu_free_memory(current_gpu) - reserve_vram[current_gpu]
        while layer_size + cache_size > available:
            current_gpu += 1
            if current_gpu >= num_gpus:
                raise OutOfMemoryError("Model too large for available GPUs")
            available = gpu_free_memory(current_gpu) - reserve_vram[current_gpu]
        load_layer_to_device(layer, gpu=current_gpu)
        allocate_cache_on_device(cache, layer, gpu=current_gpu)
    return model, cache
The algorithm is greedy: it fills each GPU sequentially until its usable VRAM is exhausted, then moves to the next. This simple approach works well in practice because transformer layers are nearly uniform in size, so the resulting distribution comes out roughly even.
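The greedy policy can also be simulated in plain Python. The self-contained sketch below (function and variable names are illustrative; sizes can be in any consistent unit, e.g. bytes) returns the GPU index assigned to each layer, co-locating each layer's cache with its weights:

```python
def greedy_autosplit(layer_sizes, cache_sizes, gpu_free, reserve):
    """Simulate greedy auto-split: assign each layer (weights + cache)
    to the first GPU with enough unreserved free memory remaining."""
    placement = []
    gpu = 0
    remaining = gpu_free[gpu] - reserve[gpu]
    for layer_size, cache_size in zip(layer_sizes, cache_sizes):
        need = layer_size + cache_size
        while need > remaining:
            gpu += 1  # current GPU full: advance to the next one
            if gpu >= len(gpu_free):
                raise MemoryError("Model too large for available GPUs")
            remaining = gpu_free[gpu] - reserve[gpu]
        placement.append(gpu)  # layer and its cache land on the same GPU
        remaining -= need
    return placement
```

For example, six layers needing 5 units each across two GPUs with 15 usable units apiece split evenly, three layers per GPU.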
Memory Estimation
# Per-layer memory for a typical transformer:
# Attention: q_proj + k_proj + v_proj + o_proj
# FFN: gate_proj + up_proj + down_proj (for SwiGLU)
# Norms: input_layernorm + post_attention_layernorm
#
# For EXL2/GPTQ: actual memory depends on quantization bits per weight
# For FP16: each parameter = 2 bytes
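A back-of-the-envelope estimate for the layer shapes listed above can be written directly. The sketch below assumes a Llama-style decoder layer with SwiGLU and optional grouped-query attention; parameter names are illustrative, and real quantized formats (EXL2/GPTQ) add scale/zero-point overhead not counted here:

```python
def layer_param_count(hidden, intermediate, num_heads, num_kv_heads):
    """Count weights in one Llama-style decoder layer."""
    head_dim = hidden // num_heads
    kv_dim = head_dim * num_kv_heads   # smaller than hidden under GQA
    attn = (hidden * hidden            # q_proj
            + 2 * hidden * kv_dim      # k_proj, v_proj
            + hidden * hidden)         # o_proj
    ffn = 3 * hidden * intermediate    # gate_proj, up_proj, down_proj
    norms = 2 * hidden                 # input + post-attention norm weights
    return attn + ffn + norms

def layer_bytes(params, bits_per_weight=16.0):
    """Approximate weight storage at a given average bit width."""
    return int(params * bits_per_weight / 8)
```

For 7B-class dimensions (hidden 4096, intermediate 11008, 32 heads) this gives roughly 0.4 GB per layer at FP16, so 32 such layers account for the bulk of the model's ~13 GB weight footprint.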