Principle: turboderp-org/exllamav2 Model Weight Loading
| Knowledge Sources | |
|---|---|
| Domains | Model_Loading, Multi_GPU, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Large language models must be loaded from disk into GPU memory for inference, with automatic device splitting to distribute layers across multiple GPUs when a model exceeds single-GPU capacity.
Description
Modern LLMs can range from a few gigabytes to hundreds of gigabytes in weight data. Loading these weights efficiently requires:
- Sequential layer loading: Weights are loaded layer by layer from safetensors files on disk into GPU memory. Each layer's tensors (attention projections, feed-forward weights, layer norms) are loaded as a unit.
- Automatic device splitting: When a model is too large for a single GPU, the auto-split algorithm distributes layers across available GPUs. It iteratively loads layers onto the current GPU, monitoring VRAM consumption. When the current GPU runs out of space (accounting for a configurable reserve), it moves to the next available GPU.
- Lazy cache co-location: When using auto-split with a lazy cache, the cache tensors for each layer are allocated on the same GPU as the layer's model weights. This avoids expensive cross-GPU data transfers during attention computation.
- VRAM reservation: Users can specify how much VRAM to reserve on each GPU for other purposes (operating system, other applications, intermediate computation buffers). The auto-split algorithm respects these reservations.
- Progress reporting: For large models that take significant time to load, callback functions and progress bars provide visibility into the loading process.
The loading process also handles weight format conversions, dequantization setup for EXL2/GPTQ weights, and validation that all expected tensors are present.
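As a concrete sketch of the sequential, per-layer loading idea, the snippet below groups safetensors-style tensor names by layer index so that each layer's tensors can be fetched as a unit. The `model.layers.N.` naming scheme follows common Hugging Face checkpoint conventions; the helper name itself is illustrative, not part of exllamav2's API:

```python
import re
from collections import defaultdict

# Illustrative helper: group checkpoint tensor names by transformer layer
# index, so each layer's tensors can be loaded to a device as one unit.
LAYER_PATTERN = re.compile(r"model\.layers\.(\d+)\.")

def group_by_layer(tensor_names):
    layers = defaultdict(list)
    for name in tensor_names:
        m = LAYER_PATTERN.match(name)
        if m:
            layers[int(m.group(1))].append(name)
    return dict(layers)
```

A loader would iterate the resulting dict in layer order, pulling each group from disk and placing it on the layer's assigned GPU before moving on.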
Usage
Model weight loading is performed after configuration and cache allocation, but before any inference:
- Use load_autosplit() for automatic multi-GPU distribution (recommended for most use cases)
- Use load() with explicit device maps for manual control over layer placement
- Always pass a lazy cache to load_autosplit() for proper cache co-location
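The three steps above can be sketched as follows. Class and method names follow exllamav2's public API, but signatures vary between versions, so treat this as a hedged sketch and check the version you have installed; the import is guarded so the sketch degrades gracefully when exllamav2 is absent:

```python
# Hedged sketch of the typical exllamav2 loading sequence (assumes a recent
# exllamav2 release; verify names against your installed version).
try:
    from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config
    HAVE_EXLLAMAV2 = True
except ImportError:
    HAVE_EXLLAMAV2 = False

def load_with_autosplit(model_dir):
    """Prepare config, model, and lazy cache, then auto-split across GPUs."""
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)  # lazy: allocated during load,
                                              # co-located with each layer
    model.load_autosplit(cache)
    return model, cache
```

Passing a non-lazy cache here would allocate all cache tensors on one device up front, defeating the per-layer co-location that auto-split provides.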
Theoretical Basis
# Auto-split algorithm pseudocode:
function load_autosplit(model, cache, reserve_vram):
    current_gpu = 0
    for layer in model.layers:
        layer_size = estimate_layer_memory(layer)
        cache_size = estimate_cache_memory(layer, cache)
        available = gpu_free_memory(current_gpu) - reserve_vram[current_gpu]
        while layer_size + cache_size > available:
            current_gpu += 1
            if current_gpu >= num_gpus:
                raise OutOfMemoryError("Model too large for available GPUs")
            available = gpu_free_memory(current_gpu) - reserve_vram[current_gpu]
        load_layer_to_device(layer, gpu=current_gpu)
        allocate_cache_on_device(cache, layer, gpu=current_gpu)
    return model, cache
The algorithm is greedy: it fills each GPU sequentially until its usable VRAM is exhausted, then moves to the next. This simple approach works well in practice because transformer layers are nearly uniform in size, so the resulting distribution comes out roughly even.
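The greedy policy can also be simulated in plain Python. The self-contained sketch below (function and variable names are illustrative; sizes can be in any consistent unit, e.g. bytes) returns the GPU index assigned to each layer, co-locating each layer's cache with its weights:

```python
def greedy_autosplit(layer_sizes, cache_sizes, gpu_free, reserve):
    """Simulate greedy auto-split: assign each layer (weights + cache)
    to the first GPU with enough unreserved free memory remaining."""
    placement = []
    gpu = 0
    remaining = gpu_free[gpu] - reserve[gpu]
    for layer_size, cache_size in zip(layer_sizes, cache_sizes):
        need = layer_size + cache_size
        while need > remaining:
            gpu += 1  # current GPU full: advance to the next one
            if gpu >= len(gpu_free):
                raise MemoryError("Model too large for available GPUs")
            remaining = gpu_free[gpu] - reserve[gpu]
        placement.append(gpu)  # layer and its cache land on the same GPU
        remaining -= need
    return placement
```

For example, six layers needing 5 units each across two GPUs with 15 usable units apiece split evenly, three layers per GPU.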
Memory Estimation
# Per-layer memory for a typical transformer:
# Attention: q_proj + k_proj + v_proj + o_proj
# FFN: gate_proj + up_proj + down_proj (for SwiGLU)
# Norms: input_layernorm + post_attention_layernorm
#
# For EXL2/GPTQ: actual memory depends on quantization bits per weight
# For FP16: each parameter = 2 bytes
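A back-of-the-envelope estimate for the layer shapes listed above can be written directly. The sketch below assumes a Llama-style decoder layer with SwiGLU and optional grouped-query attention; parameter names are illustrative, and real quantized formats (EXL2/GPTQ) add scale/zero-point overhead not counted here:

```python
def layer_param_count(hidden, intermediate, num_heads, num_kv_heads):
    """Count weights in one Llama-style decoder layer."""
    head_dim = hidden // num_heads
    kv_dim = head_dim * num_kv_heads   # smaller than hidden under GQA
    attn = (hidden * hidden            # q_proj
            + 2 * hidden * kv_dim      # k_proj, v_proj
            + hidden * hidden)         # o_proj
    ffn = 3 * hidden * intermediate    # gate_proj, up_proj, down_proj
    norms = 2 * hidden                 # input + post-attention norm weights
    return attn + ffn + norms

def layer_bytes(params, bits_per_weight=16.0):
    """Approximate weight storage at a given average bit width."""
    return int(params * bits_per_weight / 8)
```

For 7B-class dimensions (hidden 4096, intermediate 11008, 32 heads) this gives roughly 0.4 GB per layer at FP16, so 32 such layers account for the bulk of the model's ~13 GB weight footprint.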