Principle: turboderp-org/exllamav2 Device Management
| Knowledge Sources | |
|---|---|
| Domains | GPU_Computing, Device_Management, Multi_GPU |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Device management handles the allocation and tracking of GPU resources across one or more CUDA devices, providing a per-device context that encapsulates memory pools, CUDA streams, and peer-to-peer access state.
Description
When running large language models across multiple GPUs, each device needs its own execution context. ExLlamaV2's device management system provides:
- Per-device context: Each GPU gets an ExLlamaV2DeviceContext object that tracks the device index, available VRAM, allocated scratch space, and CUDA stream. Contexts are created lazily when a device is first used.
- Scratch memory allocation: A shared scratch buffer is allocated per device for temporary computation results. The required size is computed during model loading based on the layers assigned to each device.
- CUDA stream management: Each device context maintains its own CUDA stream for asynchronous kernel execution. Operations on different devices can overlap via their independent streams.
- Peer-to-peer access: For multi-GPU setups, the system configures peer access between devices where available, enabling direct GPU-to-GPU memory transfers without routing through CPU memory.
- Safe tensor movement: Utility functions handle moving tensors between devices with proper error handling and fallback to CPU staging when direct peer access is unavailable.
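The pieces above can be sketched as a minimal, framework-agnostic Python class. All names and fields here are illustrative stand-ins, not ExLlamaV2's actual API; a real context would allocate VRAM and create a CUDA stream instead of the placeholders used here.

```python
class DeviceContext:
    """Illustrative per-GPU context (names assumed, not the real API)."""

    def __init__(self, index: int, scratch_bytes: int):
        self.index = index                   # CUDA device ordinal
        self.scratch_bytes = scratch_bytes   # size of the shared scratch buffer
        self.stream = object()               # stands in for a CUDA stream
        self.peers: set[int] = set()         # devices with direct peer access

    def enable_peer(self, other: "DeviceContext") -> None:
        # Record symmetric peer access between two devices.
        self.peers.add(other.index)
        other.peers.add(self.index)
```

Keeping the stream and scratch buffer inside the context object is what lets operations on different devices proceed independently: nothing is shared across contexts except the explicit peer links.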
Usage
Device management is used when:
- Multi-GPU model loading: Auto-split distributes model layers across GPUs, each needing its own device context
- Tensor parallelism: TP operations require coordinated memory access across devices
- Memory-constrained inference: Tracking per-device VRAM usage to prevent out-of-memory errors
- Mixed-device operations: Moving tensors between GPUs during forward pass execution
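To illustrate the auto-split scenario, here is a greedy layer-placement sketch. It is a simplification under stated assumptions: sizes are in arbitrary units, and a real splitter must also reserve room for scratch buffers and the KV cache on each device.

```python
def auto_split(layer_sizes, free_vram):
    """Greedy sketch: fill device 0, then spill to the next device.

    layer_sizes -- per-layer memory cost (arbitrary units)
    free_vram   -- usable capacity per device, in the same units
    """
    assignment = []
    dev, used = 0, 0
    for size in layer_sizes:
        while used + size > free_vram[dev]:
            dev += 1
            used = 0
            if dev >= len(free_vram):
                raise MemoryError("model does not fit on available GPUs")
        assignment.append(dev)
        used += size
    return assignment
```

For example, four layers of size 4 on two 10-unit devices land as `[0, 0, 1, 1]`: the third layer would overflow device 0, so placement spills to device 1.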
Theoretical Basis
Device Context Lifecycle
# Device contexts are created lazily:
# 1. Model.load() determines layer-to-device mapping
# 2. First access to device d creates ExLlamaV2DeviceContext(d)
# 3. Context allocates scratch buffer on device d
# 4. Peer access is configured between all active devices
# 5. During inference, each layer executes on its assigned device
Peer Access Configuration
# For devices d1, d2 in the active set:
#     If torch.cuda.can_device_access_peer(d1, d2):
#         Enable peer access (done via cudaDeviceEnablePeerAccess on the
#         native side; PyTorch exposes no public Python call for this)
#         # Direct GPU-to-GPU transfers enabled
#     Else:
#         # Fallback: tensor.to("cpu").to(target_device)
#         # Higher latency, but always works
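The branch above can be expressed as a small routing helper. This is a pure-Python sketch: `can_access_peer` stands in for `torch.cuda.can_device_access_peer`, and the string `"cpu"` marks a host-staging hop.

```python
def copy_route(src, dst, can_access_peer):
    """Return the sequence of devices a tensor copy passes through.

    can_access_peer(a, b) -- predicate standing in for
    torch.cuda.can_device_access_peer; "cpu" marks host staging.
    """
    if src == dst:
        return [src]                 # already on the target device
    if can_access_peer(src, dst):
        return [src, dst]            # direct GPU-to-GPU transfer
    return [src, "cpu", dst]         # fallback: stage through host memory
```

With peer access only between devices 0 and 1, a copy from 0 to 2 routes through the CPU while a copy from 0 to 1 goes direct.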
Related Pages
Implemented By
Related Principles