
Principle:Turboderp org Exllamav2 Device Management

From Leeroopedia
Knowledge Sources
Domains: GPU_Computing, Device_Management, Multi_GPU
Last Updated: 2026-02-15 00:00 GMT

Overview

Device management handles the allocation and tracking of GPU resources across one or more CUDA devices, providing a per-device context that encapsulates memory pools, CUDA streams, and peer-to-peer access state.

Description

When running large language models across multiple GPUs, each device needs its own execution context. ExLlamaV2's device management system provides:

  • Per-device context: Each GPU gets an ExLlamaV2DeviceContext object that tracks the device index, available VRAM, allocated scratch space, and CUDA stream. Contexts are created lazily when a device is first used.
  • Scratch memory allocation: A shared scratch buffer is allocated per device for temporary computation results. The required size is computed during model loading based on the layers assigned to each device.
  • CUDA stream management: Each device context maintains its own CUDA stream for asynchronous kernel execution. Operations on different devices can overlap via their independent streams.
  • Peer-to-peer access: For multi-GPU setups, the system configures peer access between devices where available, enabling direct GPU-to-GPU memory transfers without routing through CPU memory.
  • Safe tensor movement: Utility functions handle moving tensors between devices with proper error handling and fallback to CPU staging when direct peer access is unavailable.
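The "safe tensor movement" idea above can be sketched as follows. This is a hypothetical, device-free simulation: `safe_move`, the tensor-descriptor dict, and `peer_matrix` are illustrative stand-ins, not ExLlamaV2's actual API; a real implementation would issue `tensor.to(dst)` or `tensor.to("cpu").to(dst)` on CUDA tensors.

```python
def safe_move(tensor_desc, dst, peer_matrix):
    """Move a (simulated) tensor to device `dst`.

    tensor_desc: dict with a 'device' key identifying the current device.
    peer_matrix: set of (src, dst) pairs for which direct peer access
                 is enabled. Returns a new descriptor whose 'path' records
                 the route the copy took.
    """
    src = tensor_desc["device"]
    if src == dst:
        return dict(tensor_desc, path=[src])          # already in place
    if (src, dst) in peer_matrix:
        # Direct GPU-to-GPU transfer (peer access available)
        return dict(tensor_desc, device=dst, path=[src, dst])
    # Fallback: stage through host memory, i.e. tensor.to("cpu").to(dst)
    return dict(tensor_desc, device=dst, path=[src, "cpu", dst])
```

The routing decision is the essential part: the caller never needs to know whether peer access exists, only that the tensor arrives on the target device.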

Usage

Device management is used when:

  • Multi-GPU model loading: Auto-split distributes model layers across GPUs, each needing its own device context
  • Tensor parallelism: TP operations require coordinated memory access across devices
  • Memory-constrained inference: Tracking per-device VRAM usage to prevent out-of-memory errors
  • Mixed-device operations: Moving tensors between GPUs during forward pass execution
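The "memory-constrained inference" case above amounts to keeping a per-device ledger of reserved VRAM and refusing allocations that would exceed capacity. A minimal sketch, with a hypothetical `VramTracker` class (not ExLlamaV2's actual bookkeeping):

```python
class VramTracker:
    """Hypothetical per-device VRAM ledger used to guard against OOM
    while assigning layers to devices during model loading."""

    def __init__(self, capacity_bytes):
        # capacity_bytes: dict mapping device index -> usable bytes
        self.capacity = dict(capacity_bytes)
        self.used = {d: 0 for d in capacity_bytes}

    def try_reserve(self, device, nbytes):
        """Reserve nbytes on `device`; return False instead of overflowing."""
        if self.used[device] + nbytes > self.capacity[device]:
            return False
        self.used[device] += nbytes
        return True
```

An auto-split loader would call `try_reserve` per layer and move on to the next device when it returns False.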

Theoretical Basis

Device Context Lifecycle

# Device contexts are created lazily:
# 1. Model.load() determines layer-to-device mapping
# 2. First access to device d creates ExLlamaV2DeviceContext(d)
# 3. Context allocates scratch buffer on device d
# 4. Peer access is configured between all active devices
# 5. During inference, each layer executes on its assigned device
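Steps 1–3 of the lifecycle above (lazy, one-per-device context creation) can be sketched as follows. The class names are hypothetical stand-ins modeled on `ExLlamaV2DeviceContext`; a real context would also allocate the scratch buffer and create a `torch.cuda.Stream` on the device.

```python
from dataclasses import dataclass

@dataclass
class DeviceContext:
    # Stand-in for ExLlamaV2DeviceContext: one per active GPU.
    device_idx: int
    scratch_bytes: int = 0
    stream: object = None  # would hold a torch.cuda.Stream on a real device

class ContextRegistry:
    """Creates device contexts lazily on first access and caches them."""

    def __init__(self):
        self._contexts = {}

    def get(self, device_idx, scratch_bytes=0):
        if device_idx not in self._contexts:
            # First touch of this device: create its context (step 2),
            # which would allocate scratch space on the device (step 3).
            self._contexts[device_idx] = DeviceContext(device_idx, scratch_bytes)
        return self._contexts[device_idx]

registry = ContextRegistry()
ctx0 = registry.get(0, scratch_bytes=64 * 1024**2)
```

Repeated lookups return the same cached context, so layers assigned to the same device share one scratch buffer and one stream.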

Peer Access Configuration

# For devices d1, d2 in the active set:
# If torch.cuda.can_device_access_peer(d1, d2):
#   Enable peer access in the CUDA backend (cudaDeviceEnablePeerAccess;
#   PyTorch exposes no public enable call, so this happens in C++/extension code)
#   # Direct GPU-to-GPU transfers enabled
# Else:
#   # Fallback: tensor.to("cpu").to(target_device)
#   # Higher latency, but always works
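The pairwise check above can be sketched as a small helper that enumerates ordered device pairs and records which support peer access. Here `can_access` is a stand-in predicate for `torch.cuda.can_device_access_peer`, injected so the sketch runs without GPUs:

```python
def build_peer_matrix(devices, can_access):
    """Return the set of ordered (src, dst) pairs with peer access.

    devices:    iterable of device indices in the active set.
    can_access: callable(d1, d2) -> bool; in a real system this would be
                torch.cuda.can_device_access_peer.
    """
    enabled = set()
    for d1 in devices:
        for d2 in devices:
            # Peer access is directional, so both orders are checked.
            if d1 != d2 and can_access(d1, d2):
                enabled.add((d1, d2))
    return enabled
```

The resulting set is exactly the `peer_matrix` a tensor-movement routine would consult when choosing between a direct copy and CPU staging.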

Related Pages

Implemented By

Related Principles
