Principle: turboderp-org/exllamav2 Device Management
| Knowledge Sources | |
|---|---|
| Domains | GPU_Computing, Device_Management, Multi_GPU |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Device management handles the allocation and tracking of GPU resources across one or more CUDA devices, providing a per-device context that encapsulates memory pools, CUDA streams, and peer-to-peer access state.
Description
When running large language models across multiple GPUs, each device needs its own execution context. ExLlamaV2's device management system provides:
- Per-device context: Each GPU gets an ExLlamaV2DeviceContext object that tracks the device index, available VRAM, allocated scratch space, and CUDA stream. Contexts are created lazily when a device is first used.
- Scratch memory allocation: A shared scratch buffer is allocated per device for temporary computation results. The required size is computed during model loading based on the layers assigned to each device.
- CUDA stream management: Each device context maintains its own CUDA stream for asynchronous kernel execution. Operations on different devices can overlap via their independent streams.
- Peer-to-peer access: For multi-GPU setups, the system configures peer access between devices where available, enabling direct GPU-to-GPU memory transfers without routing through CPU memory.
- Safe tensor movement: Utility functions handle moving tensors between devices with proper error handling and fallback to CPU staging when direct peer access is unavailable.
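The pieces above can be sketched as a minimal, framework-agnostic Python class. All names and fields here are illustrative stand-ins, not ExLlamaV2's actual API; a real context would allocate VRAM and create a CUDA stream instead of the placeholders used here.

```python
class DeviceContext:
    """Illustrative per-GPU context (names assumed, not the real API)."""

    def __init__(self, index: int, scratch_bytes: int):
        self.index = index                   # CUDA device ordinal
        self.scratch_bytes = scratch_bytes   # size of the shared scratch buffer
        self.stream = object()               # stands in for a CUDA stream
        self.peers: set[int] = set()         # devices with direct peer access

    def enable_peer(self, other: "DeviceContext") -> None:
        # Record symmetric peer access between two devices.
        self.peers.add(other.index)
        other.peers.add(self.index)
```

Keeping the stream and scratch buffer inside the context object is what lets operations on different devices proceed independently: nothing is shared across contexts except the explicit peer links.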
Usage
Device management is used when:
- Multi-GPU model loading: Auto-split distributes model layers across GPUs, each needing its own device context
- Tensor parallelism: TP operations require coordinated memory access across devices
- Memory-constrained inference: Tracking per-device VRAM usage to prevent out-of-memory errors
- Mixed-device operations: Moving tensors between GPUs during forward pass execution
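To illustrate the auto-split scenario, here is a greedy layer-placement sketch. It is a simplification under stated assumptions: sizes are in arbitrary units, and a real splitter must also reserve room for scratch buffers and the KV cache on each device.

```python
def auto_split(layer_sizes, free_vram):
    """Greedy sketch: fill device 0, then spill to the next device.

    layer_sizes -- per-layer memory cost (arbitrary units)
    free_vram   -- usable capacity per device, in the same units
    """
    assignment = []
    dev, used = 0, 0
    for size in layer_sizes:
        while used + size > free_vram[dev]:
            dev += 1
            used = 0
            if dev >= len(free_vram):
                raise MemoryError("model does not fit on available GPUs")
        assignment.append(dev)
        used += size
    return assignment
```

For example, four layers of size 4 on two 10-unit devices land as `[0, 0, 1, 1]`: the third layer would overflow device 0, so placement spills to device 1.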
Theoretical Basis
Device Context Lifecycle
# Device contexts are created lazily:
# 1. Model.load() determines layer-to-device mapping
# 2. First access to device d creates ExLlamaV2DeviceContext(d)
# 3. Context allocates scratch buffer on device d
# 4. Peer access is configured between all active devices
# 5. During inference, each layer executes on its assigned device
Peer Access Configuration
# For devices d1, d2 in the active set:
#     If torch.cuda.can_device_access_peer(d1, d2):
#         Enable peer access (done via cudaDeviceEnablePeerAccess on the
#         native side; PyTorch exposes no public Python call for this)
#         # Direct GPU-to-GPU transfers enabled
#     Else:
#         # Fallback: tensor.to("cpu").to(target_device)
#         # Higher latency, but always works
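The branch above can be expressed as a small routing helper. This is a pure-Python sketch: `can_access_peer` stands in for `torch.cuda.can_device_access_peer`, and the string `"cpu"` marks a host-staging hop.

```python
def copy_route(src, dst, can_access_peer):
    """Return the sequence of devices a tensor copy passes through.

    can_access_peer(a, b) -- predicate standing in for
    torch.cuda.can_device_access_peer; "cpu" marks host staging.
    """
    if src == dst:
        return [src]                 # already on the target device
    if can_access_peer(src, dst):
        return [src, dst]            # direct GPU-to-GPU transfer
    return [src, "cpu", dst]         # fallback: stage through host memory
```

With peer access only between devices 0 and 1, a copy from 0 to 2 routes through the CPU while a copy from 0 to 1 goes direct.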
Related Pages
Implemented By
Related Principles