Principle:Turboderp org Exllamav2 Multi GPU Compatibility
| Knowledge Sources | |
|---|---|
| Domains | GPU_Computing, Compatibility, Multi_GPU |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Multi-GPU compatibility provides safe abstractions for tensor movement between CUDA devices, handling edge cases in peer-to-peer access and providing backward-compatible wrappers for PyTorch API changes.
Description
Running models across multiple GPUs requires moving tensors between devices. This is complicated by several factors:
- Peer-to-peer transfer failures: Direct GPU-to-GPU memory copies can fail silently or throw exceptions on certain hardware configurations (e.g., different GPU architectures, PCIe topology limitations). Safe tensor movement functions catch these failures and fall back to CPU-staged transfers.
- PyTorch version compatibility: PyTorch APIs change between releases. Compatibility shims wrap these differences so that ExLlamaV2 can support multiple PyTorch versions without version checks scattered throughout the codebase. Examples include changes to CUDA graph APIs and tensor manipulation functions.
- Non-contiguous tensor handling: When tensors are views or slices, they may not occupy contiguous memory. This can cause issues with certain CUDA operations. The compatibility layer ensures tensors are made contiguous before device transfers when necessary.
- Exception recovery: When a peer-to-peer copy raises an exception, the transfer is retried automatically via CPU staging (tensor.cpu().to(target_device)), which is slower but does not depend on peer access between the two GPUs.
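The shim idea in the second bullet can be illustrated with feature detection: look up the newest name first and fall back to older ones. The sketch below is a generic pattern and not ExLlamaV2's code; `first_available`, `modern_op`, and `legacy_op` are hypothetical names, and the two `SimpleNamespace` objects merely stand in for two releases of a library.

```python
from types import SimpleNamespace


def first_available(namespace, *names):
    """Return the first attribute of `namespace` that exists among `names`."""
    for name in names:
        attr = getattr(namespace, name, None)
        if attr is not None:
            return attr
    raise AttributeError(f"none of {names} found on {namespace!r}")


# Hypothetical stand-ins for an older and a newer release of one library.
old_lib = SimpleNamespace(legacy_op=lambda x: x * 2)
new_lib = SimpleNamespace(modern_op=lambda x: x * 2, legacy_op=None)

# Resolve once, up front, so call sites never check versions themselves.
op_old = first_available(old_lib, "modern_op", "legacy_op")
op_new = first_available(new_lib, "modern_op", "legacy_op")
print(op_old(21), op_new(21))  # both names resolve to a working callable
```

Resolving the callable once keeps the version-dependent logic in a single module, which is the property the bullet above describes.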
Usage
Multi-GPU compatibility is used when:
- Auto-split loading: Distributing model layers across GPUs requires moving weight tensors between devices
- Tensor parallelism: TP gather/scatter operations move partial results between devices
- KV cache management: Cache pages may need to be migrated between devices during dynamic batching
- Cross-version deployment: Running the same ExLlamaV2 installation across different PyTorch versions
Theoretical Basis
Safe Tensor Movement
# safe_move_tensor(tensor, target_device):
#   1. If tensor.device == target_device: return tensor
#   2. Try: return tensor.to(target_device)
#   3. On RuntimeError (peer access failure):
#        return tensor.cpu().to(target_device)
#      This adds latency but guarantees success
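The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the library's actual implementation: the function is duck-typed (it only needs `.device`, `.to()`, and `.cpu()`, which `torch.Tensor` provides), and `FakeTensor` is a hypothetical stand-in used only to simulate a peer-access failure without multi-GPU hardware.

```python
def safe_move_tensor(tensor, target_device):
    """Move `tensor` to `target_device`, staging through CPU if the direct move raises."""
    # Step 1: nothing to do if the tensor already lives on the target.
    # Comparing string forms lets "cuda:1" match a device object printed as "cuda:1".
    if str(tensor.device) == str(target_device):
        return tensor
    try:
        # Step 2: direct move (may use peer-to-peer between CUDA devices).
        return tensor.to(target_device)
    except RuntimeError:
        # Step 3: fall back to a CPU-staged transfer, which is slower.
        return tensor.cpu().to(target_device)


class FakeTensor:
    """Hypothetical stand-in whose direct GPU-to-GPU moves always raise."""

    def __init__(self, device, hops=()):
        self.device = device
        self.hops = tuple(hops)  # devices visited so far

    def to(self, device):
        if self.device.startswith("cuda") and str(device).startswith("cuda"):
            raise RuntimeError("simulated peer access failure")
        return FakeTensor(str(device), self.hops + (str(device),))

    def cpu(self):
        return self.to("cpu")


moved = safe_move_tensor(FakeTensor("cuda:0"), "cuda:1")
print(moved.device, moved.hops)  # cuda:1 ('cpu', 'cuda:1')
```

Note that a `try`/`except` only covers failures that raise. A copy that fails silently would need a separate check, which this sketch does not attempt.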
GPU Peer Copy Protocol
# gpu_peer_copy(target, source):
#   Precondition: target and source on different CUDA devices
#   1. Ensure both tensors are contiguous
#   2. Try direct copy: target.copy_(source)
#   3. On failure: target.copy_(source.cpu())
#   Used for weight transfers during model loading
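A possible Python rendering of this protocol is shown below. It is a hedged sketch rather than ExLlamaV2's code, and it makes two assumptions beyond the steps above: a non-contiguous target is rejected instead of converted (a contiguous copy of the target would no longer alias the original buffer), and the precondition is enforced by comparing the string forms of the two devices. The function is duck-typed and relies only on `.device`, `.is_contiguous()`, `.contiguous()`, `.cpu()`, and `.copy_()`, all of which `torch.Tensor` provides.

```python
def gpu_peer_copy(target, source):
    """Copy `source` into the preallocated `target` in place and return `target`."""
    # Precondition: the two tensors live on different devices.
    if str(target.device) == str(source.device):
        raise ValueError("target and source must be on different devices")
    # Step 1: the target must already be contiguous, because copying into a
    # contiguous clone of it would leave the original buffer untouched.
    if not target.is_contiguous():
        raise ValueError("target must be contiguous")
    # The source can safely be replaced by a contiguous copy.
    source = source.contiguous()
    try:
        # Step 2: direct device-to-device copy.
        target.copy_(source)
    except RuntimeError:
        # Step 3: stage the source through host memory and copy from there.
        target.copy_(source.cpu())
    return target
```

With PyTorch, the target would typically be preallocated on the destination device, for example with `torch.empty_like(source, device="cuda:1")`, before calling the function.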