Principle:Turboderp org Exllamav2 Multi GPU Compatibility

Knowledge Sources	ExLlamaV2
Domains	GPU_Computing, Compatibility, Multi_GPU
Last Updated	2026-02-15 00:00 GMT

Overview

The multi-GPU compatibility layer provides safe abstractions for moving tensors between CUDA devices, handles edge cases in peer-to-peer access, and supplies backward-compatible wrappers for PyTorch API changes.

Description

Running models across multiple GPUs requires moving tensors between devices. This is complicated by several factors:

Peer-to-peer transfer failures: Direct GPU-to-GPU memory copies can fail silently or raise exceptions on some hardware configurations (e.g., mixed GPU architectures or PCIe topology limitations). Safe tensor movement functions handle these failures by falling back to CPU-staged transfers.

PyTorch version compatibility: As PyTorch evolves, APIs change between versions. Compatibility shims wrap these differences, allowing ExLlamaV2 to support multiple PyTorch versions without conditional logic scattered throughout the codebase. Examples include changes to CUDA graph APIs and tensor manipulation functions.
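As an illustration of the shim pattern described above, here is a minimal sketch. It is not taken from ExLlamaV2: the names `inference_ctx` and `run_without_grad` are hypothetical, and the API pair chosen (`torch.inference_mode`, available from PyTorch 1.9, with `torch.no_grad` as the older fallback) is only an example of resolving a version difference once, in one place.

```python
import torch

# Hypothetical shim: resolve the version difference once at import time.
# torch.inference_mode exists in PyTorch 1.9+; torch.no_grad is the older fallback.
inference_ctx = getattr(torch, "inference_mode", torch.no_grad)


def run_without_grad(fn, *args):
    """Call fn(*args) with autograd tracking disabled, whichever API is available."""
    with inference_ctx():
        return fn(*args)


# Callers use the shim and never check the PyTorch version themselves.
result = run_without_grad(lambda x: x * 2, torch.ones(3, requires_grad=True))
```

The design point is that version checks live in a single module, so the rest of the codebase imports the resolved name and stays free of conditionals.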

Non-contiguous tensor handling: When tensors are views or slices, they may not occupy contiguous memory. This can cause issues with certain CUDA operations. The compatibility layer ensures tensors are made contiguous before device transfers when necessary.
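The contiguity issue can be seen on CPU tensors without any GPU. The short sketch below uses only standard PyTorch calls (`is_contiguous`, `contiguous`); the variable names are illustrative.

```python
import torch

base = torch.arange(12).reshape(3, 4)   # freshly created, contiguous
view = base.t()                         # transposed view sharing base's storage

# The view reinterprets the same memory with different strides,
# so its elements are not laid out in row-major order.
view_is_contiguous = view.is_contiguous()

# .contiguous() copies the data into a compact row-major buffer.
packed = view.contiguous()

# For a tensor that is already contiguous, .contiguous() makes no copy.
same_storage = base.contiguous().data_ptr() == base.data_ptr()
```

Calling `.contiguous()` unconditionally before a transfer is therefore cheap in the common case and only pays for a copy when the tensor is a strided view.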

Exception recovery: When peer-to-peer copy fails, the system automatically retries the transfer via CPU staging (tensor.cpu().to(target_device)), which is slower but universally supported.
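One way to inspect which device pairs report peer access is PyTorch's `torch.cuda.can_device_access_peer`. The sketch below is an assumption about how such a probe could look, not ExLlamaV2 code, and `peer_access_matrix` is a hypothetical name. A positive answer from the driver may not rule out the silent failures mentioned above, so a probe like this complements the CPU-staged fallback rather than replacing it.

```python
import torch


def peer_access_matrix():
    """Map each ordered pair of distinct CUDA devices to its reported peer access.

    Returns an empty dict when no CUDA device is visible.
    """
    count = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return {
        (src, dst): torch.cuda.can_device_access_peer(src, dst)
        for src in range(count)
        for dst in range(count)
        if src != dst
    }


matrix = peer_access_matrix()
for (src, dst), ok in sorted(matrix.items()):
    print(f"cuda:{src} -> cuda:{dst}: peer access {'yes' if ok else 'no'}")
```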

Usage

Multi-GPU compatibility is used when:

Auto-split loading: Distributing model layers across GPUs requires moving weight tensors between devices
Tensor parallelism: TP gather/scatter operations move partial results between devices
KV cache management: Cache pages may need to be migrated between devices during dynamic batching
Cross-version deployment: Running the same ExLlamaV2 installation across different PyTorch versions

Theoretical Basis

Safe Tensor Movement

# safe_move_tensor(tensor, target_device):
# 1. If tensor.device == target_device: return tensor
# 2. Try: return tensor.to(target_device)
# 3. On RuntimeError (peer access failure):
#    return tensor.cpu().to(target_device)
# The CPU-staged fallback adds latency but does not depend on peer access
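The pseudocode above can be sketched as runnable Python. This is a minimal sketch under stated assumptions, not the library's implementation: the function is named `safe_move_tensor_sketch` to keep it distinct from any real ExLlamaV2 function, and it only handles failures that surface as a `RuntimeError`. Silent failures would need an additional check, which is not shown here.

```python
import torch


def safe_move_tensor_sketch(tensor, target_device):
    """Move tensor to target_device, staging through CPU if the direct move raises."""
    target_device = torch.device(target_device)

    # Step 1: nothing to do if the tensor already lives on the target device.
    if tensor.device == target_device:
        return tensor

    try:
        # Step 2: direct transfer (may use peer-to-peer between CUDA devices).
        return tensor.to(target_device)
    except RuntimeError:
        # Step 3: fall back to a CPU-staged copy. RuntimeError is broad and may
        # also cover unrelated CUDA errors, which this sketch does not distinguish.
        return tensor.cpu().to(target_device)


x = torch.ones(2, 3)
same = safe_move_tensor_sketch(x, "cpu")   # already on CPU, returned unchanged

if torch.cuda.device_count() >= 2:
    # Only exercised on machines with at least two CUDA devices.
    on_gpu0 = safe_move_tensor_sketch(x, "cuda:0")
    on_gpu1 = safe_move_tensor_sketch(on_gpu0, "cuda:1")
```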

GPU Peer Copy Protocol

# gpu_peer_copy(target, source):
# Precondition: target and source on different CUDA devices
# 1. Ensure both tensors are contiguous
# 2. Try direct copy: target.copy_(source)
# 3. On failure: target.copy_(source.cpu())
# Used for weight transfers during model loading
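The copy protocol above could look like the following in Python. This is a hedged sketch with a hypothetical name (`gpu_peer_copy_sketch`), not the ExLlamaV2 function. One assumption is made explicit: the target is required to be contiguous already, because calling `.contiguous()` on a non-contiguous target would produce a new tensor and the copy would never reach the original. The demonstration uses CPU tensors only so that it runs anywhere; the intended use is between two CUDA devices.

```python
import torch


def gpu_peer_copy_sketch(target, source):
    """Copy source into target in place, staging through CPU if the direct copy raises."""
    # A non-contiguous target cannot be fixed by .contiguous(), since that would
    # allocate a separate tensor instead of writing into the caller's buffer.
    if not target.is_contiguous():
        raise ValueError("target must be contiguous")

    # Making the source contiguous is safe: it is only read from.
    source = source.contiguous()

    try:
        target.copy_(source)          # direct device-to-device copy
    except RuntimeError:
        target.copy_(source.cpu())    # CPU-staged fallback
    return target


dst = torch.zeros(4, 3)
src = torch.arange(12, dtype=torch.float32).reshape(3, 4).t()  # non-contiguous view
gpu_peer_copy_sketch(dst, src)
```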

Related Pages

Implemented By

Implementation:Turboderp_org_Exllamav2_Compat
