Principle:Turboderp org Exllamav2 Multi GPU Compatibility

From Leeroopedia
Revision as of 17:50, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Turboderp_org_Exllamav2_Multi_GPU_Compatibility.md)
Knowledge Sources
Domains GPU_Computing, Compatibility, Multi_GPU
Last Updated 2026-02-15 00:00 GMT

Overview

Multi-GPU compatibility provides safe abstractions for tensor movement between CUDA devices, handling edge cases in peer-to-peer access and providing backward-compatible wrappers for PyTorch API changes.

Description

Running models across multiple GPUs requires moving tensors between devices. This is complicated by several factors:

  • Peer-to-peer transfer failures: Direct GPU-to-GPU memory copies can fail silently or raise exceptions on certain hardware configurations (e.g., different GPU architectures, PCIe topology limitations). Safe tensor movement functions catch the exceptions and fall back to CPU-staged transfers.
  • PyTorch version compatibility: PyTorch APIs change between releases. Compatibility shims wrap these differences in one place, so ExLlamaV2 can support multiple PyTorch versions without version checks scattered throughout the codebase. Examples include changes to CUDA graph APIs and tensor manipulation functions.
  • Non-contiguous tensor handling: When tensors are views or slices, they may not occupy contiguous memory. This can cause issues with certain CUDA operations. The compatibility layer ensures tensors are made contiguous before device transfers when necessary.
  • Exception recovery: When a peer-to-peer copy fails, the transfer is retried via CPU staging (tensor.cpu().to(target_device)), which is slower but does not rely on peer access between the two GPUs.
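
The shim idea from the PyTorch version compatibility bullet can be sketched without reference to any particular PyTorch API. The helper names below (`parse_version`, `first_available`) are hypothetical and are not taken from ExLlamaV2; the sketch only illustrates how version checks and feature detection might be kept in a single module.

```python
import math
import re


def parse_version(version: str) -> tuple:
    """Turn a string such as '2.3.1+cu121' into (2, 3, 1).

    Hypothetical helper: the local build suffix after '+' is ignored.
    """
    release = version.split("+", 1)[0]
    parts = []
    for piece in release.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def first_available(module, *names):
    """Return the first attribute of `module` that exists among `names`.

    Callers import the result from one place instead of repeating
    hasattr() checks wherever the function is used.
    """
    for name in names:
        if hasattr(module, name):
            return getattr(module, name)
    raise AttributeError(f"none of {names} found in {module.__name__}")


# Demonstration on the standard library: 'no_such_function' is a made-up
# name standing in for an API that the installed version does not provide.
sqrt = first_available(math, "no_such_function", "sqrt")
```

With PyTorch installed, `parse_version(torch.__version__) >= (2, 1)` would be one way to gate a code path on the release, while `first_available` suits cases where a function was renamed or moved.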

Usage

Multi-GPU compatibility is used when:

  • Auto-split loading: Distributing model layers across GPUs requires moving weight tensors between devices
  • Tensor parallelism: TP gather/scatter operations move partial results between devices
  • KV cache management: Cache pages may need to be migrated between devices during dynamic batching
  • Cross-version deployment: Running the same ExLlamaV2 installation across different PyTorch versions
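
For any of the scenarios above, it can help to know in advance which device pairs report peer access. The following is a minimal sketch, assuming PyTorch is installed. `peer_access_matrix` is a hypothetical helper name, while `torch.cuda.can_device_access_peer` is a PyTorch call. A positive report does not rule out a failed copy at run time, so a probe like this complements the fallback rather than replacing it.

```python
import torch


def peer_access_matrix() -> dict:
    """Map each ordered pair (src, dst) of distinct CUDA devices to a bool.

    Returns an empty dict when fewer than two CUDA devices are visible.
    """
    if not torch.cuda.is_available():
        return {}
    count = torch.cuda.device_count()
    matrix = {}
    for src in range(count):
        for dst in range(count):
            if src != dst:
                matrix[(src, dst)] = torch.cuda.can_device_access_peer(src, dst)
    return matrix


if __name__ == "__main__":
    for (src, dst), ok in peer_access_matrix().items():
        print(f"cuda:{src} -> cuda:{dst}: peer access {'yes' if ok else 'no'}")
```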

Theoretical Basis

Safe Tensor Movement

# safe_move_tensor(tensor, target_device):
# 1. If tensor.device == target_device: return tensor
# 2. Try: return tensor.to(target_device)
# 3. On RuntimeError (peer access failure):
#    return tensor.cpu().to(target_device)
# The fallback adds latency but does not depend on peer-to-peer access
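
The steps above can be written as runnable PyTorch. This is a sketch under stated assumptions, not the ExLlamaV2 source: the function name follows the pseudocode, and the signature and the choice to catch only `RuntimeError` are assumptions.

```python
import torch


def safe_move_tensor(tensor: torch.Tensor, target_device) -> torch.Tensor:
    """Move `tensor` to `target_device`, staging through the CPU on failure."""
    target_device = torch.device(target_device)
    if tensor.device == target_device:
        return tensor  # step 1: already on the target, nothing to copy
    try:
        return tensor.to(target_device)  # step 2: direct transfer
    except RuntimeError:
        # step 3: stage through host memory
        return tensor.cpu().to(target_device)


x = torch.arange(6, dtype=torch.float32).reshape(2, 3)
same = safe_move_tensor(x, "cpu")  # same device, so the input is returned

# The cross-device path only runs when at least two GPUs are visible.
if torch.cuda.device_count() >= 2:
    on_gpu0 = safe_move_tensor(x, "cuda:0")
    on_gpu1 = safe_move_tensor(on_gpu0, "cuda:1")
```

Because step 1 returns the input itself, callers should not assume the result is a fresh copy.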

GPU Peer Copy Protocol

# gpu_peer_copy(target, source):
# Precondition: target and source on different CUDA devices
# 1. Ensure both tensors are contiguous
# 2. Try direct copy: target.copy_(source)
# 3. On failure: target.copy_(source.cpu())
# Used for weight transfers during model loading
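
A runnable sketch of the same protocol, assuming PyTorch. The function name follows the pseudocode, but the body is an illustration, not the ExLlamaV2 source. Two choices here are assumptions: the explicit shape check, and making only the source contiguous. The precondition on distinct CUDA devices is not enforced, so the example also runs on a CPU-only machine.

```python
import torch


def gpu_peer_copy(target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
    """Copy `source` into `target` in place, staging through the CPU on failure."""
    if target.shape != source.shape:
        raise ValueError("target and source must have the same shape")
    # Only the source is made contiguous: calling .contiguous() on a
    # non-contiguous target would return a new tensor, and the copy would
    # then never reach the original target.
    source = source.contiguous()
    try:
        target.copy_(source)        # direct copy
    except RuntimeError:
        target.copy_(source.cpu())  # fallback: stage through host memory
    return target


# A transposed view is a typical non-contiguous tensor.
source = torch.arange(12, dtype=torch.float32).reshape(4, 3).t()
target = torch.zeros(3, 4)
gpu_peer_copy(target, source)
```

The copy is in place, so `target` must already be allocated on the destination device with the right shape and dtype.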
