Implementation:Turboderp org Exllamav2 Compat

From Leeroopedia
Knowledge Sources
Domains: GPU_Management, Compatibility
Last Updated: 2026-02-15 00:00 GMT

Overview

Cross-GPU tensor compatibility module that moves tensors safely between CUDA devices. It works around driver-level peer-to-peer (P2P) copy failures by testing P2P capability once per device pair and, when the direct path is broken, falling back to a copy staged through CPU memory.

Description

compat.py contains two primary functions for reliable multi-GPU tensor transfers:

  • test_gpu_peer_copy(device_a, device_b) -- Tests whether direct GPU-to-GPU peer copy works between two CUDA devices. It creates a test tensor with known values on device_a, copies it to device_b and back, then verifies the round-trip. Results are cached in a global matrix (tested_peer_copy) using tri-state values: 0 (untested), 1 (P2P works), -1 (P2P broken). The device indices are always ordered so that idx_a <= idx_b to avoid duplicate tests.
  • safe_move_tensor(tensor, device, non_blocking=False) -- Moves a tensor (or tuple of tensors) to the target device using the most efficient available path. The movement strategy follows this priority:
    • No-op -- If the tensor is already on the target device, return immediately.
    • CPU transfers -- Copies to/from system RAM always use tensor.to() directly, optionally using CUDA streams for asynchronous transfer with explicit synchronisation.
    • P2P GPU transfer -- If test_gpu_peer_copy confirms direct copy works, uses tensor.to() with CUDA streams on both source and destination devices.
    • CPU fallback -- If P2P copy is broken, the tensor is first moved to CPU with synchronisation, then from CPU to the target GPU.
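The priority order above can be modelled without any GPU. The following is a minimal sketch, not the module's actual code: choose_transfer_path and fake_probe are hypothetical names introduced for illustration, devices are represented as plain strings, and the probe stands in for test_gpu_peer_copy.

```python
from typing import Callable, List


def choose_transfer_path(
    src: str,
    dst: str,
    p2p_ok: Callable[[str, str], bool],
) -> List[str]:
    """Return the sequence of devices a tensor is copied to on its way from src to dst."""
    if src == dst:
        return []                 # no-op: already on the target device
    if src == "cpu" or dst == "cpu":
        return [dst]              # host transfers: one direct copy
    if p2p_ok(src, dst):
        return [dst]              # direct GPU-to-GPU copy
    return ["cpu", dst]           # fallback: stage through system RAM


# Hypothetical probe (assumption): pretend P2P is broken between cuda:0 and cuda:2.
def fake_probe(a: str, b: str) -> bool:
    return {a, b} != {"cuda:0", "cuda:2"}


print(choose_transfer_path("cuda:0", "cuda:0", fake_probe))  # []
print(choose_transfer_path("cuda:0", "cuda:1", fake_probe))  # ['cuda:1']
print(choose_transfer_path("cuda:0", "cuda:2", fake_probe))  # ['cpu', 'cuda:2']
```

The sketch covers only the routing decision. Stream handling and synchronisation, which the real function performs around each copy, are deliberately left out.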

The module also provides a pairwise() polyfill for Python versions below 3.10, emulating itertools.pairwise using itertools.tee.
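For reference, a tee-based polyfill of this kind usually follows the recipe from the Python itertools documentation. The sketch below uses the hypothetical name pairwise_polyfill and is an assumption about the general shape, not a copy of the code in compat.py.

```python
import itertools


def pairwise_polyfill(iterable):
    """Yield overlapping pairs: s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = itertools.tee(iterable)  # two independent iterators over the same data
    next(b, None)                   # advance the second iterator by one element
    return zip(a, b)                # zip stops when the shorter iterator is exhausted


# Prefer the built-in on Python 3.10+, fall back to the polyfill otherwise.
try:
    from itertools import pairwise
except ImportError:
    pairwise = pairwise_polyfill

print(list(pairwise_polyfill([0, 1, 2, 3])))  # [(0, 1), (1, 2), (2, 3)]
```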

Usage

Use safe_move_tensor whenever moving tensors between CUDA devices in a multi-GPU setup. It is called internally throughout ExLlamaV2 during model loading, autosplit distribution, and inference to guarantee correct data transfer regardless of the GPU interconnect topology.

Code Reference

Source Location

Signature

def test_gpu_peer_copy(
    device_a: torch.Device,
    device_b: torch.Device
) -> bool:
    ...

def safe_move_tensor(
    tensor: torch.Tensor | tuple[torch.Tensor],
    device: torch.Device | str | int,
    non_blocking: bool = False
) -> torch.Tensor | tuple[torch.Tensor]:
    ...

Import

from exllamav2.compat import safe_move_tensor, test_gpu_peer_copy

I/O Contract

Inputs

Name | Type | Required | Description
device_a | torch.Device | Yes (test_gpu_peer_copy) | First CUDA device to test for peer-to-peer copy
device_b | torch.Device | Yes (test_gpu_peer_copy) | Second CUDA device to test for peer-to-peer copy
tensor | torch.Tensor or tuple[torch.Tensor] | Yes (safe_move_tensor) | The tensor(s) to move to the target device
device | torch.Device, str, or int | Yes (safe_move_tensor) | Target device specification (e.g., "cuda:1", torch.device("cuda:0"), or integer index)
non_blocking | bool | No (default False) | If True, allows asynchronous transfers without explicit synchronisation on the P2P path

Outputs

Name | Type | Description
peer_copy_ok | bool | From test_gpu_peer_copy: True if direct GPU-to-GPU copy succeeded, False otherwise
moved_tensor | torch.Tensor or tuple[torch.Tensor] | From safe_move_tensor: the tensor(s) on the target device with identical data

Usage Examples

Safe Multi-GPU Tensor Transfer

import torch
from exllamav2.compat import safe_move_tensor

# Requires at least two visible CUDA devices (cuda:0 and cuda:1)

# Move a tensor from GPU 0 to GPU 1 safely
tensor_gpu0 = torch.randn(1024, 4096, device="cuda:0")
tensor_gpu1 = safe_move_tensor(tensor_gpu0, "cuda:1")

# Move a tuple of tensors
weight, bias = torch.randn(4096, device="cuda:0"), torch.randn(4096, device="cuda:0")
weight_gpu1, bias_gpu1 = safe_move_tensor((weight, bias), "cuda:1")

Testing Peer-to-Peer Copy Capability

from exllamav2.compat import test_gpu_peer_copy
import torch

device_0 = torch.device("cuda:0")
device_1 = torch.device("cuda:1")

if test_gpu_peer_copy(device_0, device_1):
    print("Direct P2P copy between GPU 0 and GPU 1 is supported")
else:
    print("P2P copy failed; transfers will route through CPU")
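Because results are cached, only the first call for a given device pair performs a real round-trip copy. The caching behaviour described for test_gpu_peer_copy (tri-state values and ordered indices) can be illustrated with a small GPU-free model. This is a sketch under stated assumptions: PeerCopyCache, its probe argument, and the probe_calls counter are hypothetical and exist only for illustration, and the real module stores its state in the global tested_peer_copy matrix instead.

```python
from typing import Callable

UNTESTED, WORKS, BROKEN = 0, 1, -1  # tri-state values as described for tested_peer_copy


class PeerCopyCache:
    """Caches one P2P test result per unordered device pair."""

    def __init__(self, num_devices: int, probe: Callable[[int, int], bool]):
        self._state = [[UNTESTED] * num_devices for _ in range(num_devices)]
        self._probe = probe      # stands in for the real round-trip copy test
        self.probe_calls = 0     # counts how often the expensive test actually ran

    def check(self, idx_a: int, idx_b: int) -> bool:
        # Order the indices so (a, b) and (b, a) share one cache entry.
        if idx_a > idx_b:
            idx_a, idx_b = idx_b, idx_a
        state = self._state[idx_a][idx_b]
        if state == UNTESTED:
            self.probe_calls += 1
            state = WORKS if self._probe(idx_a, idx_b) else BROKEN
            self._state[idx_a][idx_b] = state
        return state == WORKS


# Hypothetical probe (assumption): P2P is broken only between devices 0 and 2.
cache = PeerCopyCache(3, probe=lambda a, b: (a, b) != (0, 2))
print(cache.check(0, 1), cache.check(1, 0))  # True True
print(cache.check(2, 0))                     # False
print(cache.probe_calls)                     # 2
```

The second call for the pair (0, 1) is answered from the cache, which is why three lookups trigger only two probes.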

Related Pages

Implements Principle

Requires Environment

Used By
