Implementation:Turboderp org Exllamav2 Compat

Knowledge Sources	Turboderp_org_Exllamav2
Domains	GPU_Management, Compatibility
Last Updated	2026-02-15 00:00 GMT

Overview

Cross-GPU tensor compatibility module that provides safe tensor movement between CUDA devices, working around driver-level peer-to-peer copy failures by testing P2P capability and falling back through CPU when necessary.

Description

compat.py contains two primary functions for reliable multi-GPU tensor transfers:

test_gpu_peer_copy(device_a, device_b) -- Tests whether direct GPU-to-GPU peer copy works between two CUDA devices. It creates a test tensor with known values on device_a, copies it to device_b and back, then verifies the round-trip. Results are cached in a global matrix (tested_peer_copy) using tri-state values: 0 (untested), 1 (P2P works), -1 (P2P broken). The device indices are always ordered so that idx_a <= idx_b to avoid duplicate tests.
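
The tri-state caching described above can be illustrated with a small sketch. This is not the code in compat.py: the PeerCopyCache class, its check method, and the probe callback are hypothetical names introduced here, and the probe stands in for the real round-trip copy so the example runs without any GPU.

```python
from typing import Callable, List

# Tri-state values, matching the description above.
UNTESTED, WORKS, BROKEN = 0, 1, -1


class PeerCopyCache:
    """Hypothetical stand-in for the global tested_peer_copy matrix."""

    def __init__(self, num_devices: int) -> None:
        self.state: List[List[int]] = [
            [UNTESTED] * num_devices for _ in range(num_devices)
        ]

    def check(self, idx_a: int, idx_b: int,
              probe: Callable[[int, int], bool]) -> bool:
        # Order the pair so that (1, 0) and (0, 1) share one cache slot.
        if idx_a > idx_b:
            idx_a, idx_b = idx_b, idx_a
        cached = self.state[idx_a][idx_b]
        if cached == UNTESTED:
            # Run the (assumed) round-trip probe once and remember the result.
            cached = WORKS if probe(idx_a, idx_b) else BROKEN
            self.state[idx_a][idx_b] = cached
        return cached == WORKS


# Usage: a fake probe that pretends device 2 cannot do peer copies.
probe_calls = []

def fake_probe(a: int, b: int) -> bool:
    probe_calls.append((a, b))
    return 2 not in (a, b)

cache = PeerCopyCache(num_devices=3)
first = cache.check(1, 0, fake_probe)    # probes the pair (0, 1)
second = cache.check(0, 1, fake_probe)   # served from the cache
third = cache.check(2, 0, fake_probe)    # probes the pair (0, 2)
```

Ordering the indices before the lookup means only the upper triangle of the matrix is ever written, which is why a pair is probed at most once regardless of argument order.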

safe_move_tensor(tensor, device, non_blocking=False) -- Moves a tensor (or tuple of tensors) to the target device using the most efficient available path. The movement strategy follows this priority:
- No-op -- If the tensor is already on the target device, return immediately.
- CPU transfers -- Copies to/from system RAM always use tensor.to() directly, optionally using CUDA streams for asynchronous transfer with explicit synchronisation.
- P2P GPU transfer -- If test_gpu_peer_copy confirms direct copy works, uses tensor.to() with CUDA streams on both source and destination devices.
- CPU fallback -- If P2P copy is broken, the tensor is first moved to CPU with synchronisation, then from CPU to the target GPU.
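
The priority order above can be sketched as a pure decision function. This is an illustration only, under the assumption that devices are given as strings such as "cpu" or "cuda:0"; plan_move and its p2p_ok callback are hypothetical names, and the sketch returns the sequence of hops rather than moving any real tensor.

```python
from typing import Callable, List


def plan_move(src: str, dst: str,
              p2p_ok: Callable[[str, str], bool]) -> List[str]:
    """Return the devices a tensor would be copied to, in order."""
    if src == dst:
        return []              # 1. No-op: already on the target device
    if src == "cpu" or dst == "cpu":
        return [dst]           # 2. CPU transfers: one direct copy
    if p2p_ok(src, dst):
        return [dst]           # 3. P2P GPU transfer: one direct copy
    return ["cpu", dst]        # 4. CPU fallback: bounce through system RAM


# Usage: pretend peer copy is broken for any pair involving cuda:2.
def fake_p2p_ok(a: str, b: str) -> bool:
    return "cuda:2" not in (a, b)

same_device = plan_move("cuda:0", "cuda:0", fake_p2p_ok)
to_host = plan_move("cuda:1", "cpu", fake_p2p_ok)
direct = plan_move("cuda:0", "cuda:1", fake_p2p_ok)
fallback = plan_move("cuda:0", "cuda:2", fake_p2p_ok)
```

In the fallback case the result has two hops, which corresponds to the two copies described in the last bullet: GPU to CPU, then CPU to the target GPU.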

The module also provides a pairwise() polyfill for Python versions below 3.10, emulating itertools.pairwise using itertools.tee.
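
A polyfill of this kind is commonly written as follows. This is a minimal sketch of the itertools.tee approach and may differ from the exact code in compat.py; the try/except import guard is an assumption about how the version check is done, and the layers list is made-up sample data.

```python
from itertools import tee

try:
    # Built in from Python 3.10 onwards.
    from itertools import pairwise
except ImportError:
    def pairwise(iterable):
        """Yield overlapping pairs: s -> (s0, s1), (s1, s2), ..."""
        a, b = tee(iterable)
        next(b, None)  # advance the second iterator by one element
        return zip(a, b)


# Usage: iterate over adjacent elements of a sequence.
layers = ["embed", "attn", "mlp", "head"]
adjacent = list(pairwise(layers))
```

Here adjacent holds one pair per neighbouring couple of elements, so a sequence of n items yields n - 1 pairs and an empty or single-item input yields none.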

Usage

Use safe_move_tensor whenever moving tensors between CUDA devices in a multi-GPU setup. It is called internally throughout ExLlamaV2 during model loading, autosplit distribution, and inference to guarantee correct data transfer regardless of the GPU interconnect topology.

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: exllamav2/compat.py
Lines: L1-141

Signature

def test_gpu_peer_copy(
    device_a: torch.device,
    device_b: torch.device
) -> bool:
    ...

def safe_move_tensor(
    tensor: torch.Tensor | tuple[torch.Tensor, ...],
    device: torch.device | str | int,
    non_blocking: bool = False
) -> torch.Tensor | tuple[torch.Tensor, ...]:
    ...

Import

from exllamav2.compat import safe_move_tensor, test_gpu_peer_copy

I/O Contract

Inputs

Name	Type	Required	Description
device_a	torch.device	Yes (test_gpu_peer_copy)	First CUDA device to test for peer-to-peer copy
device_b	torch.device	Yes (test_gpu_peer_copy)	Second CUDA device to test for peer-to-peer copy
tensor	torch.Tensor or tuple[torch.Tensor, ...]	Yes (safe_move_tensor)	The tensor(s) to move to the target device
device	torch.device, str, or int	Yes (safe_move_tensor)	Target device specification (e.g., "cuda:1", torch.device("cuda:0"), or integer index)
non_blocking	bool	No (default False)	If True, allows asynchronous transfers without explicit synchronisation on the P2P path

Outputs

Name	Type	Description
peer_copy_ok	bool	From test_gpu_peer_copy: True if direct GPU-to-GPU copy succeeded, False otherwise
moved_tensor	torch.Tensor or tuple[torch.Tensor, ...]	From safe_move_tensor: the tensor(s) on the target device with identical data

Usage Examples

Safe Multi-GPU Tensor Transfer

from exllamav2.compat import safe_move_tensor
import torch

# Requires at least two visible CUDA devices (cuda:0 and cuda:1).

# Move a tensor from GPU 0 to GPU 1 safely
tensor_gpu0 = torch.randn(1024, 4096, device="cuda:0")
tensor_gpu1 = safe_move_tensor(tensor_gpu0, "cuda:1")

# Move a tuple of tensors
weight, bias = torch.randn(4096, device="cuda:0"), torch.randn(4096, device="cuda:0")
weight_gpu1, bias_gpu1 = safe_move_tensor((weight, bias), "cuda:1")

Testing Peer-to-Peer Copy Capability

from exllamav2.compat import test_gpu_peer_copy
import torch

device_0 = torch.device("cuda:0")
device_1 = torch.device("cuda:1")

if test_gpu_peer_copy(device_0, device_1):
    print("Direct P2P copy between GPU 0 and GPU 1 is supported")
else:
    print("P2P copy failed; transfers will route through CPU")

Related Pages

Implements Principle

Principle:Turboderp_org_Exllamav2_Multi_GPU_Compatibility

Requires Environment

Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
