Implementation:Turboderp org Exllamav2 Ext TP

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Tensor_Parallelism, CUDA, Multi_GPU
Last Updated	2026-02-15 00:00 GMT

Overview

C++ extension providing the tensor parallelism communication infrastructure for multi-GPU inference, including context management, broadcast, gather, barrier synchronization, and all-reduce operations.

Description

ext_tp.cpp implements the low-level multi-GPU communication primitives used by ExLlamaV2's tensor-parallel inference mode. All inter-device data movement flows through pinned host memory buffers, avoiding the need for NCCL or peer-to-peer GPU access.

ExtTPContext Class

The ExtTPContext class encapsulates the full tensor-parallel runtime state:

Split configurations -- Five split types define how tensors are partitioned across devices:
- kv_split -- Key/value head splits for attention
- id_split -- Identity/column splits for intermediate dimensions
- vc_split -- Vocabulary/column splits
- rs_split -- Row splits for reduction
- q_split -- Query head splits for attention

Each split is a vector of std::tuple<int, int, int> representing (device_id, offset, end_column).

Pinned memory buffers -- Host-pinned temporary tensors used as staging areas for device-to-host-to-device transfers.
CUDA streams and events -- Per-device streams for asynchronous operations and events for cross-device synchronization.
Thread pool -- When TP_MULTITHREADED is defined, a thread pool enables parallel execution across devices.
ExtTPData -- Host-mapped synchronization data structure allocated via cudaHostAlloc.

Communication Primitives

make_tp_context -- Creates an ExtTPContext from split configurations, pinned buffers, and stream handles. Returns an opaque uintptr_t handle.

tp_broadcast -- Copies a source tensor from one device to all target devices via pinned host memory. The source is first copied device-to-host, then host-to-device for each target. Uses the specified split type to determine target devices and offsets. Ends with a cross-device barrier.

tp_gather -- Collects partial results from multiple devices into a single buffer. Each device's contribution is copied via cudaMemcpy2DAsync (strided 2D copy) to the correct offset in the pinned host buffer, accounting for the split layout. Delegates to tp_gather_barrier with a null barrier.

tp_gather_barrier -- Extended version of tp_gather that supports an explicit Barrier for thread synchronization in multithreaded mode. After gathering to host, optionally redistributes the complete buffer back to all devices using a target split type.

tp_cross_device_barrier -- Synchronizes all devices using CUDA events. Each device records an event on its stream, then every device waits on every other device's event. Uses a staged approach with configurable stage indices from ExtTPData.

tp_all_reduce -- Performs a reduction (sum) across all device tensors via sequential host-mediated accumulation. Each device's tensor is added to a running sum in the residual buffers, flowing through pinned memory. After processing the last device, the final result is broadcast back to all devices.

Usage

Use make_tp_context during model initialization to set up the tensor-parallel runtime. The tp_broadcast, tp_gather, tp_all_reduce, and tp_cross_device_barrier functions are called internally by the tensor-parallel attention (tp_attn_forward_*) and MLP (tp_mlp_forward_) implementations. Direct usage is typically not required at the Python level.

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: exllamav2/exllamav2_ext/ext_tp.cpp
Lines: 1-495

Signature

// Context management
uintptr_t make_tp_context(
    std::vector<std::tuple<int, int, int>> kv_split,
    std::vector<std::tuple<int, int, int>> id_split,
    std::vector<std::tuple<int, int, int>> vc_split,
    std::vector<std::tuple<int, int, int>> rs_split,
    std::vector<std::tuple<int, int, int>> q_split,
    std::vector<torch::Tensor> pinned_temp,
    std::vector<uintptr_t> streams
);

void free_tp_context(uintptr_t tp_context);

// Communication primitives
void tp_broadcast(
    uintptr_t tp_context,
    int buffer,
    torch::Tensor source,
    int broadcast_type,
    const std::vector<torch::Tensor> &targets,
    int dim,
    int t_device
);

void tp_gather(
    uintptr_t tp_context,
    int buffer,
    const std::vector<torch::Tensor> &inputs,
    int broadcast_type,
    const std::vector<torch::Tensor> &targets,
    int broadcast_type_target,
    int dim,
    int t_device
);

void tp_gather_barrier(
    uintptr_t tp_context,
    int buffer,
    const std::vector<torch::Tensor> &inputs,
    int broadcast_type,
    const std::vector<torch::Tensor> &targets,
    int broadcast_type_target,
    int dim,
    int t_device,
    Barrier* barrier
);

void tp_cross_device_barrier(
    uintptr_t tp_context,
    int broadcast_type,
    int t_device,
    int stage = -1,
    int next_stage = -1
);

void tp_all_reduce(
    uintptr_t tp_context,
    int buffer,
    const std::vector<torch::Tensor> &tensors,
    const std::vector<torch::Tensor> &residuals
);

Import

from exllamav2.ext import exllamav2_ext as ext_c

I/O Contract

Inputs

Parameter	Type	Description
kv_split, id_split, vc_split, rs_split, q_split	`std::vector<std::tuple<int,int,int>>`	Split configs: (device_id, offset, end_column) per device
pinned_temp	`std::vector<torch::Tensor>`	Host-pinned staging buffers for inter-device transfers
streams	`std::vector<uintptr_t>`	Per-device CUDA stream handles (0/NULL for unused device slots)
buffer	int	Index into pinned_temp to use for this operation
source	`torch.Tensor`	Source tensor to broadcast (on any device)
broadcast_type	int	Split type selector (BROADCAST_KV, BROADCAST_ID, BROADCAST_VC, BROADCAST_RS, BROADCAST_Q)
targets / inputs	`std::vector<torch::Tensor>`	Per-device target or input tensors
t_device	int	Target device filter; -1 to process all devices
barrier	`Barrier*`	Thread synchronization barrier (nullptr for single-threaded)
tensors / residuals	`std::vector<torch::Tensor>`	Per-device tensors and residual accumulators (tp_all_reduce)

Outputs

Function	Return	Description
make_tp_context	`uintptr_t`	Opaque handle to the ExtTPContext object
tp_broadcast	void	Copies source tensor to all target tensors across devices
tp_gather	void	Collects partial tensors into combined host buffer, optionally redistributes
tp_gather_barrier	void	Same as tp_gather with explicit thread barrier support
tp_cross_device_barrier	void	Synchronizes all device streams via CUDA events
tp_all_reduce	void	Accumulates sum across all device tensors into residual buffers

Usage Examples

from exllamav2.ext import exllamav2_ext as ext_c

# Create tensor-parallel context during model init
tp_handle = ext_c.make_tp_context(
    kv_split=[(0, 0, 16), (1, 16, 32)],
    id_split=[(0, 0, 2048), (1, 2048, 4096)],
    vc_split=[(0, 0, 16000), (1, 16000, 32000)],
    rs_split=[(0, 0, 2048), (1, 2048, 4096)],
    q_split=[(0, 0, 16), (1, 16, 32)],
    pinned_temp=[pinned_buf_0, pinned_buf_1],
    streams=[stream_handle_0, stream_handle_1]
)

# Broadcast hidden states from device 0 to all devices
ext_c.tp_broadcast(
    tp_handle, buffer=0, source=hidden_states,
    broadcast_type=BROADCAST_ID, targets=per_device_buffers,
    dim=1, t_device=-1
)

# All-reduce partial outputs across devices
ext_c.tp_all_reduce(
    tp_handle, buffer=0,
    tensors=per_device_outputs,
    residuals=per_device_residuals
)

# Clean up
ext_c.free_tp_context(tp_handle)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment