Implementation:Turboderp org Exllamav2 Ext TP
| Knowledge Sources | |
|---|---|
| Domains | Tensor_Parallelism, CUDA, Multi_GPU |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
C++ extension providing the tensor parallelism communication infrastructure for multi-GPU inference, including context management, broadcast, gather, barrier synchronization, and all-reduce operations.
Description
ext_tp.cpp implements the low-level multi-GPU communication primitives used by ExLlamaV2's tensor-parallel inference mode. All inter-device data movement flows through pinned host memory buffers, avoiding the need for NCCL or peer-to-peer GPU access.
ExtTPContext Class
The ExtTPContext class encapsulates the full tensor-parallel runtime state:
- Split configurations -- Five split types define how tensors are partitioned across devices:
- kv_split -- Key/value head splits for attention
- id_split -- Identity/column splits for intermediate dimensions
- vc_split -- Vocabulary/column splits
- rs_split -- Row splits for reduction
- q_split -- Query head splits for attention
Each split is a vector of std::tuple<int, int, int> representing (device_id, offset, end_column).
- Pinned memory buffers -- Host-pinned temporary tensors used as staging areas for device-to-host-to-device transfers.
- CUDA streams and events -- Per-device streams for asynchronous operations and events for cross-device synchronization.
- Thread pool -- When
TP_MULTITHREADEDis defined, a thread pool enables parallel execution across devices. - ExtTPData -- Host-mapped synchronization data structure allocated via
cudaHostAlloc.
Communication Primitives
- make_tp_context -- Creates an ExtTPContext from split configurations, pinned buffers, and stream handles. Returns an opaque
uintptr_thandle.
- tp_broadcast -- Copies a source tensor from one device to all target devices via pinned host memory. The source is first copied device-to-host, then host-to-device for each target. Uses the specified split type to determine target devices and offsets. Ends with a cross-device barrier.
- tp_gather -- Collects partial results from multiple devices into a single buffer. Each device's contribution is copied via
cudaMemcpy2DAsync(strided 2D copy) to the correct offset in the pinned host buffer, accounting for the split layout. Delegates totp_gather_barrierwith a null barrier.
- tp_gather_barrier -- Extended version of tp_gather that supports an explicit Barrier for thread synchronization in multithreaded mode. After gathering to host, optionally redistributes the complete buffer back to all devices using a target split type.
- tp_cross_device_barrier -- Synchronizes all devices using CUDA events. Each device records an event on its stream, then every device waits on every other device's event. Uses a staged approach with configurable stage indices from
ExtTPData.
- tp_all_reduce -- Performs a reduction (sum) across all device tensors via sequential host-mediated accumulation. Each device's tensor is added to a running sum in the residual buffers, flowing through pinned memory. After processing the last device, the final result is broadcast back to all devices.
Usage
Use make_tp_context during model initialization to set up the tensor-parallel runtime. The tp_broadcast, tp_gather, tp_all_reduce, and tp_cross_device_barrier functions are called internally by the tensor-parallel attention (tp_attn_forward_*) and MLP (tp_mlp_forward_) implementations. Direct usage is typically not required at the Python level.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/exllamav2_ext/ext_tp.cpp
- Lines: 1-495
Signature
// Context management
uintptr_t make_tp_context(
std::vector<std::tuple<int, int, int>> kv_split,
std::vector<std::tuple<int, int, int>> id_split,
std::vector<std::tuple<int, int, int>> vc_split,
std::vector<std::tuple<int, int, int>> rs_split,
std::vector<std::tuple<int, int, int>> q_split,
std::vector<torch::Tensor> pinned_temp,
std::vector<uintptr_t> streams
);
void free_tp_context(uintptr_t tp_context);
// Communication primitives
void tp_broadcast(
uintptr_t tp_context,
int buffer,
torch::Tensor source,
int broadcast_type,
const std::vector<torch::Tensor> &targets,
int dim,
int t_device
);
void tp_gather(
uintptr_t tp_context,
int buffer,
const std::vector<torch::Tensor> &inputs,
int broadcast_type,
const std::vector<torch::Tensor> &targets,
int broadcast_type_target,
int dim,
int t_device
);
void tp_gather_barrier(
uintptr_t tp_context,
int buffer,
const std::vector<torch::Tensor> &inputs,
int broadcast_type,
const std::vector<torch::Tensor> &targets,
int broadcast_type_target,
int dim,
int t_device,
Barrier* barrier
);
void tp_cross_device_barrier(
uintptr_t tp_context,
int broadcast_type,
int t_device,
int stage = -1,
int next_stage = -1
);
void tp_all_reduce(
uintptr_t tp_context,
int buffer,
const std::vector<torch::Tensor> &tensors,
const std::vector<torch::Tensor> &residuals
);
Import
from exllamav2.ext import exllamav2_ext as ext_c
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| kv_split, id_split, vc_split, rs_split, q_split | std::vector<std::tuple<int,int,int>> |
Split configs: (device_id, offset, end_column) per device |
| pinned_temp | std::vector<torch::Tensor> |
Host-pinned staging buffers for inter-device transfers |
| streams | std::vector<uintptr_t> |
Per-device CUDA stream handles (0/NULL for unused device slots) |
| buffer | int | Index into pinned_temp to use for this operation |
| source | torch.Tensor |
Source tensor to broadcast (on any device) |
| broadcast_type | int | Split type selector (BROADCAST_KV, BROADCAST_ID, BROADCAST_VC, BROADCAST_RS, BROADCAST_Q) |
| targets / inputs | std::vector<torch::Tensor> |
Per-device target or input tensors |
| t_device | int | Target device filter; -1 to process all devices |
| barrier | Barrier* |
Thread synchronization barrier (nullptr for single-threaded) |
| tensors / residuals | std::vector<torch::Tensor> |
Per-device tensors and residual accumulators (tp_all_reduce) |
Outputs
| Function | Return | Description |
|---|---|---|
| make_tp_context | uintptr_t |
Opaque handle to the ExtTPContext object |
| tp_broadcast | void | Copies source tensor to all target tensors across devices |
| tp_gather | void | Collects partial tensors into combined host buffer, optionally redistributes |
| tp_gather_barrier | void | Same as tp_gather with explicit thread barrier support |
| tp_cross_device_barrier | void | Synchronizes all device streams via CUDA events |
| tp_all_reduce | void | Accumulates sum across all device tensors into residual buffers |
Usage Examples
from exllamav2.ext import exllamav2_ext as ext_c
# Create tensor-parallel context during model init
tp_handle = ext_c.make_tp_context(
kv_split=[(0, 0, 16), (1, 16, 32)],
id_split=[(0, 0, 2048), (1, 2048, 4096)],
vc_split=[(0, 0, 16000), (1, 16000, 32000)],
rs_split=[(0, 0, 2048), (1, 2048, 4096)],
q_split=[(0, 0, 16), (1, 16, 32)],
pinned_temp=[pinned_buf_0, pinned_buf_1],
streams=[stream_handle_0, stream_handle_1]
)
# Broadcast hidden states from device 0 to all devices
ext_c.tp_broadcast(
tp_handle, buffer=0, source=hidden_states,
broadcast_type=BROADCAST_ID, targets=per_device_buffers,
dim=1, t_device=-1
)
# All-reduce partial outputs across devices
ext_c.tp_all_reduce(
tp_handle, buffer=0,
tensors=per_device_outputs,
residuals=per_device_residuals
)
# Clean up
ext_c.free_tp_context(tp_handle)