Implementation:Turboderp org Exllamav2 Ext Bindings

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Python_Bindings, C_Extension
Last Updated	2026-02-15 00:00 GMT

Overview

PyBind11 module entry point that registers the 50+ C++ extension functions as Python-callable functions in the exllamav2_ext module.

Description

ext_bindings.cpp is the single PYBIND11_MODULE definition file that exposes the entire ExLlamaV2 C++ extension library to Python. It includes all extension headers and registers each function via m.def(). The bindings are organized by functional category:

Quantization

pack_rows_4 -- Pack tensor rows into 4-bit format
pack_columns -- Pack tensor columns
quantize -- Quantize tensor values
quantize_err -- Quantize with error tracking
quantize_range / quantize_range_inplace -- Range-based quantization
sim_anneal -- Simulated annealing for quantization optimization

Sampling

apply_rep_penalty -- Apply repetition penalty to logits
sample_basic -- Basic sampling from logit distribution
logit_filter_exclusive -- Exclusive token filtering on logits
fast_fill_cpu_ones_bool -- Fast CPU boolean tensor fill
fast_fadd_cpu / fast_copy_cpu -- Optimized CPU tensor operations
dump_profile_results -- Dump profiling data
partial_strings_match -- Partial string matching for guided generation
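
To make "repetition penalty" concrete, the sketch below applies the commonly used formulation in plain Python: positive logits of already-seen tokens are divided by the penalty and non-positive ones are multiplied by it. The name apply_rep_penalty_ref and its list-based signature are hypothetical. The real apply_rep_penalty binding works on tensors and may take further parameters, so treat this as an illustration of the idea rather than the extension's behavior.

```python
from typing import Iterable, List, Sequence


def apply_rep_penalty_ref(
    logits: Sequence[float], seen_ids: Iterable[int], penalty: float
) -> List[float]:
    """Return a copy of logits with a repetition penalty applied to seen token IDs."""
    out = list(logits)
    for token_id in set(seen_ids):  # penalize each seen token once
        if out[token_id] > 0:
            out[token_id] /= penalty  # shrink positive logits toward zero
        else:
            out[token_id] *= penalty  # push non-positive logits further down
    return out


# Tokens 0 and 1 were already generated; token 2 is untouched.
penalized = apply_rep_penalty_ref([2.0, -1.0, 0.5], seen_ids=[0, 1, 0], penalty=2.0)
print(penalized)
```

A penalty of 1.0 leaves the logits unchanged; values above 1.0 make repeated tokens less likely.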

Safetensors / Loader

stloader_read -- Read safetensors file
tensor_remap / tensor_remap_4bit -- Remap tensor layouts

Matrix Operations

make_q_matrix / make_q_matrix_split -- Create quantized matrices
free_q_matrix -- Free quantized matrix
reconstruct -- Dequantize to FP16
gemm_half_q_half / gemm_half_q_half_tp -- Quantized GEMM (single and TP)
matrix_fp16_to_q4 / matrix_q4_to_fp16 -- Q4 format conversions
make_group_map -- Build quantization group mapping

Attention

make_q_attn / free_q_attn -- Create/destroy attention layers
q_attn_forward_1 / q_attn_forward_2 -- Two-phase attention forward
q_attn_set_loras -- Configure attention LoRA adapters
tp_attn_forward_paged_ / tp_attn_forward_ -- Tensor-parallel attention
set_flash_attn_func -- Register Flash Attention function

MLP

make_q_mlp / free_q_mlp -- Create/destroy MLP layers
make_q_moe_mlp / free_q_moe_mlp -- Create/destroy MoE MLP layers
q_mlp_forward_ -- MLP forward pass
q_mlp_set_loras -- Configure MLP LoRA adapters
q_moe_mlp_forward_ -- MoE MLP forward pass
tp_mlp_forward_ -- Tensor-parallel MLP forward

Cache

fp16_to_fp8 / fp8_to_fp16 -- FP8 cache conversion
fp16_to_q_kv / q_to_fp16_kv -- Quantized KV cache conversion
count_match -- Cache prefix matching
cache_rotate -- Rotate cache entries
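
As an illustration of what cache prefix matching computes, the following pure-Python sketch counts how many leading token IDs two sequences share. The function name count_common_prefix and its list-based signature are hypothetical. The real count_match binding operates on tensors, and its exact arguments should be checked in the C++ source.

```python
from typing import Sequence


def count_common_prefix(a: Sequence[int], b: Sequence[int]) -> int:
    """Return the number of leading positions where a and b hold equal values."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


cached_ids = [1, 15, 27, 99, 4]
new_ids = [1, 15, 27, 42, 4, 8]

# Only the shared prefix of the cache can be reused for the new prompt.
reusable = count_common_prefix(cached_ids, new_ids)
print(reusable)
```

The match stops at the first differing position, so later coincidental matches (the 4 in both lists above) do not count.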

Hadamard

had_paley / had_paley2 -- Hadamard transform operations

GEMM

gemm_half_half_half -- Dense FP16 GEMM

Normalization

rms_norm / rms_norm_ / rms_norm_tp -- RMS normalization variants
layer_norm / layer_norm_ -- Layer normalization
head_norm / head_norm_ -- Per-head normalization
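
For readers unfamiliar with RMS normalization, here is a minimal pure-Python reference of the standard formula: each element is divided by the root mean square of its row, then scaled by a learned weight. The function name rms_norm_ref is hypothetical. The extension's rms_norm kernels operate on tensors, and details such as precision or argument order are not taken from the source.

```python
import math
from typing import List, Sequence


def rms_norm_ref(
    x: Sequence[float], weight: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Scale x by the reciprocal of its root mean square, then by a per-element weight."""
    if not x or len(x) != len(weight):
        raise ValueError("x and weight must be non-empty and of equal length")
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


row = [3.0, 4.0]
out = rms_norm_ref(row, [1.0, 1.0], eps=0.0)
print(out)
```

With unit weights and eps set to zero, the output row has a mean square of 1, which is a quick sanity check for any RMS norm implementation.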

RoPE

rope_ -- Rotary position embedding application
gen_mrope_pos_ids -- Generate multi-modal RoPE position IDs

Element-wise

softcap_ -- Soft capping of logits
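
Soft capping is commonly implemented as a scaled tanh that squeezes values smoothly into the range (-cap, cap). The sketch below shows that formulation in plain Python. Both the name softcap_ref and the assumption that softcap_ uses this exact formula are illustrative and not confirmed from the source. The trailing underscore in softcap_ suggests an in-place operation, whereas this sketch returns a new list.

```python
import math
from typing import List, Sequence


def softcap_ref(logits: Sequence[float], cap: float) -> List[float]:
    """Smoothly limit each value to the open interval (-cap, cap) via cap * tanh(x / cap)."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(v / cap) for v in logits]


capped = softcap_ref([-100.0, 0.0, 1.0, 100.0], cap=30.0)
print(capped)
```

Values much smaller than the cap pass through almost unchanged, while large values saturate just below the cap.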

Tensor Parallelism

make_tp_context / free_tp_context -- Create/destroy TP context
tp_broadcast / tp_gather -- Inter-device communication
tp_cross_device_barrier -- Device synchronization
tp_all_reduce -- Cross-device reduction

Usage

This file is compiled as part of the exllamav2_ext PyTorch C++ extension. All functions become available after importing the module. Each function is registered under its C++ name, so ext_c.make_q_matrix(...) in Python calls the C++ make_q_matrix(...) function, with pybind11 converting arguments and return values between the two languages.
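
Because every binding is an ordinary module attribute, the registered names can be discovered by introspection. The sketch below is a small hypothetical helper, list_bindings, demonstrated on a stand-in module so that it runs without the compiled extension. The stand-in and its lambdas are assumptions for illustration only.

```python
import types
from typing import List


def list_bindings(module: types.ModuleType) -> List[str]:
    """Return the sorted names of public callables exposed by a module."""
    return sorted(
        name
        for name in dir(module)
        if not name.startswith("_") and callable(getattr(module, name))
    )


# Stand-in module so the sketch runs without the compiled extension.
fake_ext = types.ModuleType("fake_ext")
fake_ext.make_q_matrix = lambda *args: 0
fake_ext.free_q_matrix = lambda handle: None
fake_ext.rms_norm = lambda *args: None

names = list_bindings(fake_ext)
print(names)
print("rms_norm" in names)
```

With the real extension installed, the same helper could be called on the module object obtained from the documented import to check whether a given binding exists in the installed build.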

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: exllamav2/exllamav2_ext/ext_bindings.cpp
Lines: 1-138

Signature

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
    // Quantization
    m.def("pack_rows_4", &pack_rows_4, "pack_rows_4");
    m.def("pack_columns", &pack_columns, "pack_columns");
    m.def("quantize", &quantize, "quantize");
    m.def("quantize_err", &quantize_err, "quantize_err");
    m.def("quantize_range", &quantize_range, "quantize_range");
    m.def("quantize_range_inplace", &quantize_range_inplace, "quantize_range_inplace");
    m.def("sim_anneal", &sim_anneal, "sim_anneal");

    // Sampling
    m.def("apply_rep_penalty", &apply_rep_penalty, "apply_rep_penalty");
    m.def("sample_basic", &sample_basic, "sample_basic");
    m.def("logit_filter_exclusive", &logit_filter_exclusive, "logit_filter_exclusive");
    // ... (50+ total bindings across all categories)

    // Matrix, Attention, MLP, Cache, Norm, RoPE, TP ...
}

Import

from exllamav2.ext import exllamav2_ext as ext_c

I/O Contract

Inputs

Category	Representative Functions	Input Types
Quantization	pack_rows_4, quantize, sim_anneal	`torch.Tensor` (various dtypes), int/float parameters
Sampling	apply_rep_penalty, sample_basic	`torch.Tensor` (kFloat/kHalf logits), penalty/temperature floats
Matrix	make_q_matrix, gemm_half_q_half	`torch.Tensor` (kInt weights, kHalf activations), `uintptr_t` handles
Attention	make_q_attn, q_attn_forward_1/2	`torch.Tensor` (kHalf), `uintptr_t` QMatrix handles, int dims
MLP	make_q_mlp, q_mlp_forward_	`torch.Tensor` (kHalf), `uintptr_t` QMatrix handles
Cache	fp16_to_fp8, fp16_to_q_kv	`torch.Tensor` (kHalf/kUInt8), int batch/offset params
TP	make_tp_context, tp_broadcast	Split tuples, `torch.Tensor` vectors, stream handles

Outputs

Category	Representative Functions	Output Types
Quantization	quantize, pack_rows_4	void (in-place modification of output tensors)
Sampling	sample_basic	`torch.Tensor` (sampled token IDs)
Matrix	make_q_matrix	`uintptr_t` (opaque handle)
Attention	make_q_attn	`uintptr_t` (opaque handle); forward functions are void (in-place)
MLP	make_q_mlp	`uintptr_t` (opaque handle); forward functions are void (in-place)
Cache	count_match	int (match length)
TP	make_tp_context	`uintptr_t` (opaque handle)
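
Since the make_* functions return opaque integer handles that the matching free_* functions release, callers have to pair the two calls themselves. One way to make that pairing hard to forget is a context manager, sketched below. The helper owned_handle and the fake_make / fake_free stand-ins are hypothetical and exist only so the example runs without the extension. ExLlamaV2's own Python classes may manage handle lifetimes differently.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def owned_handle(
    make: Callable[..., int], free: Callable[[int], None], *args
) -> Iterator[int]:
    """Create an opaque handle with make(*args) and always release it with free(handle)."""
    handle = make(*args)
    try:
        yield handle
    finally:
        free(handle)  # runs even if the body raises


# Hypothetical stand-ins that mimic a make_*/free_* pair by tracking live handles.
live = set()


def fake_make(tag: int) -> int:
    handle = 0x1000 + tag
    live.add(handle)
    return handle


def fake_free(handle: int) -> None:
    live.discard(handle)


with owned_handle(fake_make, fake_free, 7) as h:
    inside = h in live  # handle is valid inside the block
after = h in live  # handle has been released on exit
print(inside, after)
```

The handle is just an integer on the Python side, so nothing stops code from using it after it has been freed; scoping it to a with block keeps such use-after-free mistakes visible.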

Usage Examples

from exllamav2.ext import exllamav2_ext as ext_c

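# All bound C++ functions are accessible as attributes of the module.
# Argument lists below are abbreviated for illustration; see the C++ headers for exact signatures.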

# Quantization
ext_c.quantize(input_tensor, output_tensor, scale, qzero, maxq)

# Sampling
token_ids = ext_c.sample_basic(logits, temperature, top_k, top_p, typical, random)

# Matrix operations
q_handle = ext_c.make_q_matrix(q_weight, q_perm, q_invperm, q_scale, ...)
ext_c.gemm_half_q_half(input_fp16, q_handle, output_fp16, False)

# Attention
attn_handle = ext_c.make_q_attn(layernorm, layernorm_bias, ...)
ext_c.q_attn_forward_1(attn_handle, x, batch_size, q_len, ...)

# Normalization
ext_c.rms_norm(input_tensor, weight, output_tensor, epsilon)

# RoPE
ext_c.rope_(q_tensor, sin, cos, past_len, num_heads, head_dim)
