Implementation:Turboderp org Exllamav2 Ext Bindings
| Knowledge Sources | |
|---|---|
| Domains | Python_Bindings, C_Extension |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
pybind11 module entry point that registers the 50+ C++ extension functions as Python-callable functions in the exllamav2_ext module.
Description
ext_bindings.cpp is the single PYBIND11_MODULE definition file that exposes the entire ExLlamaV2 C++ extension library to Python. It includes all extension headers and registers each function via m.def(). The bindings are organized by functional category:
Quantization
- pack_rows_4 -- Pack tensor rows into 4-bit format
- pack_columns -- Pack tensor columns
- quantize -- Quantize tensor values
- quantize_err -- Quantize with error tracking
- quantize_range / quantize_range_inplace -- Range-based quantization
- sim_anneal -- Simulated annealing for quantization optimization
Sampling
- apply_rep_penalty -- Apply repetition penalty to logits
- sample_basic -- Basic sampling from logit distribution
- logit_filter_exclusive -- Exclusive token filtering on logits
- fast_fill_cpu_ones_bool -- Fast CPU boolean tensor fill
- fast_fadd_cpu / fast_copy_cpu -- Optimized CPU tensor operations
- dump_profile_results -- Dump profiling data
- partial_strings_match -- Partial string matching for guided generation
Safetensors / Loader
- stloader_read -- Read safetensors file
- tensor_remap / tensor_remap_4bit -- Remap tensor layouts
Matrix Operations
- make_q_matrix / make_q_matrix_split -- Create quantized matrices
- free_q_matrix -- Free quantized matrix
- reconstruct -- Dequantize to FP16
- gemm_half_q_half / gemm_half_q_half_tp -- Quantized GEMM (single-device and tensor-parallel)
- matrix_fp16_to_q4 / matrix_q4_to_fp16 -- Q4 format conversions
- make_group_map -- Build quantization group mapping
Attention
- make_q_attn / free_q_attn -- Create/destroy attention layers
- q_attn_forward_1 / q_attn_forward_2 -- Two-phase attention forward
- q_attn_set_loras -- Configure attention LoRA adapters
- tp_attn_forward_paged_ / tp_attn_forward_ -- Tensor-parallel attention
- set_flash_attn_func -- Register Flash Attention function
MLP
- make_q_mlp / free_q_mlp -- Create/destroy MLP layers
- make_q_moe_mlp / free_q_moe_mlp -- Create/destroy MoE MLP layers
- q_mlp_forward_ -- MLP forward pass
- q_mlp_set_loras -- Configure MLP LoRA adapters
- q_moe_mlp_forward_ -- MoE MLP forward pass
- tp_mlp_forward_ -- Tensor-parallel MLP forward
Cache
- fp16_to_fp8 / fp8_to_fp16 -- FP8 cache conversion
- fp16_to_q_kv / q_to_fp16_kv -- Quantized KV cache conversion
- count_match -- Cache prefix matching
- cache_rotate -- Rotate cache entries
Hadamard
- had_paley / had_paley2 -- Hadamard transform operations
GEMM
- gemm_half_half_half -- Dense FP16 GEMM
Normalization
- rms_norm / rms_norm_ / rms_norm_tp -- RMS normalization variants
- layer_norm / layer_norm_ -- Layer normalization
- head_norm / head_norm_ -- Per-head normalization
RoPE
- rope_ -- Rotary position embedding application
- gen_mrope_pos_ids -- Generate multi-modal RoPE position IDs
Element-wise
- softcap_ -- Soft capping of logits
Tensor Parallelism
- make_tp_context / free_tp_context -- Create/destroy TP context
- tp_broadcast / tp_gather -- Inter-device communication
- tp_cross_device_barrier -- Device synchronization
- tp_all_reduce -- Cross-device reduction
Usage
This file is compiled as part of the exllamav2_ext PyTorch C++ extension. All functions become available after importing the module, which the examples on this page alias as ext_c. Each function is registered under the same name as its C++ counterpart, so ext_c.make_q_matrix(...) in Python calls the C++ make_q_matrix(...) function directly.
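Because every binding is a plain module attribute, a quick smoke check after a build can confirm that the expected names are registered. The sketch below is only an illustration and not part of ExLlamaV2: `missing_bindings` is a hypothetical helper, and a stand-in namespace replaces the real module so the snippet runs without the compiled extension.

```python
from types import SimpleNamespace


def missing_bindings(module, names):
    """Return the names from `names` that are not callable attributes of `module`."""
    return [n for n in names if not callable(getattr(module, n, None))]


# A few names taken from the categories listed above.
EXPECTED = ["quantize", "sample_basic", "make_q_matrix", "free_q_matrix", "rms_norm"]

# Stand-in for the compiled module (hypothetical). With the real extension you
# would pass the object imported as `ext_c` instead.
fake_ext = SimpleNamespace(
    quantize=lambda *a: None,
    sample_basic=lambda *a: None,
    make_q_matrix=lambda *a: 0,
)

missing = missing_bindings(fake_ext, EXPECTED)
print(missing)
```

With the real module, an empty list would indicate that all of the checked names were registered.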
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/exllamav2_ext/ext_bindings.cpp
- Lines: 1-138
Signature
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
// Quantization
m.def("pack_rows_4", &pack_rows_4, "pack_rows_4");
m.def("pack_columns", &pack_columns, "pack_columns");
m.def("quantize", &quantize, "quantize");
m.def("quantize_err", &quantize_err, "quantize_err");
m.def("quantize_range", &quantize_range, "quantize_range");
m.def("quantize_range_inplace", &quantize_range_inplace, "quantize_range_inplace");
m.def("sim_anneal", &sim_anneal, "sim_anneal");
// Sampling
m.def("apply_rep_penalty", &apply_rep_penalty, "apply_rep_penalty");
m.def("sample_basic", &sample_basic, "sample_basic");
m.def("logit_filter_exclusive", &logit_filter_exclusive, "logit_filter_exclusive");
// ... (50+ total bindings across all categories)
// Matrix, Attention, MLP, Cache, Norm, RoPE, TP ...
}
Import
from exllamav2.ext import exllamav2_ext as ext_c
I/O Contract
Inputs
| Category | Representative Functions | Input Types |
|---|---|---|
| Quantization | pack_rows_4, quantize, sim_anneal | torch.Tensor (various dtypes), int/float parameters |
| Sampling | apply_rep_penalty, sample_basic | torch.Tensor (kFloat/kHalf logits), penalty/temperature floats |
| Matrix | make_q_matrix, gemm_half_q_half | torch.Tensor (kInt weights, kHalf activations), uintptr_t handles |
| Attention | make_q_attn, q_attn_forward_1/2 | torch.Tensor (kHalf), uintptr_t QMatrix handles, int dims |
| MLP | make_q_mlp, q_mlp_forward_ | torch.Tensor (kHalf), uintptr_t QMatrix handles |
| Cache | fp16_to_fp8, fp16_to_q_kv | torch.Tensor (kHalf/kUInt8), int batch/offset params |
| TP | make_tp_context, tp_broadcast | Split tuples, torch.Tensor vectors, stream handles |
Outputs
| Category | Representative Functions | Output Types |
|---|---|---|
| Quantization | quantize, pack_rows_4 | void (in-place modification of output tensors) |
| Sampling | sample_basic | torch.Tensor (sampled token IDs) |
| Matrix | make_q_matrix | uintptr_t (opaque handle) |
| Attention | make_q_attn | uintptr_t (opaque handle); forward functions are void (in-place) |
| MLP | make_q_mlp | uintptr_t (opaque handle); forward functions are void (in-place) |
| Cache | count_match | int (match length) |
| TP | make_tp_context | uintptr_t (opaque handle) |
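The opaque-handle rows above imply a create/use/free lifecycle: a make_* call returns an integer handle, and the matching free_* call releases it. One way to keep the two paired is a context manager. This is a sketch under assumptions: `q_handle` is a hypothetical helper, and it assumes the free_* function takes the handle as its only argument, which this page does not state. Stand-in functions are used so the snippet runs without the compiled extension.

```python
from contextlib import contextmanager


@contextmanager
def q_handle(make_fn, free_fn, *args):
    """Create an opaque handle with make_fn(*args) and always release it with free_fn."""
    handle = make_fn(*args)
    try:
        yield handle
    finally:
        # Assumption: the free function takes the handle as its only argument.
        free_fn(handle)


# Stand-ins for ext_c.make_q_matrix / ext_c.free_q_matrix (hypothetical behavior).
live = set()


def fake_make(*args):
    handle = 0x1000 + len(live)
    live.add(handle)
    return handle


def fake_free(handle):
    live.discard(handle)


with q_handle(fake_make, fake_free, "q_weight", "q_scale") as h:
    inside = h in live  # the handle is registered while the block runs

after = h in live  # the handle has been released once the block exits
```

The `finally` clause means the handle is released even if code inside the `with` block raises, which is the main reason to prefer this shape over a manual make/free pair.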
Usage Examples
from exllamav2.ext import exllamav2_ext as ext_c
# All registered C++ functions are accessible as attributes of the module.
# Calls containing `...` are abbreviated, not complete signatures.
# Quantization
ext_c.quantize(input_tensor, output_tensor, scale, qzero, maxq)
# Sampling
token_ids = ext_c.sample_basic(logits, temperature, top_k, top_p, typical, random)
# Matrix operations
q_handle = ext_c.make_q_matrix(q_weight, q_perm, q_invperm, q_scale, ...)
ext_c.gemm_half_q_half(input_fp16, q_handle, output_fp16, False)
# Attention
attn_handle = ext_c.make_q_attn(layernorm, layernorm_bias, ...)
ext_c.q_attn_forward_1(attn_handle, x, batch_size, q_len, ...)
# Normalization
ext_c.rms_norm(input_tensor, weight, output_tensor, epsilon)
# RoPE
ext_c.rope_(q_tensor, sin, cos, past_len, num_heads, head_dim)
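To sanity-check the values that `ext_c.rms_norm` writes into `output_tensor`, a slow reference can be handy. The sketch below uses the standard RMSNorm definition in pure Python. It is an assumption that the extension follows this definition exactly (precision and accumulation details of the kernel are not described on this page), and `rms_norm_ref` is a hypothetical name.

```python
import math


def rms_norm_ref(x, weight, eps):
    """Reference RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


row = [3.0, 4.0]
out = rms_norm_ref(row, [1.0, 1.0], 0.0)

# With unit weights and eps = 0, the normalized row has a mean square of 1.
mean_sq_out = sum(v * v for v in out) / len(out)
```

When comparing against the extension, use a tolerance suited to FP16 rather than exact equality.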