Implementation:Turboderp org Exllamav2 Ext Bindings

From Leeroopedia
Knowledge Sources
Domains Python_Bindings, C_Extension
Last Updated 2026-02-15 00:00 GMT

Overview

pybind11 module entry point that registers the 50+ C++ extension functions as Python-callable functions in the exllamav2_ext module.

Description

ext_bindings.cpp is the single PYBIND11_MODULE definition file that exposes the entire ExLlamaV2 C++ extension library to Python. It includes all extension headers and registers each function via m.def(). The bindings are organized by functional category:

Quantization

  • pack_rows_4 -- Pack tensor rows into 4-bit format
  • pack_columns -- Pack tensor columns
  • quantize -- Quantize tensor values
  • quantize_err -- Quantize with error tracking
  • quantize_range / quantize_range_inplace -- Range-based quantization
  • sim_anneal -- Simulated annealing for quantization optimization
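To make the scale / zero-point / maximum-level vocabulary concrete, here is a minimal pure-Python sketch of uniform quantization followed by dequantization. It uses the common round-and-clamp formulation and is an assumption about what a function such as quantize computes, not a transcription of the C++ kernel; the helper name quantize_dequantize is hypothetical.

```python
def quantize_dequantize(values, scale, qzero, maxq):
    """Snap each value to a uniform grid, then map it back to floating point.

    Assumed textbook formulation, not the extension's actual kernel.
    """
    out = []
    for x in values:
        q = round(x / scale + qzero)      # nearest grid index
        q = max(0, min(maxq, q))          # clamp to the valid range [0, maxq]
        out.append((q - qzero) * scale)   # dequantized value
    return out


# 4-bit grid: 16 levels, zero point in the middle, step 0.25.
print(quantize_dequantize([-1.0, -0.26, 0.0, 0.4, 2.0], scale=0.25, qzero=8, maxq=15))
```

Values that fall outside the representable range saturate at the nearest end of the grid, which is the error that quantize_err-style functions would be expected to measure.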

Sampling

  • apply_rep_penalty -- Apply repetition penalty to logits
  • sample_basic -- Basic sampling from logit distribution
  • logit_filter_exclusive -- Exclusive token filtering on logits
  • fast_fill_cpu_ones_bool -- Fast CPU boolean tensor fill
  • fast_fadd_cpu / fast_copy_cpu -- Optimized CPU tensor operations
  • dump_profile_results -- Dump profiling data
  • partial_strings_match -- Partial string matching for guided generation
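As an illustration of what a repetition penalty does to logits, the sketch below applies the widely used formulation in which positive logits are divided by the penalty and negative logits are multiplied by it. This is an assumption for illustration only: the helper name is hypothetical, and the extension's apply_rep_penalty may take additional parameters and behave differently.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of logits in which previously seen tokens are less likely.

    Assumed common formulation, not the extension's actual implementation.
    """
    out = list(logits)
    for token_id in set(seen_token_ids):
        value = out[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[token_id] = value / penalty if value > 0 else value * penalty
    return out


# Tokens 0 and 1 were already generated; token 2 is left untouched.
print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1, 0], penalty=2.0))
```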

Safetensors / Loader

  • stloader_read -- Read safetensors file
  • tensor_remap / tensor_remap_4bit -- Remap tensor layouts

Matrix Operations

  • make_q_matrix / make_q_matrix_split -- Create quantized matrices
  • free_q_matrix -- Free quantized matrix
  • reconstruct -- Dequantize to FP16
  • gemm_half_q_half / gemm_half_q_half_tp -- Quantized GEMM (single and TP)
  • matrix_fp16_to_q4 / matrix_q4_to_fp16 -- Q4 format conversions
  • make_group_map -- Build quantization group mapping

Attention

  • make_q_attn / free_q_attn -- Create/destroy attention layers
  • q_attn_forward_1 / q_attn_forward_2 -- Two-phase attention forward
  • q_attn_set_loras -- Configure attention LoRA adapters
  • tp_attn_forward_paged_ / tp_attn_forward_ -- Tensor-parallel attention
  • set_flash_attn_func -- Register Flash Attention function

MLP

  • make_q_mlp / free_q_mlp -- Create/destroy MLP layers
  • make_q_moe_mlp / free_q_moe_mlp -- Create/destroy MoE MLP layers
  • q_mlp_forward_ -- MLP forward pass
  • q_mlp_set_loras -- Configure MLP LoRA adapters
  • q_moe_mlp_forward_ -- MoE MLP forward pass
  • tp_mlp_forward_ -- Tensor-parallel MLP forward

Cache

  • fp16_to_fp8 / fp8_to_fp16 -- FP8 cache conversion
  • fp16_to_q_kv / q_to_fp16_kv -- Quantized KV cache conversion
  • count_match -- Cache prefix matching
  • cache_rotate -- Rotate cache entries

Hadamard

  • had_paley / had_paley2 -- Hadamard transform operations
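The names had_paley and had_paley2 suggest Paley-type Hadamard matrices, although this page does not say how the functions work. As background only, the sketch below builds a Hadamard matrix with the textbook Paley construction I for a prime q with q % 4 == 3. All names are hypothetical and nothing here is taken from the extension's source.

```python
def legendre(a, q):
    """Quadratic character of a modulo an odd prime q: 0, 1, or -1."""
    a %= q
    if a == 0:
        return 0
    return 1 if pow(a, (q - 1) // 2, q) == 1 else -1


def paley_hadamard(q):
    """Hadamard matrix of order q + 1; q must be a prime with q % 4 == 3."""
    if q % 4 != 3:
        raise ValueError("this construction needs q % 4 == 3")
    n = q + 1
    h = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or i == 0:
                h[i][j] = 1                      # diagonal and first row
            elif j == 0:
                h[i][j] = -1                     # first column below the diagonal
            else:
                h[i][j] = legendre(i - j, q)     # Jacobsthal block
    return h


def rows_are_orthogonal(h):
    """Check the defining property: distinct rows have zero dot product."""
    n = len(h)
    return all(
        sum(h[i][k] * h[j][k] for k in range(n)) == (n if i == j else 0)
        for i in range(n)
        for j in range(n)
    )


print(rows_are_orthogonal(paley_hadamard(7)))
```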

GEMM

  • gemm_half_half_half -- Dense FP16 GEMM

Normalization

  • rms_norm / rms_norm_ / rms_norm_tp -- RMS normalization variants
  • layer_norm / layer_norm_ -- Layer normalization
  • head_norm / head_norm_ -- Per-head normalization
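For readers unfamiliar with RMS normalization, the following pure-Python reference shows the standard definition applied to a single vector. It is a sketch for checking intuition, with a hypothetical helper name; the extension functions operate on tensors and write into a preallocated output rather than returning a list.

```python
import math


def rms_norm_reference(x, weight, epsilon=1e-6):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + epsilon)
    return [v * inv_rms * w for v, w in zip(x, weight)]


normalized = rms_norm_reference([3.0, 4.0], weight=[1.0, 1.0])
# After normalization with unit weights, the mean square is close to 1.
print(sum(v * v for v in normalized) / len(normalized))
```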

RoPE

  • rope_ -- Rotary position embedding application
  • gen_mrope_pos_ids -- Generate multi-modal RoPE position IDs

Element-wise

  • softcap_ -- Soft capping of logits
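Soft capping is usually defined as cap * tanh(x / cap), which leaves small values nearly unchanged and smoothly bounds large ones. The sketch below shows that formula on a plain list; it assumes softcap_ follows this usual definition, and unlike the in-place extension function it returns a new list.

```python
import math


def softcap(values, cap):
    """Squash values into [-cap, cap] with a scaled tanh."""
    return [cap * math.tanh(v / cap) for v in values]


# Small inputs pass through almost unchanged; large ones approach the cap.
print(softcap([0.0, 1.0, 100.0, -100.0], cap=30.0))
```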

Tensor Parallelism

  • make_tp_context / free_tp_context -- Create/destroy TP context
  • tp_broadcast / tp_gather -- Inter-device communication
  • tp_cross_device_barrier -- Device synchronization
  • tp_all_reduce -- Cross-device reduction

Usage

This file is compiled as part of the exllamav2_ext PyTorch C++ extension. All functions become available after importing the module. Each function is registered under its C++ name, so with the module imported as ext_c, calling ext_c.make_q_matrix(...) in Python invokes the C++ make_q_matrix(...) function through the pybind11 binding.
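Because every binding is a plain module attribute, a build can be checked for the functions a caller depends on before any of them is used. The sketch below shows one way to do this. The helper missing_bindings is hypothetical, and a SimpleNamespace stands in for the compiled module so that the example runs without the extension installed.

```python
from types import SimpleNamespace


def missing_bindings(module, names):
    """Return the names that are not callable attributes of module."""
    return [name for name in names if not callable(getattr(module, name, None))]


# Stand-in for the compiled extension; in real use, pass the imported module.
fake_ext = SimpleNamespace(
    make_q_matrix=lambda *args: 0,
    rms_norm=lambda *args: None,
)

print(missing_bindings(fake_ext, ["make_q_matrix", "rms_norm", "tp_all_reduce"]))
```

In real use the first argument would be the module imported as ext_c, and a non-empty result would point to a stale or partial build.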

Code Reference

Source Location

Signature

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
    // Quantization
    m.def("pack_rows_4", &pack_rows_4, "pack_rows_4");
    m.def("pack_columns", &pack_columns, "pack_columns");
    m.def("quantize", &quantize, "quantize");
    m.def("quantize_err", &quantize_err, "quantize_err");
    m.def("quantize_range", &quantize_range, "quantize_range");
    m.def("quantize_range_inplace", &quantize_range_inplace, "quantize_range_inplace");
    m.def("sim_anneal", &sim_anneal, "sim_anneal");

    // Sampling
    m.def("apply_rep_penalty", &apply_rep_penalty, "apply_rep_penalty");
    m.def("sample_basic", &sample_basic, "sample_basic");
    m.def("logit_filter_exclusive", &logit_filter_exclusive, "logit_filter_exclusive");
    // ... (50+ total bindings across all categories)

    // Matrix, Attention, MLP, Cache, Norm, RoPE, TP ...
}

Import

from exllamav2.ext import exllamav2_ext as ext_c

I/O Contract

Inputs

Category | Representative Functions | Input Types
Quantization | pack_rows_4, quantize, sim_anneal | torch.Tensor (various dtypes), int/float parameters
Sampling | apply_rep_penalty, sample_basic | torch.Tensor (kFloat/kHalf logits), penalty/temperature floats
Matrix | make_q_matrix, gemm_half_q_half | torch.Tensor (kInt weights, kHalf activations), uintptr_t handles
Attention | make_q_attn, q_attn_forward_1/2 | torch.Tensor (kHalf), uintptr_t QMatrix handles, int dims
MLP | make_q_mlp, q_mlp_forward_ | torch.Tensor (kHalf), uintptr_t QMatrix handles
Cache | fp16_to_fp8, fp16_to_q_kv | torch.Tensor (kHalf/kUInt8), int batch/offset params
TP | make_tp_context, tp_broadcast | Split tuples, torch.Tensor vectors, stream handles

Outputs

Category | Representative Functions | Output Types
Quantization | quantize, pack_rows_4 | void (in-place modification of output tensors)
Sampling | sample_basic | torch.Tensor (sampled token IDs)
Matrix | make_q_matrix | uintptr_t (opaque handle)
Attention | make_q_attn | uintptr_t (opaque handle); forward functions are void (in-place)
MLP | make_q_mlp | uintptr_t (opaque handle); forward functions are void (in-place)
Cache | count_match | int (match length)
TP | make_tp_context | uintptr_t (opaque handle)
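Since the make_* functions return opaque integer handles and each has a matching free_* function, the Python caller is responsible for releasing them. One possible pattern is a small context manager that pairs the two calls. The sketch below is an assumption about usage, not part of the extension: owned_handle is a hypothetical helper, and the fake make/free functions stand in for bindings such as make_q_matrix and free_q_matrix.

```python
from contextlib import contextmanager


@contextmanager
def owned_handle(make_fn, free_fn, *args):
    """Create a handle with make_fn and always release it with free_fn."""
    handle = make_fn(*args)
    try:
        yield handle
    finally:
        free_fn(handle)  # runs even if the body raises


# Stand-ins for a make_*/free_* pair, used only to make the sketch runnable.
live_handles = set()


def fake_make(size):
    handle = 1000 + size
    live_handles.add(handle)
    return handle


def fake_free(handle):
    live_handles.discard(handle)


with owned_handle(fake_make, fake_free, 4) as handle:
    print(handle in live_handles)   # the handle is live inside the block
print(handle in live_handles)       # and released afterwards
```

Tying the free call to a finally clause means the native object is released even when a forward pass raises partway through.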

Usage Examples

from exllamav2.ext import exllamav2_ext as ext_c

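# All C++ functions are directly accessible as module attributes.
# The calls below are illustrative: `...` marks elided arguments, and names
# such as input_tensor are placeholders for tensors allocated beforehand.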

# Quantization
ext_c.quantize(input_tensor, output_tensor, scale, qzero, maxq)

# Sampling
token_ids = ext_c.sample_basic(logits, temperature, top_k, top_p, typical, random)

# Matrix operations
q_handle = ext_c.make_q_matrix(q_weight, q_perm, q_invperm, q_scale, ...)
ext_c.gemm_half_q_half(input_fp16, q_handle, output_fp16, False)

# Attention
attn_handle = ext_c.make_q_attn(layernorm, layernorm_bias, ...)
ext_c.q_attn_forward_1(attn_handle, x, batch_size, q_len, ...)

# Normalization
ext_c.rms_norm(input_tensor, weight, output_tensor, epsilon)

# RoPE
ext_c.rope_(q_tensor, sin, cos, past_len, num_heads, head_dim)
