Implementation:Predibase Lorax GPTQ Utils Exllamav2
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Provides an ExLlamaV2-based 4-bit quantized linear layer implementation that uses highly optimized CUDA kernels for GPTQ and EXL2 format dequantization and matrix multiplication.
Description
This module is adapted from the turboderp/exllamav2 project and provides an alternative quantized linear layer backend that uses pre-compiled CUDA kernels (via exllamav2_kernels) instead of Triton.
ext_gemm_half_q_half(x, q_handle, q4_width, force_cuda): Wrapper around the native gemm_half_q_half CUDA kernel that performs matrix multiplication between a float16 input tensor and a quantized weight matrix referenced by its Q-matrix handle. Returns float16 output.
make_group_map(q_groups, num_qrows): Converts group metadata into a flat group map tensor used by the EXL2 kernel format. For each group, it calculates the number of rows based on bit-width and produces a mapping of group index and remaining row count.
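The row-count calculation described above can be sketched in plain Python. This is an illustrative reconstruction, not the module's code: it assumes q_groups is a flat sequence of (bits, starting packed row) pairs and that each packed row holds 32 bits. The name make_group_map_sketch is hypothetical, and the real function returns a tensor rather than a list.

```python
from typing import List, Sequence


def make_group_map_sketch(q_groups: Sequence[int], num_qrows: int) -> List[int]:
    """Flatten (bits, start_qrow) pairs into [group_index, rows_remaining, ...]."""
    group_map: List[int] = []
    num_groups = len(q_groups) // 2
    for i in range(num_groups):
        bits = q_groups[2 * i]
        start = q_groups[2 * i + 1]
        # A group ends where the next one starts; the last group ends at num_qrows.
        end = q_groups[2 * i + 3] if i < num_groups - 1 else num_qrows
        # Assumption: one packed row stores 32 bits, i.e. 32 // bits weight rows.
        rows = (end - start) * 32 // bits
        for j in range(rows):
            # Each weight row records its group and how many rows remain in it.
            group_map.extend((i, rows - j))
    return group_map


# One 4-bit group occupying packed row 0, then one 2-bit group occupying packed row 1.
example_map = make_group_map_sketch([4, 0, 2, 1], num_qrows=2)
```

Lower bit-widths pack more weight rows into each 32-bit row, so the 2-bit group in the example expands to twice as many entries as the 4-bit group.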
ext_make_q_matrix(w, temp_dq): Creates a Q-matrix handle from weight dictionaries. Supports two formats:
- EXL2 format: When the q_weight key is present, uses EXL2 quantization metadata (q_scale, q_perm, q_invperm, q_groups, q_group_map).
- GPTQ format: When the qweight key is present, uses standard GPTQ quantization metadata (qzeros, scales, g_idx). Handles weights both with and without a non-trivial g_idx.
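The key-based dispatch between the two formats can be sketched as follows. This is a simplified illustration using plain Python containers: the function name and the returned labels are hypothetical, and it assumes that a "non-trivial" g_idx means one that is present and not all zeros.

```python
from typing import Any, Mapping


def detect_weight_format(w: Mapping[str, Any]) -> str:
    """Classify a weight dictionary by the keys it contains."""
    if "q_weight" in w:
        # EXL2 checkpoints use the underscore-separated key.
        return "exl2"
    if "qweight" in w:
        g_idx = w.get("g_idx")
        # Assumption: an absent or all-zero g_idx carries no reordering information.
        if g_idx is not None and any(v != 0 for v in g_idx):
            return "gptq_with_g_idx"
        return "gptq"
    raise ValueError("weight dict has neither 'q_weight' nor 'qweight'")
```

Checking the EXL2 key first keeps the two branches unambiguous even though the key names differ by a single underscore.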
QuantLinear (nn.Module): A quantized linear layer that supports only 4-bit quantization. Uses a two-phase initialization: the constructor registers buffers and computes scratch space requirements, while post_init() creates the actual Q-matrix handle after all layers have been allocated. The module tracks global state (FIXED_BYTES, LAYERS) to compute the maximum scratch space needed across all layers.
ExLlamaV2DeviceTensors: Manages a shared scratch buffer on the GPU device. The prepare() method allocates the buffer, and get_scratch_slice() returns aligned slices for individual layer use.
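A minimal sketch of the lazy-allocate-then-slice pattern, using a bytearray in place of a GPU tensor. The class name ScratchSketch and the 128-byte alignment are assumptions made for illustration; the source only states that slices are aligned.

```python
from typing import Optional


class ScratchSketch:
    """CPU stand-in for a shared scratch buffer with aligned slices."""

    def __init__(self, scratch_bytes: int, alignment: int = 128) -> None:
        self.scratch_bytes = scratch_bytes
        self.alignment = alignment  # assumed value, not taken from the source
        self.scratch: Optional[bytearray] = None

    def _align(self, n: int) -> int:
        # Round n up to the next multiple of the alignment.
        return (n + self.alignment - 1) // self.alignment * self.alignment

    def prepare(self) -> None:
        # Allocate the buffer once; every layer shares it afterwards.
        self.scratch = bytearray(self._align(self.scratch_bytes))

    def get_scratch_slice(self, size_bytes: int) -> memoryview:
        if self.scratch is None:
            self.prepare()
        size = self._align(size_bytes)
        if size > len(self.scratch):
            raise ValueError("requested slice exceeds the scratch buffer")
        # Return a view rather than a copy so callers share the same memory.
        return memoryview(self.scratch)[:size]
```

Because the slice is a view, a write made through it is visible in the shared buffer, which is what allows layers to reuse one allocation.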
set_device() and create_exllama_buffers(): Module-level functions for initializing the global device and creating scratch buffers, then calling post_init() on all registered layers.
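The two-phase lifecycle (register every layer first, then size one shared buffer and finalize) can be illustrated with a small pure-Python sketch. All names here (LayerSketch, _LAYERS, create_buffers_sketch) are hypothetical stand-ins, and the string "handle" merely marks where the real module would create a Q-matrix handle.

```python
from typing import List, Optional

# Hypothetical stand-in for the module-level layer registry.
_LAYERS: List["LayerSketch"] = []


class LayerSketch:
    def __init__(self, name: str, scratch_bytes: int) -> None:
        self.name = name
        self.scratch_bytes = scratch_bytes
        self.handle: Optional[str] = None  # filled in by post_init()
        # Phase 1: only record the layer and its scratch requirement.
        _LAYERS.append(self)

    def post_init(self, shared_bytes: int) -> None:
        # Phase 2: the real module would build the Q-matrix handle here.
        self.handle = f"{self.name}:q_matrix(scratch={shared_bytes})"


def create_buffers_sketch() -> int:
    """Size the shared buffer for the largest layer, then finalize every layer."""
    shared = max((layer.scratch_bytes for layer in _LAYERS), default=0)
    for layer in _LAYERS:
        layer.post_init(shared)
    return shared
```

Deferring handle creation means the shared buffer is sized once, for the most demanding layer, instead of being reallocated as each layer is constructed.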
Usage
This module is used as the default quantized linear layer backend for 4-bit GPTQ models when ExLlamaV2 kernels are available. During model loading, QuantLinear instances are created for each quantized layer. After all layers are instantiated, set_device() and create_exllama_buffers() are called to allocate shared scratch memory and finalize Q-matrix handles. During inference, forward() delegates to the native CUDA kernel for maximum throughput.
Code Reference
Source Location
- Repository: Predibase_Lorax
- File: server/lorax_server/utils/gptq/exllamav2.py
- Lines: 1-263
Signature
def ext_gemm_half_q_half(x, q_handle, q4_width, force_cuda)
def make_group_map(q_groups, num_qrows)
def ext_make_q_matrix(w: dict, temp_dq)
def set_device(device)
def create_exllama_buffers()
class QuantLinear(nn.Module):
    QUANT_TYPE = "exllamav2"
    def __init__(self, qweight, qzeros, scales, g_idx, bias, bits, groupsize)
    def post_init(self, temp_dq)
    def forward(self, x, force_cuda=False)
    def temp_dq_size(self)
    def temp_fwd_size(self, max_input_len, max_batch_size)
    def scratch_spacing(self, max_input_len=8192, max_batch_size=32)
    @property
    def weight(self) -> torch.Tensor
class ExLlamaV2DeviceTensors:
    def __init__(self, device, scratch_bytes)
    def prepare(self)
    def get_scratch_slice(self, size_bytes)
Import
from lorax_server.utils.gptq.exllamav2 import QuantLinear, set_device, create_exllama_buffers
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| qweight | torch.Tensor (int32) | Yes | Packed quantized weight tensor |
| qzeros | torch.Tensor (int32) | Yes | Packed quantized zero-point tensor |
| scales | torch.Tensor (float16) | Yes | Per-group scale factors |
| g_idx | torch.Tensor (int32) | Yes | Group index mapping for input features |
| bias | torch.Tensor or None | No | Optional bias vector |
| bits | int | Yes | Quantization bit-width (must be 4 for ExLlamaV2) |
| groupsize | int | Yes | Number of input features per quantization group |
| force_cuda | bool | No | Force CUDA execution path in forward pass (default False) |
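To make "packed" in the table above concrete, here is a sketch of how eight 4-bit values can share one 32-bit word. It assumes the common GPTQ convention of placing the first value in the lowest nibble; the helper names are hypothetical, and Python integers are used in place of int32 tensors, so the word is treated as unsigned.

```python
from typing import List, Sequence


def pack_4bit(values: Sequence[int]) -> int:
    """Pack eight values in [0, 15] into one 32-bit word, lowest nibble first."""
    if len(values) != 8 or any(not 0 <= v < 16 for v in values):
        raise ValueError("expected eight values in the range 0..15")
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_4bit(word: int) -> List[int]:
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]
```

Under this convention a packed tensor has one eighth as many entries along the packed dimension as there are quantized values.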
Outputs
| Name | Type | Description |
|---|---|---|
| output | torch.Tensor (float16) | Result of quantized linear transformation with shape (*input_shape[:-1], outfeatures) |
Usage Examples
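# Internal usage during model initialization
import torch
from lorax_server.utils.gptq.exllamav2 import QuantLinear, set_device, create_exllama_buffers
# Layers are created during model loading; qweight, qzeros, scales, g_idx, and bias
# are tensors read from the quantized checkpoint
layer = QuantLinear(qweight, qzeros, scales, g_idx, bias, bits=4, groupsize=128)
# After all layers are created, set the device and initialize shared buffers
# (create_exllama_buffers() also calls post_init() on every registered layer)
set_device(torch.device("cuda:0"))
create_exllama_buffers()
# Forward pass during inference; input_tensor is a float16 tensor
output = layer(input_tensor)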