Implementation:Predibase Lorax GPTQ Utils Exllamav2

Knowledge Sources	Predibase_Lorax
Domains	Quantization, Inference
Last Updated	2026-02-08 00:00 GMT

Overview

Provides an ExLlamaV2-based 4-bit quantized linear layer implementation that uses highly optimized CUDA kernels for GPTQ and EXL2 format dequantization and matrix multiplication.

Description

This module is adapted from the turboderp/exllamav2 project and provides an alternative quantized linear layer backend that uses pre-compiled CUDA kernels (via exllamav2_kernels) instead of Triton.

ext_gemm_half_q_half(x, q_handle, q4_width, force_cuda): Wrapper around the native gemm_half_q_half CUDA kernel that performs matrix multiplication between a float16 input tensor and a quantized weight matrix referenced by its Q-matrix handle. Returns float16 output.

make_group_map(q_groups, num_qrows): Converts group metadata into a flat group map tensor used by the EXL2 kernel format. For each group, it computes the number of rows from the group's bit-width, then emits one (group index, rows remaining in the group) pair per row.
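
The mapping can be sketched in plain Python. This is an illustrative sketch rather than the module's code: it assumes q_groups is a flat sequence of (bits, first packed row) pairs and that packed rows are 32-bit words. The name group_map_sketch is hypothetical, and the real function returns a tensor rather than a list.

```python
def group_map_sketch(q_groups, num_qrows):
    """Return a flat list [group, rows_left, group, rows_left, ...].

    Assumption: q_groups is a flat sequence of (bits, first_packed_row) pairs.
    """
    num_groups = len(q_groups) // 2
    group_map = []
    for i in range(num_groups):
        bits = q_groups[2 * i]
        start = q_groups[2 * i + 1]
        # A group ends where the next group starts, or at the last packed row.
        end = q_groups[2 * i + 3] if i < num_groups - 1 else num_qrows
        # (end - start) packed 32-bit rows hold this many weight rows.
        rows = (end - start) * 32 // bits
        for j in range(rows):
            # One pair per row: the group index and the rows left in the group.
            group_map.extend([i, rows - j])
    return group_map


# Two groups: 4-bit over packed rows 0-1, 2-bit over packed row 2.
example = group_map_sketch([4, 0, 2, 2], num_qrows=3)
```

In this example both groups expand to 16 rows, so the flat map holds 32 pairs, and the remaining-row counter restarts at the first row of each group.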

ext_make_q_matrix(w, temp_dq): Creates a Q-matrix handle from weight dictionaries. Supports two formats:

EXL2 format: When q_weight key is present, uses EXL2 quantization metadata (q_scale, q_perm, q_invperm, q_groups, q_group_map).
GPTQ format: When qweight key is present, uses standard GPTQ quantization metadata (qzeros, scales, g_idx). Handles weights both with and without a non-trivial g_idx.
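
The key-based dispatch described above can be sketched as follows. Both helper names are hypothetical and not part of the module. The rule that a missing or all-zero g_idx counts as trivial is an assumption made for illustration, as is raising an error for an unrecognized dictionary.

```python
def detect_weight_format(w):
    """Classify a weight dictionary by the key that holds the packed weights."""
    if "q_weight" in w:
        return "exl2"
    if "qweight" in w:
        return "gptq"
    # Assumption: anything else is treated as an error in this sketch.
    raise ValueError("unrecognized weight dictionary: expected 'q_weight' or 'qweight'")


def has_nontrivial_g_idx(g_idx):
    """Assumption: a g_idx that is missing or all zeros counts as trivial."""
    return g_idx is not None and any(v != 0 for v in g_idx)
```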

QuantLinear (nn.Module): A quantized linear layer that supports only 4-bit quantization. Uses a two-phase initialization: the constructor registers buffers and computes scratch space requirements, while post_init() creates the actual Q-matrix handle after all layers have been allocated. The module tracks global state (FIXED_BYTES, LAYERS) to compute the maximum scratch space needed across all layers.
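
The two-phase pattern can be illustrated with a minimal stand-in that models only the bookkeeping. Everything here is hypothetical: LayerSketch, its registry and scratch_bytes attributes (which play the roles of LAYERS and FIXED_BYTES), and the sizing formula are assumptions for illustration, not the module's actual code.

```python
class LayerSketch:
    """Stand-in for a quantized layer; only the bookkeeping is modeled."""

    registry = []        # plays the role of LAYERS
    scratch_bytes = 0    # plays the role of FIXED_BYTES

    def __init__(self, infeatures, outfeatures):
        self.infeatures = infeatures
        self.outfeatures = outfeatures
        self.handle = None  # created later, in post_init()
        # Phase 1: record the largest scratch requirement seen so far.
        LayerSketch.scratch_bytes = max(LayerSketch.scratch_bytes, self.scratch_need())
        LayerSketch.registry.append(self)

    def scratch_need(self):
        # Hypothetical sizing: one float16 copy of the dequantized weights.
        return self.infeatures * self.outfeatures * 2

    def post_init(self, scratch):
        # Phase 2: every layer is finalized against the same shared buffer.
        self.handle = ("q_matrix", id(scratch))


def finalize_all():
    """Allocate one buffer sized for the largest layer, then finalize all layers."""
    scratch = bytearray(LayerSketch.scratch_bytes)
    for layer in LayerSketch.registry:
        layer.post_init(scratch)
    return scratch
```

Deferring handle creation until every layer exists lets the buffer be sized once, for the largest layer, instead of being reallocated as layers are added.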

ExLlamaV2DeviceTensors: Manages a shared scratch buffer on the GPU device. The prepare() method allocates the buffer, and get_scratch_slice() returns aligned slices for individual layer use.
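
A minimal sketch of lazy allocation plus aligned slicing, using a host-side bytearray in place of a GPU tensor. The 128-byte alignment and the choice to hand out slices from the start of the buffer are assumptions for illustration; ScratchSketch and align_up are hypothetical names.

```python
def align_up(size_bytes, alignment=128):
    """Round size_bytes up to the next multiple of alignment."""
    return (size_bytes + alignment - 1) // alignment * alignment


class ScratchSketch:
    """Lazily allocated byte buffer that hands out aligned prefixes."""

    def __init__(self, scratch_bytes):
        self.scratch_bytes = scratch_bytes
        self.scratch = None

    def prepare(self):
        # Allocate the shared buffer, padded to the alignment boundary.
        self.scratch = memoryview(bytearray(align_up(self.scratch_bytes)))

    def get_scratch_slice(self, size_bytes):
        if self.scratch is None:
            self.prepare()  # allocate on first use
        size = align_up(size_bytes)
        if size > len(self.scratch):
            raise ValueError("requested slice exceeds the scratch buffer")
        # A view into the shared buffer; no data is copied.
        return self.scratch[:size]
```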

set_device() and create_exllama_buffers(): Module-level functions. set_device() initializes the global device; create_exllama_buffers() creates the shared scratch buffers and then calls post_init() on all registered layers.

Usage

This module is used as the default quantized linear layer backend for 4-bit GPTQ models when ExLlamaV2 kernels are available. During model loading, QuantLinear instances are created for each quantized layer. After all layers are instantiated, set_device() and create_exllama_buffers() are called to allocate shared scratch memory and finalize Q-matrix handles. During inference, forward() delegates to the native CUDA kernel for maximum throughput.

Code Reference

Source Location

Repository: Predibase_Lorax
File: server/lorax_server/utils/gptq/exllamav2.py
Lines: 1-263

Signature

def ext_gemm_half_q_half(x, q_handle, q4_width, force_cuda)
def make_group_map(q_groups, num_qrows)
def ext_make_q_matrix(w: dict, temp_dq)

def set_device(device)
def create_exllama_buffers()

class QuantLinear(nn.Module):
    QUANT_TYPE = "exllamav2"
    def __init__(self, qweight, qzeros, scales, g_idx, bias, bits, groupsize)
    def post_init(self, temp_dq)
    def forward(self, x, force_cuda=False)
    def temp_dq_size(self)
    def temp_fwd_size(self, max_input_len, max_batch_size)
    def scratch_spacing(self, max_input_len=8192, max_batch_size=32)
    @property
    def weight(self) -> torch.Tensor

class ExLlamaV2DeviceTensors:
    def __init__(self, device, scratch_bytes)
    def prepare(self)
    def get_scratch_slice(self, size_bytes)

Import

from lorax_server.utils.gptq.exllamav2 import QuantLinear, set_device, create_exllama_buffers

I/O Contract

Inputs

Name	Type	Required	Description
qweight	torch.Tensor (int32)	Yes	Packed quantized weight tensor
qzeros	torch.Tensor (int32)	Yes	Packed quantized zero-point tensor
scales	torch.Tensor (float16)	Yes	Per-group scale factors
g_idx	torch.Tensor (int32)	Yes	Group index mapping for input features
bias	torch.Tensor or None	No	Optional bias vector
bits	int	Yes	Quantization bit-width (must be 4 for ExLlamaV2)
groupsize	int	Yes	Number of input features per quantization group
force_cuda	bool	No	Argument to forward(), not the constructor; forces the CUDA execution path in the forward pass (default False)
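
To make "packed" concrete: with 4-bit quantization, eight values fit in one 32-bit word. The sketch below assumes lowest-nibble-first ordering and a plain affine dequantization, scale * (q - zero). Both are common conventions, but the exact bit layout and zero-point offset used by a given checkpoint are assumptions here, and the function names are hypothetical.

```python
def pack_int4(values):
    """Pack eight 4-bit integers (0..15) into one 32-bit word, lowest nibble first."""
    if len(values) != 8 or any(not 0 <= v <= 15 for v in values):
        raise ValueError("expected eight values in the range 0..15")
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_int4(word):
    """Recover the eight 4-bit integers from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize(q, zero, scale):
    """Affine dequantization: map a 4-bit code back to a real-valued weight."""
    return scale * (q - zero)
```

Under this layout a packed tensor has one eighth as many elements along the packed dimension as the weight matrix it represents.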

Outputs

Name	Type	Description
output	torch.Tensor (float16)	Result of quantized linear transformation with shape (*input_shape[:-1], outfeatures)

Usage Examples

# Internal usage during model initialization
import torch

from lorax_server.utils.gptq.exllamav2 import QuantLinear, set_device, create_exllama_buffers

# Layers are created during model loading; qweight, qzeros, scales, g_idx and bias
# are tensors read from the GPTQ checkpoint (loading not shown here)
layer = QuantLinear(qweight, qzeros, scales, g_idx, bias, bits=4, groupsize=128)

# After all layers are created, initialize shared buffers and finalize Q-matrix handles
set_device(torch.device("cuda:0"))
create_exllama_buffers()

# Forward pass during inference; input_tensor is a float16 tensor on the same device
output = layer(input_tensor)

Related Pages

Environment:Predibase_Lorax_CUDA_GPU_Runtime
