Implementation:Predibase Lorax GPTQ Exllama V2
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements a 4-bit quantized linear layer using the ExLlama v2 CUDA kernels, supporting both GPTQ and EXL2 quantization formats with deferred post-initialization for shared scratch memory.
Description
This module is adapted from turboderp's exllamav2 project and wraps the exllamav2_kernels C++ extension. It contains:
- ext_gemm_half_q_half: A helper that performs a half-precision GEMM against a quantized weight matrix handle by calling gemm_half_q_half from the C++ extension. It reshapes the input to 2D and produces an output of the specified width.
- make_group_map: Constructs the group map tensor needed for irregular group sizes in the EXL2 format. It iterates over the quantization groups and records, for each row, its group index and the number of rows remaining in that group.
- ext_make_q_matrix: Creates a quantized matrix handle via the C++ make_q_matrix function. It supports two formats: EXL2 (using q_weight, q_perm, q_invperm, q_scale, q_scale_max, q_groups) and GPTQ (using qweight, qzeros, scales, with an optional g_idx for act-order). A dummy none_tensor on the meta device stands in for absent parameters.
- QuantLinear: An nn.Module with QUANT_TYPE = "exllamav2" that holds the quantized weight tensors. During __init__, it calculates the scratch space it requires and registers itself in a global LAYERS list. Its post_init method is called later by create_exllama_buffers to build the actual Q matrix handle using shared scratch memory managed by ExLlamaV2DeviceTensors. The forward method calls ext_gemm_half_q_half with an optional force_cuda flag and adds the bias when one is present.
- ExLlamaV2DeviceTensors: A helper class that lazily allocates a contiguous scratch buffer on the GPU and hands out slices via get_scratch_slice, ensuring 128-byte alignment.
- create_exllama_buffers: Iterates over all registered layers and calls post_init on each with the shared device tensors.
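The lazy allocation and 128-byte alignment described for ExLlamaV2DeviceTensors can be illustrated with a small host-side sketch. This is not the module's code: ScratchPool, align_up, and get_slice are hypothetical names, a bytearray stands in for the GPU buffer, and rounding the requested size up to a multiple of 128 is an assumption about how the alignment is achieved.

```python
ALIGNMENT = 128  # bytes, taken from the description of get_scratch_slice


def align_up(size_bytes: int, alignment: int = ALIGNMENT) -> int:
    """Round size_bytes up to the next multiple of alignment."""
    return ((size_bytes + alignment - 1) // alignment) * alignment


class ScratchPool:
    """Hypothetical stand-in for a lazily allocated shared scratch buffer."""

    def __init__(self, scratch_bytes: int):
        self.scratch_bytes = scratch_bytes
        self.buffer = None  # nothing is allocated until the first request

    def get_slice(self, size_bytes: int) -> memoryview:
        # Allocate the backing buffer on first use (lazy allocation).
        if self.buffer is None:
            self.buffer = bytearray(align_up(self.scratch_bytes))
        aligned = align_up(size_bytes)
        if aligned > len(self.buffer):
            raise ValueError("requested slice exceeds the scratch buffer")
        # Every caller receives a view starting at offset 0 of the same
        # buffer, so the memory is shared rather than partitioned.
        return memoryview(self.buffer)[:aligned]
```

Because each slice is a view onto the same buffer, the total scratch footprint in this sketch is bounded by the largest single request rather than by the sum over all layers.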
Usage
This is the primary quantized linear layer used when loading GPTQ models with ExLlama v2 acceleration (the default path when use_exllama is True in the get_linear factory). Layers register themselves during construction, so create_exllama_buffers must be called after all layers are built and before inference to initialize the shared scratch memory.
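The register-then-initialize flow can be sketched in plain Python. The names below (DeferredLayer, create_buffers, the placeholder handle) are hypothetical and only mirror the pattern described above. Sizing the shared buffer to the largest per-layer request is an assumption of this sketch, not a statement about the module.

```python
LAYERS = []      # module-level registry, analogous to the LAYERS list above
FIXED_BYTES = 0  # assumed: largest scratch request seen so far


class DeferredLayer:
    """Hypothetical layer that defers handle creation to post_init."""

    def __init__(self, name: str, scratch_bytes: int):
        global FIXED_BYTES
        self.name = name
        self.handle = None  # created later, once shared scratch exists
        FIXED_BYTES = max(FIXED_BYTES, scratch_bytes)
        LAYERS.append(self)  # self-registration during construction

    def post_init(self, shared_scratch: bytearray) -> None:
        # Placeholder for building the real quantized-matrix handle.
        self.handle = (self.name, shared_scratch)

    def forward(self, x):
        if self.handle is None:
            raise RuntimeError("call create_buffers() before inference")
        return x  # a real layer would run the quantized GEMM here


def create_buffers() -> bytearray:
    """Allocate one shared buffer and finish initializing every layer."""
    shared = bytearray(FIXED_BYTES)
    for layer in LAYERS:
        layer.post_init(shared)
    return shared
```

In this sketch, calling forward on a layer before create_buffers raises a RuntimeError, which makes the ordering requirement explicit instead of letting it fail inside a kernel call.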
Code Reference
Source Location
- Repository: Predibase_Lorax
- File: server/lorax_server/layers/gptq/exllamav2.py
- Lines: 1-228
Signature
class QuantLinear(nn.Module):
    QUANT_TYPE = "exllamav2"
    def __init__(self, qweight, qzeros, scales, g_idx, bias, bits, groupsize):

class ExLlamaV2DeviceTensors:
    def __init__(self, device, scratch_bytes):
Import
from lorax_server.layers.gptq.exllamav2 import QuantLinear, create_exllama_buffers
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| qweight | torch.Tensor (int32) | Yes | Packed 4-bit quantized weight matrix on CUDA |
| qzeros | torch.Tensor (int32) | Yes | Packed quantized zero points |
| scales | torch.Tensor (float16) | Yes | Per-group scale factors |
| g_idx | torch.Tensor (int32) or None | No | Group index for activation ordering |
| bias | torch.Tensor or None | No | Optional bias vector |
| bits | int | Yes | Must be 4 (only 4-bit quantization is supported by exllamav2) |
| groupsize | int | Yes | Number of input features per quantization group |
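To make "packed 4-bit" concrete, the sketch below packs eight 4-bit values into one 32-bit word and derives the resulting row and group counts. The low-nibble-first bit order and packing along the input dimension follow the common GPTQ convention and are assumptions here; pack_nibbles, unpack_nibbles, packed_rows, and num_groups are hypothetical helpers, not functions from this module.

```python
BITS = 4
VALUES_PER_WORD = 32 // BITS  # eight 4-bit values fit in one 32-bit word


def pack_nibbles(values):
    """Pack eight 4-bit integers into one unsigned 32-bit pattern."""
    if len(values) != VALUES_PER_WORD:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << BITS):
            raise ValueError("value does not fit in 4 bits")
        word |= v << (BITS * i)  # assumed order: value 0 in the lowest nibble
    return word


def unpack_nibbles(word):
    """Recover the eight 4-bit integers from a packed 32-bit pattern."""
    mask = (1 << BITS) - 1
    return [(word >> (BITS * i)) & mask for i in range(VALUES_PER_WORD)]


def packed_rows(in_features: int) -> int:
    """Rows of a packed weight matrix, assuming packing along the input dim."""
    return in_features * BITS // 32


def num_groups(in_features: int, groupsize: int) -> int:
    """Number of quantization groups (rows of scales), rounded up."""
    return -(-in_features // groupsize)
```

Under these assumptions, a layer with 4096 input features and groupsize 128 would have 512 packed rows in qweight and 32 rows in scales. Python integers are unbounded, so the packed word here is the unsigned bit pattern; stored as int32, the same bits may read back as a negative number.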
Outputs
| Name | Type | Description |
|---|---|---|
| output | torch.Tensor (float16) | Result of the quantized linear transformation |
Usage Examples
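# Used internally by model layers via the linear factory
import torch
from lorax_server.layers.gptq.exllamav2 import QuantLinear, set_device, create_exllama_buffers
# qweight, qzeros, scales, g_idx, bias: GPTQ tensors assumed to be already loaded on the CUDA device
set_device(torch.device("cuda:0"))
layer = QuantLinear(qweight, qzeros, scales, g_idx, bias, bits=4, groupsize=128)
# Allocate the shared scratch memory and run post_init on every registered layer
create_exllama_buffers(max_total_tokens=2048)
output = layer(input_tensor)  # input_tensor: float16 tensor on the same device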