Implementation:Predibase Lorax GPTQ Exllama V2

Knowledge Sources	Predibase_Lorax
Domains	Quantization, Inference
Last Updated	2026-02-08 00:00 GMT

Overview

Implements a 4-bit quantized linear layer using the ExLlama v2 CUDA kernels, supporting both GPTQ and EXL2 quantization formats with deferred post-initialization for shared scratch memory.

Description

This module is adapted from the turboderp exllamav2 project and wraps the exllamav2_kernels C++ extension. It contains:

ext_gemm_half_q_half: A helper that performs half-precision GEMM with a quantized weight matrix handle, calling gemm_half_q_half from the C++ extension. It reshapes the input to 2D and produces output of the specified width.

make_group_map: Constructs a group map tensor needed for irregular group sizes in EXL2 format. It iterates over quantization groups and creates a mapping of rows to group indices with remaining row counts.

ext_make_q_matrix: Creates a quantized matrix handle via the C++ make_q_matrix function. It supports two formats: EXL2 (using q_weight, q_perm, q_invperm, q_scale, q_scale_max, q_groups) and GPTQ (using qweight, qzeros, scales, with optional g_idx for act-order). A dummy none_tensor on the meta device is used for absent parameters.

QuantLinear: An nn.Module with QUANT_TYPE = "exllamav2" that holds quantized weight tensors. During __init__, it calculates required scratch space and registers itself in a global LAYERS list. The post_init method is called later by create_exllama_buffers to allocate the actual Q matrix handle using shared scratch memory managed by ExLlamaV2DeviceTensors. The forward method calls ext_gemm_half_q_half with an optional force_cuda flag and adds bias.

ExLlamaV2DeviceTensors: A helper class that lazily allocates a contiguous scratch buffer on GPU and provides aligned slices via get_scratch_slice, ensuring 128-byte alignment.

create_exllama_buffers: Iterates over all registered layers and calls their post_init with the shared device tensors.
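The group map described for make_group_map can be illustrated with a small pure-Python sketch. It is not the module's code: the function name and the rows_per_group input are hypothetical, and the real function derives the per-group row counts from the q_groups tensor and returns a torch tensor rather than a list. The sketch only shows the output shape described above: one (group index, rows remaining in that group) pair per row.

```python
def make_group_map_sketch(rows_per_group):
    """Return a flat list [group, rows_left, group, rows_left, ...].

    rows_per_group is a hypothetical input: the number of weight rows that
    belong to each quantization group, in order.
    """
    group_map = []
    for group_index, rows in enumerate(rows_per_group):
        for j in range(rows):
            group_map.append(group_index)  # which group this row belongs to
            group_map.append(rows - j)     # rows left in the group, counting this one
    return group_map


# Two groups of unequal size: 2 rows, then 3 rows.
example = make_group_map_sketch([2, 3])
print(example)
```

Because every row carries its own group index, a kernel consuming such a map would not need to assume that all groups have the same size.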

Usage

This is the primary quantized linear layer used when loading GPTQ models with ExLlama v2 acceleration (the default path when use_exllama is True in the get_linear factory). Layers register themselves during construction, so create_exllama_buffers must be called after all layers have been built and before inference, in order to initialize the shared scratch memory.
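The register-then-initialize flow can be sketched without CUDA. Everything below is a stand-in and an assumption for illustration: ScratchSketch, LayerSketch, create_buffers_sketch, and the FIXED_BYTES counter are hypothetical names, and a bytearray replaces the GPU buffer. The sketch shows layers recording their scratch needs at construction, a single buffer sized to the largest need, and each layer receiving a 128-byte-aligned slice of that same buffer.

```python
ALIGNMENT = 128  # bytes, as described for get_scratch_slice

LAYERS = []      # stand-in for the module-level layer registry
FIXED_BYTES = 0  # hypothetical: largest aligned scratch requirement so far


def align_up(size_bytes, alignment=ALIGNMENT):
    """Round size_bytes up to the next multiple of alignment."""
    return ((size_bytes + alignment - 1) // alignment) * alignment


class ScratchSketch:
    """Lazily allocated scratch buffer shared by all layers."""

    def __init__(self, scratch_bytes):
        self.scratch_bytes = scratch_bytes
        self.buffer = None  # nothing is allocated until first use

    def get_scratch_slice(self, size_bytes):
        if self.buffer is None:
            self.buffer = bytearray(self.scratch_bytes)
        # Every caller gets a view starting at offset 0 of the same buffer.
        return memoryview(self.buffer)[: align_up(size_bytes)]


class LayerSketch:
    def __init__(self, name, scratch_needed):
        global FIXED_BYTES
        self.name = name
        self.scratch_needed = scratch_needed
        self.scratch = None  # filled in later by post_init
        FIXED_BYTES = max(FIXED_BYTES, align_up(scratch_needed))
        LAYERS.append(self)  # register for deferred initialization

    def post_init(self, shared):
        self.scratch = shared.get_scratch_slice(self.scratch_needed)


def create_buffers_sketch():
    """Allocate one shared buffer and finish initializing every layer."""
    shared = ScratchSketch(FIXED_BYTES)
    for layer in LAYERS:
        layer.post_init(shared)
    return shared


a = LayerSketch("q_proj", 1000)
b = LayerSketch("down_proj", 300)
shared = create_buffers_sketch()
```

Sharing one buffer works in this pattern because layers run one after another, so the peak scratch requirement is the maximum over layers rather than the sum.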

Code Reference

Source Location

Repository: Predibase_Lorax
File: server/lorax_server/layers/gptq/exllamav2.py
Lines: 1-228

Signature

class QuantLinear(nn.Module):
    QUANT_TYPE = "exllamav2"
    def __init__(self, qweight, qzeros, scales, g_idx, bias, bits, groupsize):

class ExLlamaV2DeviceTensors:
    def __init__(self, device, scratch_bytes):

Import

from lorax_server.layers.gptq.exllamav2 import QuantLinear, create_exllama_buffers

I/O Contract

Inputs

Name	Type	Required	Description
qweight	torch.Tensor (int32)	Yes	Packed 4-bit quantized weight matrix on CUDA
qzeros	torch.Tensor (int32)	Yes	Packed quantized zero points
scales	torch.Tensor (float16)	Yes	Per-group scale factors
g_idx	torch.Tensor (int32) or None	No	Group index for activation ordering
bias	torch.Tensor or None	No	Optional bias vector
bits	int	Yes	Must be 4 (only 4-bit quantization is supported by exllamav2)
groupsize	int	Yes	Number of input features per quantization group
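To make "packed 4-bit" concrete, the following sketch packs eight 4-bit values into one 32-bit word and recovers the unpacked row count from a packed one. The helper names are hypothetical, the low-nibble-first bit order is an assumption for illustration, and the assumption that packing runs along the input-feature dimension should be checked against the checkpoint format in use.

```python
BITS = 4
VALUES_PER_WORD = 32 // BITS  # eight 4-bit values fit in one 32-bit word


def pack_int4(values):
    """Pack eight values in [0, 15] into one unsigned 32-bit word.

    Assumption: value i occupies bits 4*i .. 4*i+3 (low nibble first).
    The result is an unsigned Python int; an int32 tensor would store the
    same bit pattern as a signed value.
    """
    assert len(values) == VALUES_PER_WORD
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << BITS)
        word |= v << (BITS * i)
    return word


def unpack_int4(word):
    """Inverse of pack_int4: return the eight 4-bit values."""
    mask = (1 << BITS) - 1
    return [(word >> (BITS * i)) & mask for i in range(VALUES_PER_WORD)]


def in_features_from_packed_rows(packed_rows, bits=BITS):
    """Unpacked row count, assuming rows are packed along the input dimension."""
    return packed_rows * 32 // bits


word = pack_int4([1, 2, 3, 4, 5, 6, 7, 8])
restored = unpack_int4(word)
```

Under these assumptions, a qweight tensor with 512 packed rows would correspond to 4096 input features.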

Outputs

Name	Type	Description
output	torch.Tensor (float16)	Result of the quantized linear transformation

Usage Examples

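# Used internally by model layers via the linear factory
import torch

from lorax_server.layers.gptq.exllamav2 import QuantLinear, set_device, create_exllama_buffers

# qweight, qzeros, scales, g_idx, bias, and input_tensor are assumed to be
# CUDA tensors already loaded from a GPTQ checkpoint.
set_device(torch.device("cuda:0"))
layer = QuantLinear(qweight, qzeros, scales, g_idx, bias, bits=4, groupsize=128)
# Call once, after all layers are constructed, to allocate the shared scratch memory.
create_exllama_buffers(max_total_tokens=2048)
output = layer(input_tensor)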

Related Pages

Environment:Predibase_Lorax_CUDA_GPU_Runtime
