Implementation:Predibase Lorax GPTQ Utils Exllamav2

From Leeroopedia


Knowledge Sources
Domains: Quantization, Inference
Last Updated: 2026-02-08 00:00 GMT

Overview

Provides an ExLlamaV2-based 4-bit quantized linear layer implementation that uses highly optimized CUDA kernels for GPTQ and EXL2 format dequantization and matrix multiplication.

Description

This module is adapted from the turboderp/exllamav2 project and provides an alternative quantized linear layer backend that uses pre-compiled CUDA kernels (via exllamav2_kernels) instead of Triton.

ext_gemm_half_q_half(x, q_handle, q4_width, force_cuda): Wrapper around the native gemm_half_q_half CUDA kernel that performs matrix multiplication between a float16 input tensor and a quantized weight matrix referenced by its Q-matrix handle. Returns float16 output.

make_group_map(q_groups, num_qrows): Converts the per-group metadata in q_groups into the flat group map tensor that the EXL2 kernels expect. For each group, it derives the number of weight rows from the group's bit-width and then records, for every row, the group index together with the number of rows remaining in that group.
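
The mapping described above can be sketched in plain Python. This is an illustrative sketch rather than the module's code: it assumes q_groups is a flat sequence of (bits, first packed row) pairs, one pair per group, and it returns a list where the real function returns a tensor. The name make_group_map_sketch is hypothetical.

```python
def make_group_map_sketch(q_groups, num_qrows):
    """Build a flat [group_index, rows_remaining, ...] list.

    Assumption: q_groups is laid out as [bits_0, start_0, bits_1, start_1, ...],
    where start_i is the first packed 32-bit row belonging to group i.
    """
    group_map = []
    num_groups = len(q_groups) // 2
    for i in range(num_groups):
        bits = q_groups[2 * i]
        start = q_groups[2 * i + 1]
        # A group ends where the next one starts, or at the last packed row.
        end = q_groups[2 * i + 3] if i < num_groups - 1 else num_qrows
        # A span of packed 32-bit rows stores (packed_rows * 32) // bits weight rows.
        rows = (end - start) * 32 // bits
        for j in range(rows):
            group_map += [i, rows - j]
    return group_map


# Two groups: a 4-bit group over packed rows 0-1 and a 2-bit group over row 2.
example = make_group_map_sketch([4, 0, 2, 2], num_qrows=3)
```

Both groups in this example cover 16 weight rows, so the resulting list holds 32 (group index, rows remaining) pairs.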

ext_make_q_matrix(w, temp_dq): Creates a Q-matrix handle from weight dictionaries. Supports two formats:

  • EXL2 format: When q_weight key is present, uses EXL2 quantization metadata (q_scale, q_perm, q_invperm, q_groups, q_group_map).
  • GPTQ format: When qweight key is present, uses standard GPTQ quantization metadata (qzeros, scales, g_idx). Handles checkpoints both with and without a non-trivial g_idx.
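
The key-based dispatch between the two formats can be illustrated with a small sketch. It only classifies a weight dictionary and does not build a Q-matrix handle. The function name detect_quant_format is hypothetical, and treating a g_idx that is None or all zeros as "trivial" is an assumption made for this example; plain lists stand in for tensors.

```python
def detect_quant_format(w):
    """Classify a weight dict by the keys it carries (illustrative only)."""
    if "q_weight" in w:
        # EXL2 checkpoints carry q_weight plus q_scale, q_perm, q_groups, ...
        return "exl2"
    if "qweight" in w:
        g_idx = w.get("g_idx")
        # Assumption: a g_idx that is missing or all zeros carries no
        # reordering information, so the simpler GPTQ path applies.
        non_trivial = g_idx is not None and any(v != 0 for v in g_idx)
        return "gptq_with_g_idx" if non_trivial else "gptq"
    raise ValueError("weight dict has neither 'q_weight' nor 'qweight'")


exl2_kind = detect_quant_format({"q_weight": [], "q_groups": []})
gptq_kind = detect_quant_format({"qweight": [], "g_idx": [0, 0, 0]})
act_order_kind = detect_quant_format({"qweight": [], "g_idx": [0, 0, 1, 1]})
```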

QuantLinear (nn.Module): A quantized linear layer that supports only 4-bit quantization. Uses a two-phase initialization: the constructor registers buffers and computes scratch space requirements, while post_init() creates the actual Q-matrix handle after all layers have been allocated. The module tracks global state (FIXED_BYTES, LAYERS) to compute the maximum scratch space needed across all layers.
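
The two-phase pattern can be shown with a toy model that has no GPU dependency. This is a sketch under stated assumptions, not the module's implementation: ToyQuantLayer and finalize_layers are hypothetical names, a bytearray stands in for the device scratch buffer, and a tuple stands in for the Q-matrix handle. Only the global names FIXED_BYTES and LAYERS come from the description above.

```python
FIXED_BYTES = 0   # largest scratch requirement seen so far
LAYERS = []       # every layer constructed so far


class ToyQuantLayer:
    def __init__(self, name, scratch_bytes):
        # Phase 1: record requirements only; no handle is created yet.
        global FIXED_BYTES
        self.name = name
        self.scratch_bytes = scratch_bytes
        self.q_handle = None
        FIXED_BYTES = max(FIXED_BYTES, scratch_bytes)
        LAYERS.append(self)

    def post_init(self, shared_scratch):
        # Phase 2: the shared buffer now exists and fits the largest layer.
        assert len(shared_scratch) >= self.scratch_bytes
        self.q_handle = (self.name, id(shared_scratch))


def finalize_layers():
    """Allocate one buffer sized for the largest layer, then finish all layers."""
    shared = bytearray(FIXED_BYTES)
    for layer in LAYERS:
        layer.post_init(shared)
    return shared


small = ToyQuantLayer("q_proj", 1024)
large = ToyQuantLayer("down_proj", 4096)
shared = finalize_layers()
```

Deferring handle creation means a single buffer sized for the most demanding layer can be reused by all of them, instead of each layer allocating its own.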

ExLlamaV2DeviceTensors: Manages a shared scratch buffer on the GPU device. The prepare() method allocates the buffer, and get_scratch_slice() returns aligned slices for individual layer use.
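
A minimal sketch of the allocate-then-slice behavior, using host memory in place of a GPU tensor. The class name ScratchBufferSketch is hypothetical, the 128-byte alignment is an assumption chosen for illustration, and the lazy call to prepare() is one possible design, not a statement about the module's exact behavior.

```python
class ScratchBufferSketch:
    ALIGN = 128  # assumed alignment in bytes, for illustration only

    def __init__(self, scratch_bytes):
        self.scratch_bytes = scratch_bytes
        self.scratch = None  # allocated by prepare()

    def prepare(self):
        # Stand-in for allocating a device tensor of scratch_bytes bytes.
        self.scratch = bytearray(self.scratch_bytes)

    def get_scratch_slice(self, size_bytes):
        if self.scratch is None:
            self.prepare()
        # Round the request up to the next multiple of ALIGN.
        aligned = -(-size_bytes // self.ALIGN) * self.ALIGN
        if aligned > self.scratch_bytes:
            raise ValueError("requested slice exceeds the scratch buffer")
        # Every caller gets a view starting at offset 0 of the same buffer.
        return memoryview(self.scratch)[:aligned]


view = ScratchBufferSketch(1024).get_scratch_slice(200)
```

Because each slice is a view onto the same storage, the sketch only works when layers use the scratch space one at a time, which matches sequential layer execution during a forward pass.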

set_device() and create_exllama_buffers(): Module-level functions. set_device() records the global device; create_exllama_buffers() allocates the shared scratch buffer on that device and then calls post_init() on every registered layer.

Usage

This module is used as the default quantized linear layer backend for 4-bit GPTQ models when ExLlamaV2 kernels are available. During model loading, QuantLinear instances are created for each quantized layer. After all layers are instantiated, set_device() and create_exllama_buffers() are called to allocate shared scratch memory and finalize Q-matrix handles. During inference, forward() delegates to the native CUDA kernel for maximum throughput.

Code Reference

Source Location

  • Repository: Predibase_Lorax
  • File: server/lorax_server/utils/gptq/exllamav2.py
  • Lines: 1-263

Signature

def ext_gemm_half_q_half(x, q_handle, q4_width, force_cuda)
def make_group_map(q_groups, num_qrows)
def ext_make_q_matrix(w: dict, temp_dq)

def set_device(device)
def create_exllama_buffers()

class QuantLinear(nn.Module):
    QUANT_TYPE = "exllamav2"
    def __init__(self, qweight, qzeros, scales, g_idx, bias, bits, groupsize)
    def post_init(self, temp_dq)
    def forward(self, x, force_cuda=False)
    def temp_dq_size(self)
    def temp_fwd_size(self, max_input_len, max_batch_size)
    def scratch_spacing(self, max_input_len=8192, max_batch_size=32)
    @property
    def weight(self) -> torch.Tensor

class ExLlamaV2DeviceTensors:
    def __init__(self, device, scratch_bytes)
    def prepare(self)
    def get_scratch_slice(self, size_bytes)

Import

from lorax_server.utils.gptq.exllamav2 import QuantLinear, set_device, create_exllama_buffers

I/O Contract

Inputs

Name        Type                    Required  Description
qweight     torch.Tensor (int32)    Yes       Packed quantized weight tensor
qzeros      torch.Tensor (int32)    Yes       Packed quantized zero-point tensor
scales      torch.Tensor (float16)  Yes       Per-group scale factors
g_idx       torch.Tensor (int32)    Yes       Group index mapping for input features
bias        torch.Tensor or None    No        Optional bias vector
bits        int                     Yes       Quantization bit-width (must be 4 for ExLlamaV2)
groupsize   int                     Yes       Number of input features per quantization group
force_cuda  bool                    No        Force CUDA execution path in forward pass (default False)

Outputs

Name    Type                    Description
output  torch.Tensor (float16)  Result of quantized linear transformation with shape (*input_shape[:-1], outfeatures)
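
The shape rule can be checked with a short helper. The output_shape function mirrors the formula above. The gptq_features function is an assumption added for illustration: it relies on the common GPTQ convention that qweight packs 32 // bits values per int32 along the input dimension, which this page does not state explicitly. Both function names are hypothetical.

```python
def output_shape(input_shape, outfeatures):
    """Only the last dimension changes; leading dimensions pass through."""
    return (*input_shape[:-1], outfeatures)


def gptq_features(qweight_shape, bits=4):
    """Recover (infeatures, outfeatures) from a packed qweight shape.

    Assumption: qweight has shape (infeatures * bits // 32, outfeatures).
    """
    packed_rows, outfeatures = qweight_shape
    infeatures = packed_rows // bits * 32
    return infeatures, outfeatures


infeatures, outfeatures = gptq_features((512, 11008), bits=4)
shape = output_shape((2, 7, infeatures), outfeatures)
```

Under these assumptions, a (2, 7, 4096) input to a layer whose qweight has shape (512, 11008) yields a (2, 7, 11008) output.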

Usage Examples

# Internal usage during model initialization
import torch

from lorax_server.utils.gptq.exllamav2 import QuantLinear, set_device, create_exllama_buffers

# Layers are created during model loading.
# qweight, qzeros, scales, g_idx and bias are tensors read from the GPTQ checkpoint.
layer = QuantLinear(qweight, qzeros, scales, g_idx, bias, bits=4, groupsize=128)

# After all layers are created, initialize shared buffers
# (this also calls post_init() on every registered layer)
set_device(torch.device("cuda:0"))
create_exllama_buffers()

# Forward pass during inference; input_tensor is a float16 tensor on the same device
output = layer(input_tensor)
