Implementation:Predibase Lorax GPTQ Exllama V2

From Leeroopedia
Revision as of 16:20, 16 February 2026 by Admin (Auto-imported from implementations/Predibase_Lorax_GPTQ_Exllama_V2.md)


Knowledge Sources
Domains: Quantization, Inference
Last Updated: 2026-02-08 00:00 GMT

Overview

Implements a 4-bit quantized linear layer using the ExLlama v2 CUDA kernels, supporting both GPTQ and EXL2 quantization formats with deferred post-initialization for shared scratch memory.

Description

This module is adapted from the turboderp exllamav2 project and wraps the exllamav2_kernels C++ extension. It contains:

ext_gemm_half_q_half: A helper that performs half-precision GEMM with a quantized weight matrix handle, calling gemm_half_q_half from the C++ extension. It reshapes the input to 2D and produces output of the specified width.

make_group_map: Constructs the group map tensor needed for irregular group sizes in the EXL2 format. It iterates over the quantization groups and, for every row, records the index of the group the row belongs to together with the number of rows remaining in that group.

ext_make_q_matrix: Creates a quantized matrix handle via the C++ make_q_matrix function. It supports two formats: EXL2 (using q_weight, q_perm, q_invperm, q_scale, q_scale_max, q_groups) and GPTQ (using qweight, qzeros, scales, with optional g_idx for act-order). A dummy none_tensor on the meta device is used for absent parameters.

QuantLinear: An nn.Module with QUANT_TYPE = "exllamav2" that holds the quantized weight tensors. During __init__, it calculates the scratch space it requires and registers itself in the global LAYERS list. The Q matrix handle is not created at construction time; create_exllama_buffers later calls post_init, which allocates the handle using shared scratch memory managed by ExLlamaV2DeviceTensors. The forward method calls ext_gemm_half_q_half with an optional force_cuda flag and adds the bias if one was provided.

ExLlamaV2DeviceTensors: A helper class that lazily allocates a contiguous scratch buffer on GPU and provides aligned slices via get_scratch_slice, ensuring 128-byte alignment.

create_exllama_buffers: Iterates over all registered layers and calls their post_init with the shared device tensors.
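
The group map built by make_group_map can be illustrated with a minimal pure-Python sketch. It is not the module's code: it assumes q_groups is a flat sequence of (bits, first packed row) pairs and that each packed row holds 32 bits per column, neither of which this page confirms. The name make_group_map_sketch is hypothetical.

```python
def make_group_map_sketch(q_groups, num_qrows):
    """Return a flat list of (group index, rows remaining) pairs, one per row.

    Assumed layout (not confirmed by this page):
      q_groups = [bits_0, start_0, bits_1, start_1, ...]
    where start_i is the first packed row of group i and every packed
    row stores 32 bits per column.
    """
    group_map = []
    num_groups = len(q_groups) // 2
    for i in range(num_groups):
        bits = q_groups[2 * i]
        start = q_groups[2 * i + 1]
        # A group ends where the next one starts; the last group runs to the end.
        end = q_groups[2 * i + 3] if i < num_groups - 1 else num_qrows
        # Number of unpacked weight rows covered by this group.
        rows = (end - start) * 32 // bits
        for j in range(rows):
            group_map.append(i)         # group this row belongs to
            group_map.append(rows - j)  # rows left in the group, counting this one
    return group_map
```

With two groups of different bit widths, for example q_groups = [4, 0, 8, 1] and num_qrows = 2, the first group covers 8 rows and the second covers 4, which is the irregular case a fixed group size cannot express.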

Usage

This is the primary quantized linear layer used when loading GPTQ models with exllama v2 acceleration (the default path when use_exllama is True in the get_linear factory). Layers register themselves during construction, and create_exllama_buffers must be called before inference to initialize shared scratch memory.
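
The register-then-initialize pattern described above can be sketched without a GPU. The following is an illustration only, under stated assumptions: ScratchPool, DeferredLayer, create_buffers, round_up and MAX_SCRATCH_BYTES are hypothetical stand-ins, a bytearray replaces GPU memory, and every layer is assumed to reuse the start of the shared buffer (the real allocation strategy may differ).

```python
ALIGNMENT = 128        # bytes; slice sizes are rounded up to this multiple
LAYERS = []            # registry filled while layers are constructed
MAX_SCRATCH_BYTES = 0  # largest aligned scratch requirement seen so far


def round_up(size_bytes, alignment=ALIGNMENT):
    """Round size_bytes up to the next multiple of alignment."""
    return (size_bytes + alignment - 1) // alignment * alignment


class ScratchPool:
    """Hypothetical stand-in for ExLlamaV2DeviceTensors backed by a bytearray."""

    def __init__(self, scratch_bytes):
        self.scratch_bytes = scratch_bytes
        self.scratch = None  # allocated lazily on first request

    def get_scratch_slice(self, size_bytes):
        if self.scratch is None:
            self.scratch = bytearray(self.scratch_bytes)
        # Assumption: all layers share the region starting at offset 0.
        return memoryview(self.scratch)[: round_up(size_bytes)]


class DeferredLayer:
    """Hypothetical layer that registers itself and finishes setup later."""

    def __init__(self, name, scratch_needed):
        global MAX_SCRATCH_BYTES
        self.name = name
        self.scratch_needed = scratch_needed
        self.scratch = None  # assigned in post_init
        MAX_SCRATCH_BYTES = max(MAX_SCRATCH_BYTES, round_up(scratch_needed))
        LAYERS.append(self)

    def post_init(self, pool):
        self.scratch = pool.get_scratch_slice(self.scratch_needed)

    def forward(self, x):
        if self.scratch is None:
            raise RuntimeError("call create_buffers() before inference")
        return x  # placeholder for the quantized matmul


def create_buffers():
    """Size one pool for the largest layer, then initialize every layer."""
    pool = ScratchPool(MAX_SCRATCH_BYTES)
    for layer in LAYERS:
        layer.post_init(pool)
    return pool
```

Deferring initialization lets the pool be sized once, after every layer has reported its requirement, instead of allocating a separate buffer per layer. It also explains why calling a layer before the buffers exist has to fail.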

Code Reference

Source Location

  • Repository: Predibase_Lorax
  • File: server/lorax_server/layers/gptq/exllamav2.py
  • Lines: 1-228

Signature

class QuantLinear(nn.Module):
    QUANT_TYPE = "exllamav2"

    def __init__(self, qweight, qzeros, scales, g_idx, bias, bits, groupsize): ...

class ExLlamaV2DeviceTensors:
    def __init__(self, device, scratch_bytes): ...

Import

from lorax_server.layers.gptq.exllamav2 import QuantLinear, create_exllama_buffers

I/O Contract

Inputs

Name | Type | Required | Description
qweight | torch.Tensor (int32) | Yes | Packed 4-bit quantized weight matrix on CUDA
qzeros | torch.Tensor (int32) | Yes | Packed quantized zero points
scales | torch.Tensor (float16) | Yes | Per-group scale factors
g_idx | torch.Tensor (int32) or None | No | Group index for activation ordering
bias | torch.Tensor or None | No | Optional bias vector
bits | int | Yes | Must be 4 (only 4-bit quantization is supported by exllamav2)
groupsize | int | Yes | Number of input features per quantization group
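
To make "packed 4-bit" concrete, here is a minimal pure-Python sketch of storing eight 4-bit values in one 32-bit word and recovering them. It rests on assumptions this page does not confirm: values are packed lowest nibble first, and packing runs along the input dimension, so the unpacked input width is the packed row count times 32 / bits. The function names are hypothetical, and the sketch uses unsigned Python integers rather than int32 tensors.

```python
BITS = 4
PER_WORD = 32 // BITS  # eight 4-bit values fit in one 32-bit word


def pack_int4(values):
    """Pack a list of 4-bit integers into 32-bit words, lowest nibble first."""
    if len(values) % PER_WORD:
        raise ValueError("length must be a multiple of %d" % PER_WORD)
    words = []
    for base in range(0, len(values), PER_WORD):
        word = 0
        for i, v in enumerate(values[base:base + PER_WORD]):
            if not 0 <= v < 2 ** BITS:
                raise ValueError("value out of 4-bit range: %r" % v)
            word |= v << (BITS * i)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the 4-bit values in their original order."""
    mask = 2 ** BITS - 1
    return [(w >> (BITS * i)) & mask for w in words for i in range(PER_WORD)]


def infeatures_from_packed_rows(packed_rows, bits=BITS):
    """Unpacked input width, assuming packing runs along the input dimension."""
    return packed_rows * 32 // bits
```

Under these assumptions, a qweight with 512 packed rows would correspond to 4096 input features, and a groupsize of 128 would divide them into 32 quantization groups.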

Outputs

Name | Type | Description
output | torch.Tensor (float16) | Result of the quantized linear transformation

Usage Examples

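# Used internally by model layers via the linear factory.
import torch
from lorax_server.layers.gptq.exllamav2 import QuantLinear, set_device, create_exllama_buffers

# qweight, qzeros, scales, g_idx and bias are CUDA tensors loaded from a GPTQ checkpoint (see Inputs).
set_device(torch.device("cuda:0"))  # select the CUDA device before building layers
layer = QuantLinear(qweight, qzeros, scales, g_idx, bias, bits=4, groupsize=128)  # registers itself
create_exllama_buffers(max_total_tokens=2048)  # calls post_init on every registered layer
output = layer(input_tensor)  # input_tensor: float16 CUDA tensor; output is float16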
