Implementation:Predibase Lorax GPTQ Quantize Engine

Knowledge Sources	Predibase_Lorax
Domains	Quantization, Inference
Last Updated	2026-02-08 00:00 GMT

Overview

Implements the full GPTQ post-training quantization pipeline, including the Hessian-based weight quantizer, calibration data loaders, layer-sequential quantization, weight packing, and model export to safetensors format.

Description

This module provides the complete GPTQ quantization workflow for causal language models. It consists of the following major components:

Quantizer (nn.Module): A configurable quantization module that computes scale and zero-point parameters for a given tensor. It supports per-channel and per-tensor quantization, symmetric and asymmetric modes, and an optional MSE-based parameter search over a grid of shrink factors. The configure() method sets the quantization parameters (bits, perchannel, sym, mse, norm, grid, maxshrink, trits), and find_params() computes the scale and zero values.

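GPTQ: The core GPTQ algorithm class. It wraps an nn.Module layer and accumulates the Hessian matrix (H = 2 X X^T) over calibration batches via add_batch(). The fasterquant() method then performs the actual quantization: it uses a Cholesky decomposition of the inverse Hessian and quantizes the weights column by column, adjusting the not-yet-quantized columns to compensate for the error introduced at each step (the approach inherited from Optimal Brain Quantization, "OBQ").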

Dataset loaders: The functions get_wikitext2(), get_ptb(), get_ptb_new(), get_c4(), and get_c4_new() load calibration datasets, and the get_loaders() dispatcher selects the appropriate loader by name.

find_layers(): Recursively finds all nn.Linear and nn.Conv2d layers in a model, excluding lm_head.

sequential(): Orchestrates the layer-by-layer quantization process, running calibration data through each transformer layer, computing Hessian matrices, and quantizing weights.

make_quant_linear() and pack(): Replace standard linear layers with QuantLinear instances and pack the quantized weights into the compressed integer format.

quantize(): The top-level entry point that loads a model, runs the full quantization pipeline, saves the quantized model as safetensors, and optionally uploads it to the Hugging Face Hub.
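
The scale and zero-point computation described for Quantizer can be illustrated with a minimal sketch. It assumes the plain asymmetric min/max rule (no MSE search, no symmetric mode) applied to a single row of weights, and it uses pure Python lists rather than tensors. The helper names find_params_row and fake_quantize are hypothetical and do not appear in the module.

```python
def find_params_row(row, bits):
    """Asymmetric min/max parameters for one row (illustrative only)."""
    maxq = 2 ** bits - 1               # largest representable integer level
    xmin = min(min(row), 0.0)          # the range always includes zero
    xmax = max(max(row), 0.0)
    if xmin == 0.0 and xmax == 0.0:    # avoid a zero-width range
        xmin, xmax = -1.0, 1.0
    scale = (xmax - xmin) / maxq       # step size between adjacent levels
    zero = round(-xmin / scale)        # integer level that represents 0.0
    return scale, zero, maxq


def fake_quantize(x, scale, zero, maxq):
    """Quantize x to an integer level, then map it back to a float."""
    q = min(max(round(x / scale) + zero, 0), maxq)
    return scale * (q - zero)


row = [-1.0, 0.0, 0.5, 2.0]
scale, zero, maxq = find_params_row(row, bits=4)
dequantized = [fake_quantize(x, scale, zero, maxq) for x in row]
```

Each dequantized value stays close to its original as long as the original lies inside the [xmin, xmax] range, which is why a grid search that shrinks the range can trade clipping error against rounding error.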

Usage

This module is a standalone quantization tool invoked from the LoRAX CLI. The main entry point is quantize(), which takes a model ID, bit-width, group size, output directory, and other parameters. It loads the model with empty weights using accelerate, attaches weight-loading hooks, runs calibration, quantizes all layers sequentially, packs the results, and saves the output. LoRAX can then serve the resulting quantized model using the QuantLinear layer from quant_linear.py.
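
To make the packing step more concrete, the following is a rough sketch of how low-bit integer levels can be stored in 32-bit words. It assumes a flat list of values and a bit-width that divides 32 (2, 4, or 8), with the first value placed in the lowest bits. The real QuantLinear packing operates on 2D weight matrices, and pack_values and unpack_values are hypothetical names used only for illustration.

```python
def pack_values(values, bits):
    """Pack integer levels into 32-bit words, lowest bits first."""
    per_word = 32 // bits              # e.g. eight 4-bit values per word
    mask = (1 << bits) - 1
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for offset, value in enumerate(values[start:start + per_word]):
            word |= (value & mask) << (bits * offset)
        words.append(word)
    return words


def unpack_values(words, bits, count):
    """Recover the first `count` integer levels from packed words."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    values = []
    for word in words:
        for offset in range(per_word):
            values.append((word >> (bits * offset)) & mask)
    return values[:count]


packed = pack_values([1, 2, 3, 4, 5, 6, 7, 8], bits=4)
restored = unpack_values(packed, bits=4, count=8)
```

With 4-bit levels, eight weights share one 32-bit word, which is where the roughly fourfold size reduction relative to 16-bit weights comes from (scales and zero-points add a small overhead per group).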

Code Reference

Source Location

Repository: Predibase_Lorax
File: server/lorax_server/utils/gptq/quantize.py
Lines: 1-933

Signature

class Quantizer(nn.Module):
    def __init__(self, shape=1)
    def configure(self, bits, perchannel=False, sym=True, mse=False, norm=2.4, grid=100, maxshrink=0.8, trits=False)
    def find_params(self, x, weight=False)
    def quantize(self, x)

class GPTQ:
    def __init__(self, layer, observe=False)
    def add_batch(self, inp, out)
    def fasterquant(self, blocksize=128, percdamp=0.01, groupsize=-1, act_order=False, name="")

def get_loaders(name, nsamples=128, seed=0, seqlen=2048, model_id="")
def find_layers(module, layers=(nn.Conv2d, nn.Linear), name="")
def sequential(model, dataloader, dev, nsamples, bits, groupsize, ...)
def make_quant_linear(module, names, bits, groupsize, name="")
def pack(model, quantizers, bits, groupsize)
def quantize(
    model_id: str, bits: int, groupsize: int, output_dir: str,
    revision: str, trust_remote_code: bool, upload_to_model_id: Optional[str],
    percdamp: float, act_order: bool,
)

Import

from lorax_server.utils.gptq.quantize import Quantizer, GPTQ, quantize

I/O Contract

Inputs

Name	Type	Required	Description
model_id	str	Yes	Hugging Face model identifier to quantize
bits	int	Yes	Target quantization bit-width (2, 4, or 8)
groupsize	int	Yes	Number of input features per quantization group (-1 for per-channel)
output_dir	str	Yes	Directory to save the quantized model
revision	str	Yes	Model revision/branch to load
trust_remote_code	bool	Yes	Whether to trust remote code for model loading
upload_to_model_id	Optional[str]	No	Hugging Face repo ID to upload the quantized model
percdamp	float	Yes	Hessian dampening factor (fraction of average diagonal)
act_order	bool	Yes	Whether to quantize columns in order of decreasing activation magnitude
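
The effect of percdamp and act_order can be sketched on a small Hessian. This is an illustrative pure-Python version under two assumptions: dampening adds percdamp times the mean diagonal to every diagonal entry, and the column order for act_order is taken from the Hessian diagonal in decreasing order, used here as the proxy for activation magnitude. The function names are hypothetical.

```python
def dampen_hessian(H, percdamp=0.01):
    """Return a copy of H with percdamp * mean(diag(H)) added to its diagonal."""
    n = len(H)
    damp = percdamp * sum(H[i][i] for i in range(n)) / n
    return [
        [H[i][j] + (damp if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def act_order_permutation(H):
    """Column indices sorted by decreasing Hessian diagonal (assumed proxy)."""
    return sorted(range(len(H)), key=lambda i: H[i][i], reverse=True)


H = [
    [2.0, 1.0, 0.0],
    [1.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
]
damped = dampen_hessian(H, percdamp=0.01)
order = act_order_permutation(H)
```

Dampening keeps the matrix well conditioned before it is inverted, and a larger percdamp trades some quantization accuracy for numerical stability.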

Outputs

Name	Type	Description
Quantized model files	safetensors files	Saved to output_dir as sharded safetensors with metadata indicating GPTQ quantization
Config and tokenizer	JSON/files	Saved model config and tokenizer alongside the quantized weights

Usage Examples

# Run quantization from Python
from lorax_server.utils.gptq.quantize import quantize

quantize(
    model_id="meta-llama/Llama-2-7b-hf",
    bits=4,
    groupsize=128,
    output_dir="/output/llama-2-7b-gptq",
    revision="main",
    trust_remote_code=False,
    upload_to_model_id=None,
    percdamp=0.01,
    act_order=False,
)

Related Pages

Environment:Predibase_Lorax_CUDA_GPU_Runtime
