Implementation:Predibase Lorax GPTQ Quantize Engine
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements the full GPTQ post-training quantization pipeline, including the Hessian-based weight quantizer, calibration data loaders, layer-sequential quantization, weight packing, and model export to safetensors format.
Description
This module provides the complete GPTQ quantization workflow for causal language models. It consists of the following major components:
- Quantizer (nn.Module): A configurable quantization module that finds optimal scale and zero-point parameters for a given tensor. Supports per-channel and per-tensor quantization, symmetric and asymmetric modes, and MSE-based parameter search over a grid. The configure() method sets quantization parameters (bits, perchannel, sym, mse, norm, grid, maxshrink, trits), and find_params() computes the optimal scale and zero values.
- GPTQ: The core GPTQ algorithm class. It takes an nn.Module layer and accumulates the Hessian matrix (H = 2 X X^T) via add_batch(). The fasterquant() method performs the actual quantization using Cholesky decomposition and iterative column-wise quantization with error compensation (the "OBQ" approach).
- Dataset loaders: Functions get_wikitext2(), get_ptb(), get_ptb_new(), get_c4(), get_c4_new() load calibration datasets. The get_loaders() dispatcher selects the appropriate loader by name.
- find_layers(): Recursively finds all nn.Linear and nn.Conv2d layers in a model, excluding lm_head.
- sequential(): Orchestrates the layer-by-layer quantization process, running calibration data through each transformer layer, computing Hessian matrices, and quantizing weights.
- make_quant_linear() and pack(): Replace standard linear layers with QuantLinear instances and pack the quantized weights into the compressed integer format.
- quantize(): The top-level entry point that loads a model, runs the full quantization pipeline, saves the quantized model as safetensors, and optionally uploads it to the Hugging Face Hub.
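The scale and zero-point computation attributed to the Quantizer component can be illustrated with a minimal pure-Python sketch of asymmetric min/max quantization applied per channel. The helper names minmax_params and fake_quantize are hypothetical and are not part of the module; the MSE grid search and the symmetric mode are omitted, so this is an illustration of the idea rather than the module's implementation.

```python
def minmax_params(channel, bits=4):
    """Return (scale, zero) for asymmetric min/max quantization of one channel.

    Hypothetical helper for illustration only.
    """
    maxq = 2 ** bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(channel), 0.0)
    hi = max(max(channel), 0.0)
    if lo == hi:  # all-zero channel: pick a dummy range to avoid a zero scale
        lo, hi = -1.0, 1.0
    scale = (hi - lo) / maxq
    zero = round(-lo / scale)  # integer level that represents 0.0
    return scale, zero


def fake_quantize(channel, scale, zero, bits=4):
    """Quantize to integer levels in [0, maxq], then map back to floats."""
    maxq = 2 ** bits - 1
    restored = []
    for value in channel:
        level = min(max(round(value / scale) + zero, 0), maxq)
        restored.append(scale * (level - zero))
    return restored


# Per-channel quantization: each row gets its own scale and zero-point.
weights = [[-1.0, -0.5, 0.0, 0.5, 2.0], [0.1, 0.2, 0.3, 0.4, 0.5]]
params = [minmax_params(row) for row in weights]
restored = [fake_quantize(row, s, z) for row, (s, z) in zip(weights, params)]
```

Per-tensor quantization would instead compute a single (scale, zero) pair over all rows, which is cheaper to store but usually less accurate when channel ranges differ.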
Usage
This module is used as a standalone quantization tool invoked from the LoRAX CLI. The quantize() function is the main entry point, taking a model ID, bit-width, group size, output directory, and other parameters. It loads the model with empty weights using accelerate, attaches weight loading hooks, runs calibration, quantizes all layers sequentially, packs the results, and saves the output. The resulting quantized model can then be served by LoRAX using the QuantLinear layer from quant_linear.py.
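The packing step mentioned above stores several low-bit integer levels in each 32-bit word. The following pure-Python sketch shows the general idea for 4-bit values. The function names and the low-bits-first ordering are assumptions made for illustration; the actual tensor layout used by pack() and QuantLinear may differ.

```python
def pack_words(levels, bits=4):
    """Pack small unsigned integers into 32-bit words, low bits first.

    Hypothetical helper; the bit order is an assumption for illustration.
    """
    per_word = 32 // bits
    if len(levels) % per_word != 0:
        raise ValueError("number of levels must be a multiple of 32 // bits")
    words = []
    for start in range(0, len(levels), per_word):
        word = 0
        for position, level in enumerate(levels[start:start + per_word]):
            if not 0 <= level < (1 << bits):
                raise ValueError("level does not fit in the given bit-width")
            word |= level << (bits * position)
        words.append(word)
    return words


def unpack_words(words, bits=4):
    """Inverse of pack_words: recover the integer levels from packed words."""
    mask = (1 << bits) - 1
    per_word = 32 // bits
    return [(word >> (bits * position)) & mask
            for word in words
            for position in range(per_word)]


levels = [1, 2, 3, 4, 5, 6, 7, 8]
packed = pack_words(levels)       # eight 4-bit levels fit in one word
recovered = unpack_words(packed)  # round-trips back to the original levels
```

With 4-bit weights, this reduces storage for the weight matrix by a factor of eight relative to 32-bit values, before accounting for the per-group scales and zero-points.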
Code Reference
Source Location
- Repository: Predibase_Lorax
- File: server/lorax_server/utils/gptq/quantize.py
- Lines: 1-933
Signature
class Quantizer(nn.Module):
    def __init__(self, shape=1)
    def configure(self, bits, perchannel=False, sym=True, mse=False, norm=2.4, grid=100, maxshrink=0.8, trits=False)
    def find_params(self, x, weight=False)
    def quantize(self, x)

class GPTQ:
    def __init__(self, layer, observe=False)
    def add_batch(self, inp, out)
    def fasterquant(self, blocksize=128, percdamp=0.01, groupsize=-1, act_order=False, name="")

def get_loaders(name, nsamples=128, seed=0, seqlen=2048, model_id="")
def find_layers(module, layers=(nn.Conv2d, nn.Linear), name="")
def sequential(model, dataloader, dev, nsamples, bits, groupsize, ...)
def make_quant_linear(module, names, bits, groupsize, name="")
def pack(model, quantizers, bits, groupsize)

def quantize(
    model_id: str, bits: int, groupsize: int, output_dir: str,
    revision: str, trust_remote_code: bool, upload_to_model_id: Optional[str],
    percdamp: float, act_order: bool,
)
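As an illustration of what add_batch() is described as doing, the sketch below accumulates a Hessian estimate over several calibration batches in pure Python. It assumes a running-average update so that the result equals (2 / N) times the sum of x x^T over all N samples seen so far, which is how the reference GPTQ code normalizes H; the class name HessianAccumulator and this normalization are assumptions, not confirmed details of this file.

```python
class HessianAccumulator:
    """Running estimate of H = (2 / N) * sum_k x_k x_k^T (hypothetical sketch)."""

    def __init__(self, columns):
        self.columns = columns
        self.H = [[0.0] * columns for _ in range(columns)]
        self.nsamples = 0

    def add_batch(self, batch):
        """batch: list of input vectors, each with `columns` entries."""
        new = len(batch)
        # Shrink the old estimate so it stays an average over all samples.
        keep = self.nsamples / (self.nsamples + new)
        self.nsamples += new
        weight = 2.0 / self.nsamples
        for i in range(self.columns):
            for j in range(self.columns):
                outer = sum(x[i] * x[j] for x in batch)
                self.H[i][j] = self.H[i][j] * keep + weight * outer


acc = HessianAccumulator(columns=2)
acc.add_batch([[1.0, 2.0]])
acc.add_batch([[3.0, 4.0], [0.0, 1.0]])
hessian = acc.H  # same result as processing all three samples at once
```

Updating incrementally means the calibration inputs never need to be held in memory together; only the columns-by-columns matrix is kept per layer.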
Import
from lorax_server.utils.gptq.quantize import Quantizer, GPTQ, quantize
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_id | str | Yes | Hugging Face model identifier to quantize |
| bits | int | Yes | Target quantization bit-width (2, 4, or 8) |
| groupsize | int | Yes | Number of input features per quantization group (-1 for per-channel) |
| output_dir | str | Yes | Directory to save the quantized model |
| revision | str | Yes | Model revision/branch to load |
| trust_remote_code | bool | Yes | Whether to trust remote code for model loading |
| upload_to_model_id | Optional[str] | No | Hugging Face repo ID to upload the quantized model |
| percdamp | float | Yes | Hessian dampening factor (fraction of average diagonal) |
| act_order | bool | Yes | Whether to quantize columns in order of decreasing Hessian diagonal, which serves as a proxy for activation magnitude |
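The percdamp and act_order inputs can be made concrete with a small pure-Python sketch. It assumes that dampening adds percdamp times the mean of the Hessian diagonal to every diagonal entry, and that activation ordering sorts columns by decreasing Hessian diagonal, as in the reference GPTQ code. The helper names dampen and act_order_permutation are hypothetical.

```python
def dampen(H, percdamp=0.01):
    """Return a copy of H with percdamp * mean(diag(H)) added to the diagonal.

    Hypothetical helper; this keeps H well-conditioned before it is
    factorized (the description above mentions Cholesky decomposition).
    """
    n = len(H)
    damp = percdamp * sum(H[i][i] for i in range(n)) / n
    return [[H[i][j] + (damp if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def act_order_permutation(H):
    """Column indices sorted by decreasing Hessian diagonal (assumed ordering)."""
    return sorted(range(len(H)), key=lambda i: H[i][i], reverse=True)


H = [[4.0, 1.0], [1.0, 2.0]]
damped = dampen(H, percdamp=0.01)  # mean diagonal is 3.0, so 0.03 is added

H3 = [[1.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 3.0]]
order = act_order_permutation(H3)  # columns with the largest diagonal first
```

A larger percdamp makes the factorization more robust at the cost of a less faithful Hessian, so small values such as the 0.01 default shown in the signature are a common starting point.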
Outputs
| Name | Type | Description |
|---|---|---|
| Quantized model files | safetensors files | Saved to output_dir as sharded safetensors with metadata indicating GPTQ quantization |
| Config and tokenizer | JSON/files | Saved model config and tokenizer alongside the quantized weights |
Usage Examples
# Run quantization from Python
from lorax_server.utils.gptq.quantize import quantize
quantize(
    model_id="meta-llama/Llama-2-7b-hf",
    bits=4,
    groupsize=128,
    output_dir="/output/llama-2-7b-gptq",
    revision="main",
    trust_remote_code=False,
    upload_to_model_id=None,
    percdamp=0.01,
    act_order=False,
)