Implementation:Predibase Lorax GPTQ Quantize Engine

From Leeroopedia


Knowledge Sources

  • Domains: Quantization, Inference
  • Last Updated: 2026-02-08 00:00 GMT

Overview

Implements the full GPTQ post-training quantization pipeline, including the Hessian-based weight quantizer, calibration data loaders, layer-sequential quantization, weight packing, and model export to safetensors format.

Description

This module provides the complete GPTQ quantization workflow for causal language models. It consists of the following major components:

  • Quantizer (nn.Module): A configurable quantization module that computes scale and zero-point parameters for a given tensor. It supports per-channel and per-tensor quantization, symmetric and asymmetric modes, and an optional MSE-based parameter search over a grid. The configure() method sets the quantization parameters (bits, perchannel, sym, mse, norm, grid, maxshrink, trits), and find_params() computes the scale and zero values.
  • GPTQ: The core GPTQ algorithm class. It wraps an nn.Module layer and accumulates the Hessian matrix (H = 2 X X^T) from calibration inputs via add_batch(). The fasterquant() method performs the actual quantization using Cholesky decomposition and iterative column-wise quantization with error compensation (the Optimal Brain Quantization, or "OBQ", approach).
  • Dataset loaders: The functions get_wikitext2(), get_ptb(), get_ptb_new(), get_c4(), and get_c4_new() load calibration datasets. The get_loaders() dispatcher selects the appropriate loader by name.
  • find_layers(): Recursively finds all nn.Linear and nn.Conv2d layers in a model, excluding lm_head.
  • sequential(): Orchestrates the layer-by-layer quantization process: it runs calibration data through each transformer layer, computes the Hessian matrices, and quantizes the weights.
  • make_quant_linear() and pack(): Replace standard linear layers with QuantLinear instances and pack the quantized weights into the compressed integer format.
  • quantize(): The top-level entry point. It loads a model, runs the full quantization pipeline, saves the quantized model as safetensors, and optionally uploads it to the Hugging Face Hub.
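The scale and zero-point computation attributed to the Quantizer component can be pictured with a small pure-Python sketch. It assumes plain asymmetric min-max calibration on a single weight row (the case without the MSE search); the function bodies are illustrative and are not copied from the module, even though the names echo find_params() and quantize().

```python
def find_params(row, bits=4):
    """Return (scale, zero, maxq) for one weight row via min-max calibration."""
    maxq = 2 ** bits - 1
    # The range is widened to include 0 so that 0.0 stays exactly representable.
    xmin = min(min(row), 0.0)
    xmax = max(max(row), 0.0)
    if xmin == xmax:  # all-zero row: fall back to a dummy range
        xmin, xmax = -1.0, 1.0
    scale = (xmax - xmin) / maxq
    zero = round(-xmin / scale)  # integer grid index that maps back to 0.0
    return scale, zero, maxq


def quantize(x, scale, zero, maxq):
    """Round x onto the integer grid, clamp to [0, maxq], and dequantize."""
    q = min(max(round(x / scale) + zero, 0), maxq)
    return scale * (q - zero)


row = [-1.0, 0.0, 0.5, 2.0]
scale, zero, maxq = find_params(row, bits=4)
dequantized = [quantize(x, scale, zero, maxq) for x in row]
```

For values inside the calibrated range, the round-trip error is bounded by half a grid step (scale / 2), which is why a smaller range or a higher bit-width gives a finer approximation.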

Usage

This module is a standalone quantization tool invoked from the LoRAX CLI. The quantize() function is the main entry point; it takes a model ID, bit-width, group size, output directory, and other parameters. It loads the model with empty weights using accelerate, attaches weight-loading hooks, runs calibration, quantizes all layers sequentially, packs the results, and saves the output. The resulting quantized model can then be served by LoRAX using the QuantLinear layer from quant_linear.py.
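The "packs the results" step can be illustrated with a minimal sketch of packing low-bit integers into 32-bit words. The helper names pack_words and unpack_words are hypothetical, and the layout (first value in the lowest bits) is an assumption made for illustration; the exact tensor layout produced by pack() should be checked against the source file.

```python
def pack_words(values, bits=4):
    """Pack unsigned `bits`-wide integers into 32-bit words, lowest bits first."""
    per_word = 32 // bits
    if len(values) % per_word:
        raise ValueError("number of values must fill whole 32-bit words")
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for offset, v in enumerate(values[start:start + per_word]):
            if not 0 <= v < (1 << bits):
                raise ValueError(f"value {v} does not fit in {bits} bits")
            word |= v << (bits * offset)
        words.append(word)
    return words


def unpack_words(words, bits=4):
    """Inverse of pack_words: recover the original integers."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    return [(w >> (bits * o)) & mask for w in words for o in range(per_word)]


values = [1, 2, 3, 4, 5, 6, 7, 8]
packed = pack_words(values, bits=4)
print(hex(packed[0]))  # 0x87654321
```

With 4-bit weights, eight values share one 32-bit word, so the packed weight tensor is one eighth the length of the unpacked integer tensor along the packed dimension.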

Code Reference

Source Location

  • Repository: Predibase_Lorax
  • File: server/lorax_server/utils/gptq/quantize.py
  • Lines: 1-933

Signature

class Quantizer(nn.Module):
    def __init__(self, shape=1)
    def configure(self, bits, perchannel=False, sym=True, mse=False, norm=2.4, grid=100, maxshrink=0.8, trits=False)
    def find_params(self, x, weight=False)
    def quantize(self, x)

class GPTQ:
    def __init__(self, layer, observe=False)
    def add_batch(self, inp, out)
    def fasterquant(self, blocksize=128, percdamp=0.01, groupsize=-1, act_order=False, name="")

def get_loaders(name, nsamples=128, seed=0, seqlen=2048, model_id="")
def find_layers(module, layers=(nn.Conv2d, nn.Linear), name="")
def sequential(model, dataloader, dev, nsamples, bits, groupsize, ...)
def make_quant_linear(module, names, bits, groupsize, name="")
def pack(model, quantizers, bits, groupsize)
def quantize(
    model_id: str, bits: int, groupsize: int, output_dir: str,
    revision: str, trust_remote_code: bool, upload_to_model_id: Optional[str],
    percdamp: float, act_order: bool,
)
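The add_batch() and fasterquant() signatures can be read against the following NumPy sketch of the two steps they correspond to: a running Hessian estimate and column-by-column quantization with error compensation. It is an illustration under simplifying assumptions, not the module's code. The names accumulate_hessian, round_to_grid, and gptq_quantize are hypothetical; the Hessian is averaged over the number of samples; a single symmetric scale is shared by the whole matrix; and blocking (blocksize), grouping (groupsize), and act_order are left out.

```python
import numpy as np


def accumulate_hessian(batches):
    """Running estimate of H = (2 / n) * X @ X.T over calibration batches.

    Each batch has shape (in_features, n_tokens).
    """
    H, n = None, 0
    for x in batches:
        x = np.asarray(x, dtype=np.float64)
        if H is None:
            H = np.zeros((x.shape[0], x.shape[0]))
        k = x.shape[1]
        H *= n / (n + k)            # rescale the previous average
        n += k
        H += (2.0 / n) * (x @ x.T)  # add the new batch's contribution
    return H


def round_to_grid(w, scale, maxq):
    """Symmetric round-to-nearest onto a grid with maxq + 1 levels."""
    zero = (maxq + 1) / 2
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)


def gptq_quantize(W, H, scale, maxq, percdamp=0.01):
    """Quantize W (out_features x in_features) one input column at a time."""
    W = W.astype(np.float64).copy()
    H = H.copy()
    n = H.shape[0]
    # Damping keeps the Hessian well conditioned before inversion.
    H[np.diag_indices(n)] += percdamp * np.mean(np.diag(H))
    Hinv = np.linalg.inv(H)
    U = np.linalg.cholesky(Hinv).T  # upper triangular, Hinv == U.T @ U
    Q = np.zeros_like(W)
    for i in range(n):
        q = round_to_grid(W[:, i], scale, maxq)
        Q[:, i] = q
        err = (W[:, i] - q) / U[i, i]
        # Spread this column's rounding error over the not-yet-quantized columns.
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 256))   # calibration inputs: 8 features, 256 tokens
W = rng.normal(size=(4, 8))     # toy layer weights
H = accumulate_hessian([X[:, :128], X[:, 128:]])
maxq = 15                       # 4-bit grid
scale = 2 * np.abs(W).max() / maxq
Q = gptq_quantize(W, H, scale, maxq)
rtn = round_to_grid(W, scale, maxq)  # plain round-to-nearest baseline
```

Comparing np.linalg.norm((W - Q) @ X) with np.linalg.norm((W - rtn) @ X) shows how the compensated result and the round-to-nearest baseline differ in layer-output error on the calibration inputs.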

Import

from lorax_server.utils.gptq.quantize import Quantizer, GPTQ, quantize

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| model_id | str | Yes | Hugging Face model identifier to quantize |
| bits | int | Yes | Target quantization bit-width (2, 4, or 8) |
| groupsize | int | Yes | Number of input features per quantization group (-1 for per-channel, i.e. no grouping) |
| output_dir | str | Yes | Directory to save the quantized model |
| revision | str | Yes | Model revision/branch to load |
| trust_remote_code | bool | Yes | Whether to trust remote code for model loading |
| upload_to_model_id | Optional[str] | No | Hugging Face repo ID to upload the quantized model to |
| percdamp | float | Yes | Hessian damping factor (fraction of the average diagonal value) |
| act_order | bool | Yes | Whether to quantize columns in order of decreasing activation magnitude |
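The percdamp and act_order inputs are easiest to see on a tiny Hessian. The sketch below assumes that damping adds percdamp times the mean diagonal value to every diagonal entry, and that act_order ranks columns by their Hessian diagonal (used here as the measure of activation magnitude), as in the reference GPTQ implementation. The helper names damp_diagonal and activation_order are hypothetical.

```python
def damp_diagonal(H, percdamp=0.01):
    """Return a copy of H with percdamp * mean(diag(H)) added to the diagonal."""
    n = len(H)
    damp = percdamp * sum(H[i][i] for i in range(n)) / n
    return [
        [H[i][j] + (damp if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def activation_order(H):
    """Column indices sorted by decreasing Hessian diagonal."""
    return sorted(range(len(H)), key=lambda i: H[i][i], reverse=True)


H = [
    [4.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 7.0],
]
damped = damp_diagonal(H, percdamp=0.01)  # mean diagonal is 4.0, so 0.04 is added
perm = activation_order(H)                # [2, 0, 1]
```

Under these assumptions, a larger percdamp makes the inversion more stable at the cost of a less faithful Hessian, while act_order=True lets the columns with the largest diagonal entries be quantized first.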

Outputs

| Name | Type | Description |
|---|---|---|
| Quantized model files | safetensors files | Saved to output_dir as sharded safetensors with metadata indicating GPTQ quantization |
| Config and tokenizer | JSON/files | Saved model config and tokenizer alongside the quantized weights |

Usage Examples

# Run quantization from Python
from lorax_server.utils.gptq.quantize import quantize

quantize(
    model_id="meta-llama/Llama-2-7b-hf",
    bits=4,
    groupsize=128,
    output_dir="/output/llama-2-7b-gptq",
    revision="main",
    trust_remote_code=False,
    upload_to_model_id=None,
    percdamp=0.01,
    act_order=False,
)
