Implementation:Hpcaitech ColossalAI GPTQ Quantizer

From Leeroopedia


Knowledge Sources
Domains: Model Quantization, Model Optimization, CUDA
Last Updated: 2026-02-09 00:00 GMT

Overview

GPTQ quantization implementation for LLaMA models, providing weight quantization and CUDA-accelerated quantized linear layers.

Description

This module implements GPTQ, a post-training quantization method for generative pre-trained transformers, for LLaMA models. The code is copied from the GPTQ-for-LLaMa project. It contains three main components. The quantize function applies scale-and-zero-point quantization to a tensor. The Quantizer nn.Module computes the quantization parameters (scale and zero point), with support for per-channel quantization, symmetric and asymmetric modes, and an optional MSE-based grid search. The QuantLinear nn.Module implements a quantized linear layer for 2-, 3-, 4-, and 8-bit weights with CUDA-accelerated matrix multiplication. The make_quant helper function recursively replaces the named standard linear layers in a model with QuantLinear layers.
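To make the scale-and-zero-point step concrete, here is a minimal pure-Python sketch of the idea. It is not the module's code: the function name quantize_dequantize is hypothetical, it works on a plain list instead of a tensor, and it assumes the usual round, shift, clamp, and rescale formulation.

```python
def quantize_dequantize(values, scale, zero, maxq):
    """Round each value onto the integer grid [0, maxq], then map it back to a float.

    Hypothetical stand-in for a tensor-based quantize(x, scale, zero, maxq).
    """
    out = []
    for x in values:
        q = round(x / scale) + zero      # integer code before clamping
        q = max(0, min(maxq, q))         # clamp to the representable range
        out.append(scale * (q - zero))   # dequantize back to a float
    return out


# 4-bit grid: maxq = 2**4 - 1 = 15
approx = quantize_dequantize([0.0, 0.26, 2.0, -0.5], scale=0.1, zero=5, maxq=15)
```

Values outside the representable range saturate at the edge of the grid: with these settings the largest code is 15, so 2.0 is mapped to 0.1 * (15 - 5) = 1.0.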

Usage

Use this module to quantize LLaMA model weights for reduced memory footprint and faster inference. It requires the quant_cuda CUDA extension for the forward pass of quantized layers.
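As a rough illustration of the memory saving, the arithmetic below compares the storage needed for one weight matrix at 16 bits and at 4 bits. The 4096 x 4096 layer size is a hypothetical example, and the estimate deliberately ignores the per-group scales, zero points, and bias that a quantized layer also stores.

```python
def weight_bytes(infeatures, outfeatures, bits):
    """Bytes needed to store the weight matrix alone at the given bit width."""
    return infeatures * outfeatures * bits // 8


fp16_bytes = weight_bytes(4096, 4096, 16)  # 33_554_432 bytes
int4_bytes = weight_bytes(4096, 4096, 4)   # 8_388_608 bytes
ratio = fp16_bytes / int4_bytes            # 4.0
```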

Code Reference

Source Location

Signature

def quantize(x, scale, zero, maxq):

class Quantizer(nn.Module):
    def __init__(self, shape=1):
    def configure(self, bits, perchannel=False, sym=True, mse=False, norm=2.4, grid=100, maxshrink=0.8):
    def find_params(self, x, weight=False):
    def quantize(self, x):
    def enabled(self):
    def ready(self):

class QuantLinear(nn.Module):
    def __init__(self, bits, groupsize, infeatures, outfeatures):
    def pack(self, linear, scales, zeros):
    def forward(self, x):

def make_quant(module, names, bits, groupsize, name=""):

Import

from coati.quant.llama_gptq.quant import Quantizer, QuantLinear, make_quant, quantize

I/O Contract

Inputs (Quantizer.configure)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| bits | int | Yes | Number of quantization bits (2, 3, 4, or 8) |
| perchannel | bool | No | Whether to use per-channel quantization; defaults to False |
| sym | bool | No | Whether to use symmetric quantization; defaults to True |
| mse | bool | No | Whether to use MSE-based grid search for the parameters; defaults to False |
| norm | float | No | Norm used for MSE computation; defaults to 2.4 |
| grid | int | No | Grid size for MSE search; defaults to 100 |
| maxshrink | float | No | Maximum shrinkage factor for MSE search; defaults to 0.8 |
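The sketch below shows one way the options in this table can interact. It is a pure-Python approximation written for illustration, not the code of Quantizer.find_params: all function names are hypothetical, it handles a single flat list of values rather than per-channel tensors, and it assumes a min/max range for the default case and a range-shrinking search when mse is enabled.

```python
def value_range(values, sym):
    """Range to cover; always includes 0, and is mirrored around 0 when sym is set."""
    xmin = min(min(values), 0.0)
    xmax = max(max(values), 0.0)
    if sym:
        xmax = max(abs(xmin), xmax)
        if xmin < 0:
            xmin = -xmax
    if xmin == 0.0 and xmax == 0.0:   # all-zero input: avoid a zero scale
        xmin, xmax = -1.0, 1.0
    return xmin, xmax


def params_from_range(xmin, xmax, maxq, sym):
    scale = (xmax - xmin) / maxq
    zero = (maxq + 1) / 2 if sym else round(-xmin / scale)
    return scale, zero


def fake_quant(x, scale, zero, maxq):
    q = max(0, min(maxq, round(x / scale) + zero))
    return scale * (q - zero)


def find_params_sketch(values, bits, sym=True, mse=False,
                       norm=2.4, grid=100, maxshrink=0.8):
    maxq = 2 ** bits - 1
    xmin, xmax = value_range(values, sym)
    scale, zero = params_from_range(xmin, xmax, maxq, sym)
    if not mse:
        return scale, zero

    def error(s, z):
        # Sum of |quantized - original| ** norm over all values.
        return sum(abs(fake_quant(v, s, z, maxq) - v) ** norm for v in values)

    best = error(scale, zero)
    # Try progressively smaller ranges and keep the one with the lowest error.
    for i in range(int(maxshrink * grid)):
        p = 1 - i / grid
        s, z = params_from_range(p * xmin, p * xmax, maxq, sym)
        e = error(s, z)
        if e < best:
            best, scale, zero = e, s, z
    return scale, zero


scale, zero = find_params_sketch([-1.0, 0.5, 2.0], bits=4, sym=False)
```

Shrinking the range trades clipping error on outliers for a finer grid on the remaining values, which is why the search can beat the plain min/max choice when a few values are much larger than the rest.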

Inputs (QuantLinear.__init__)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| bits | int | Yes | Number of quantization bits (2, 3, 4, or 8) |
| groupsize | int | Yes | Group size for quantization; -1 uses infeatures as group size |
| infeatures | int | Yes | Number of input features |
| outfeatures | int | Yes | Number of output features |
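Low-bit layers save memory by storing several integer codes in one machine word. The following sketch illustrates the idea for bit widths that divide 32 (2, 4, and 8). The function names and the low-bits-first ordering are assumptions made for illustration, so the actual layout used by QuantLinear.pack may differ; 3-bit codes need a different layout because 3 does not divide 32, and that case is not shown.

```python
def pack_words(codes, bits):
    """Pack integer codes into 32-bit words, lowest-order bits first (assumed layout)."""
    assert 32 % bits == 0, "this sketch covers bit widths that divide 32"
    per_word = 32 // bits
    assert len(codes) % per_word == 0, "pad the input to a whole number of words"
    words = []
    for start in range(0, len(codes), per_word):
        word = 0
        for j, code in enumerate(codes[start:start + per_word]):
            assert 0 <= code < 2 ** bits
            word |= code << (bits * j)
        words.append(word)
    return words


def unpack_words(words, bits):
    """Inverse of pack_words."""
    per_word = 32 // bits
    mask = 2 ** bits - 1
    return [(w >> (bits * j)) & mask for w in words for j in range(per_word)]


packed = pack_words([1, 2, 3, 4, 5, 6, 7, 8], bits=4)  # one word: 0x87654321
```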

Outputs (QuantLinear.forward)

| Name | Type | Description |
|------|------|-------------|
| return | torch.Tensor | Output tensor with shape matching input except last dim is outfeatures |
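For intuition about what a grouped quantized linear layer computes, here is a slow pure-Python reference for a single input vector. It is a sketch under stated assumptions, not the CUDA kernel: the function name, the nested-list layout (qweight indexed as [input][output], scales and zeros as [group][output]), and the absence of any stored offset on the zero points are all hypothetical.

```python
def reference_forward(x, qweight, scales, zeros, groupsize, bias=None):
    """Dequantize each weight on the fly and accumulate x @ W for one input vector."""
    infeatures = len(qweight)
    outfeatures = len(qweight[0])
    if groupsize == -1:              # one group spanning all input features
        groupsize = infeatures
    out = []
    for o in range(outfeatures):
        acc = 0.0 if bias is None else bias[o]
        for i in range(infeatures):
            g = i // groupsize       # group that input feature i belongs to
            w = scales[g][o] * (qweight[i][o] - zeros[g][o])
            acc += x[i] * w
        out.append(acc)
    return out                       # length is outfeatures


y = reference_forward(
    x=[1.0, 2.0],
    qweight=[[3, 0], [1, 2]],
    scales=[[0.5, 1.0]],
    zeros=[[1, 1]],
    groupsize=-1,
)
```

The result has one entry per output feature, which mirrors the contract above: only the last dimension of the input changes, from infeatures to outfeatures.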

Usage Examples

from coati.quant.llama_gptq.quant import Quantizer, QuantLinear, make_quant

# Configure the quantizer and compute quantization parameters.
# weight_tensor is a placeholder (not defined here), e.g. the weight of an nn.Linear layer.
quantizer = Quantizer()
quantizer.configure(bits=4, perchannel=True, sym=True)
quantizer.find_params(weight_tensor, weight=True)
quantized_weight = quantizer.quantize(weight_tensor)

# Replace the named linear layers with QuantLinear layers.
# model is a placeholder (not defined here) for an already-loaded LLaMA model.
names = ["model.layers.0.self_attn.q_proj", "model.layers.0.self_attn.k_proj"]
make_quant(model, names, bits=4, groupsize=128)
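The recursive replacement that make_quant performs can be pictured with a toy example that needs neither PyTorch nor the CUDA extension. Everything here is a hypothetical stand-in: ToyLinear, ToyQuantLinear, ToyBlock, and replace_named only mimic the shape of the idea (walk the module tree, build dotted names, swap the layers whose names are listed) and are not the module's classes or code.

```python
class ToyLinear:
    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features


class ToyQuantLinear:
    def __init__(self, bits, groupsize, infeatures, outfeatures):
        self.bits = bits
        self.groupsize = groupsize
        self.infeatures = infeatures
        self.outfeatures = outfeatures


class ToyBlock:
    """Container whose attributes are child modules."""
    def __init__(self, **children):
        self.__dict__.update(children)


def replace_named(module, names, bits, groupsize, name=""):
    """Swap every ToyLinear whose dotted name is in `names` for a ToyQuantLinear."""
    if isinstance(module, ToyQuantLinear):
        return
    for attr, child in list(vars(module).items()):
        full = f"{name}.{attr}" if name else attr
        if isinstance(child, ToyLinear) and full in names:
            setattr(module, attr,
                    ToyQuantLinear(bits, groupsize, child.in_features, child.out_features))
        elif isinstance(child, ToyBlock):
            replace_named(child, names, bits, groupsize, full)  # recurse with the dotted prefix


model = ToyBlock(
    attn=ToyBlock(q_proj=ToyLinear(8, 8), k_proj=ToyLinear(8, 8)),
    head=ToyLinear(8, 4),
)
replace_named(model, {"attn.q_proj"}, bits=4, groupsize=128)
```

Only attn.q_proj is replaced here; layers whose dotted names are not listed, such as attn.k_proj and head, are left untouched.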
