Implementation:Hpcaitech ColossalAI GPTQ Quantizer

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Model Quantization, Model Optimization, CUDA
Last Updated	2026-02-09 00:00 GMT

Overview

GPTQ quantization implementation for LLaMA models, providing weight quantization and CUDA-accelerated quantized linear layers.

Description

This module implements GPTQ, a post-training weight quantization method for generative pre-trained transformers, for LLaMA models. The code is copied from the GPTQ-for-LLaMa project. It has three main components. The quantize function applies scale-and-zero-point quantization to a tensor. The Quantizer nn.Module computes the quantization parameters (scale and zero point), with support for per-channel quantization, symmetric and asymmetric modes, and an optional MSE-based grid search. The QuantLinear nn.Module implements a quantized linear layer for 2-, 3-, 4-, and 8-bit weights with CUDA-accelerated matrix multiplication. A helper function, make_quant, recursively replaces the named standard linear layers in a model with QuantLinear layers.
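As an illustration of scale-and-zero-point quantization, here is a minimal sketch in plain Python that works on single floats rather than tensors. The helper name quantize_value and the sample weights are assumptions made for this example and are not part of the module; the arithmetic follows the usual round, clamp, and rescale pattern.

```python
def quantize_value(x, scale, zero, maxq):
    """Round x onto the integer grid [0, maxq], then map it back to a float."""
    q = round(x / scale) + zero
    q = max(0, min(maxq, q))       # clamp to the representable integer range
    return scale * (q - zero)      # dequantized ("fake-quantized") value


bits = 4
maxq = 2 ** bits - 1               # largest integer level for 4-bit weights

# Hypothetical sample weights, chosen only for illustration.
weights = [-1.0, -0.35, 0.0, 0.72, 2.0]

# Asymmetric min/max parameters derived from the data range.
lo, hi = min(weights), max(weights)
scale = (hi - lo) / maxq
zero = round(-lo / scale)

dequantized = [quantize_value(w, scale, zero, maxq) for w in weights]
```

Every value inside the covered range is reproduced to within half a quantization step (scale / 2), which is the error budget that the parameter search tries to spend well.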

Usage

Use this module to quantize LLaMA model weights for reduced memory footprint and faster inference. It requires the quant_cuda CUDA extension for the forward pass of quantized layers.

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalChat/coati/quant/llama_gptq/quant.py
Lines: 1-283

Signature

def quantize(x, scale, zero, maxq):

class Quantizer(nn.Module):
    def __init__(self, shape=1):
    def configure(self, bits, perchannel=False, sym=True, mse=False, norm=2.4, grid=100, maxshrink=0.8):
    def find_params(self, x, weight=False):
    def quantize(self, x):
    def enabled(self):
    def ready(self):

class QuantLinear(nn.Module):
    def __init__(self, bits, groupsize, infeatures, outfeatures):
    def pack(self, linear, scales, zeros):
    def forward(self, x):

def make_quant(module, names, bits, groupsize, name=""):

Import

from coati.quant.llama_gptq.quant import Quantizer, QuantLinear, make_quant, quantize

I/O Contract

Inputs (Quantizer.configure)

Name	Type	Required	Description
bits	int	Yes	Number of quantization bits (2, 3, 4, or 8)
perchannel	bool	No	Whether to use per-channel quantization, defaults to False
sym	bool	No	Whether to use symmetric quantization, defaults to True
mse	bool	No	Whether to use MSE-based grid search for optimal parameters, defaults to False
norm	float	No	Exponent applied to the absolute quantization error in the MSE-style grid search, defaults to 2.4
grid	int	No	Grid size for MSE search, defaults to 100
maxshrink	float	No	Maximum shrinkage factor for MSE search, defaults to 0.8
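To show how norm, grid, and maxshrink might interact, the following plain-Python sketch shrinks the min/max range step by step and keeps the range with the lowest error. It is an assumption-laden illustration for one row of values, not the module's code: the names fake_quant and search_params are hypothetical, and the real Quantizer operates on tensors.

```python
def fake_quant(x, scale, zero, maxq):
    """Quantize then dequantize a single value."""
    q = max(0, min(maxq, round(x / scale) + zero))
    return scale * (q - zero)


def search_params(values, bits=4, norm=2.4, grid=100, maxshrink=0.8):
    """Return (error, scale, zero, shrink_factor) for the best shrunken range."""
    maxq = 2 ** bits - 1
    xmin = min(min(values), 0.0)
    xmax = max(max(values), 0.0)
    if xmin == xmax:               # all-zero input: fall back to a unit range
        xmin, xmax = -1.0, 1.0

    best = None
    for i in range(int(maxshrink * grid)):
        p = 1 - i / grid           # fraction of the full range that is kept
        lo, hi = p * xmin, p * xmax
        scale = (hi - lo) / maxq
        zero = round(-lo / scale)
        # Error is the sum of |quantized - original| raised to the power `norm`.
        err = sum(abs(fake_quant(v, scale, zero, maxq) - v) ** norm for v in values)
        if best is None or err < best[0]:
            best = (err, scale, zero, p)
    return best


# Hypothetical data: many small values plus one outlier.
values = [0.1 * k for k in range(-10, 11)] + [8.0]
best_err, best_scale, best_zero, best_p = search_params(values)
```

Because the unshrunk range (p = 1) is always among the candidates, the search can only match or improve on plain min/max parameters under this error measure.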

Inputs (QuantLinear.__init__)

Name	Type	Required	Description
bits	int	Yes	Number of quantization bits (2, 3, 4, or 8)
groupsize	int	Yes	Group size for quantization; -1 uses infeatures as group size
infeatures	int	Yes	Number of input features
outfeatures	int	Yes	Number of output features
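The sketch below illustrates, under stated assumptions, how low-bit integers can be stored in 32-bit words and how groupsize determines the number of scale/zero groups. The function names are hypothetical, the bit order (lowest bits first) is an assumption, and the exact layout used by QuantLinear.pack is not specified on this page. It covers only widths that divide 32; 3-bit values straddle word boundaries and need extra handling.

```python
import math


def pack_words(values, bits):
    """Pack small unsigned ints into 32-bit words, lowest bits first."""
    assert 32 % bits == 0, "sketch covers widths that divide 32 (2, 4, 8)"
    per_word = 32 // bits
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for j, v in enumerate(values[start:start + per_word]):
            assert 0 <= v < (1 << bits), "value does not fit in the bit width"
            word |= v << (bits * j)
        words.append(word)
    return words


def unpack_words(words, bits, count):
    """Inverse of pack_words; `count` trims padding in the last word."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    out = [(w >> (bits * j)) & mask for w in words for j in range(per_word)]
    return out[:count]


def num_groups(infeatures, groupsize):
    """Number of scale/zero groups along the input dimension."""
    if groupsize == -1:            # -1 means one group spanning all inputs
        groupsize = infeatures
    return math.ceil(infeatures / groupsize)


packed = pack_words(list(range(16)), bits=4)   # sixteen 4-bit values, two words
restored = unpack_words(packed, bits=4, count=16)
```

With 4-bit weights, eight values share one 32-bit word, so the packed weight storage is roughly one eighth the size of a 32-bit layout, plus the per-group scales and zero points.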

Outputs (QuantLinear.forward)

Name	Type	Description
return	torch.Tensor	Output tensor with shape matching input except last dim is outfeatures

Usage Examples

from coati.quant.llama_gptq.quant import Quantizer, QuantLinear, make_quant

# Assumes `weight_tensor` is a 2-D weight tensor taken from a linear layer
# and `model` is an already-loaded LLaMA model; neither is defined here.

# Configure and find quantization parameters
quantizer = Quantizer()
quantizer.configure(bits=4, perchannel=True, sym=True)
quantizer.find_params(weight_tensor, weight=True)
quantized_weight = quantizer.quantize(weight_tensor)

# Replace the named linear layers with quantized versions.
# Names are dotted attribute paths relative to `model`.
names = ["model.layers.0.self_attn.q_proj", "model.layers.0.self_attn.k_proj"]
make_quant(model, names, bits=4, groupsize=128)
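The recursive replacement by dotted name can be sketched with plain Python objects, without PyTorch. Everything here is a toy stand-in: Linear, QuantStub, Block, ToyModel, and replace_named are hypothetical names for this example, and the real make_quant works on nn.Module children and builds QuantLinear layers.

```python
class Linear:
    """Toy stand-in for a standard linear layer."""
    def __init__(self, infeatures, outfeatures):
        self.infeatures = infeatures
        self.outfeatures = outfeatures


class QuantStub:
    """Toy stand-in for a quantized linear layer."""
    def __init__(self, bits, groupsize, infeatures, outfeatures):
        self.bits = bits
        self.groupsize = groupsize
        self.infeatures = infeatures
        self.outfeatures = outfeatures


class Block:
    def __init__(self):
        self.q_proj = Linear(8, 8)
        self.k_proj = Linear(8, 8)


class ToyModel:
    def __init__(self):
        self.attn = Block()


def replace_named(module, names, bits, groupsize, name=""):
    """Walk attributes recursively and swap Linear layers whose path is in names."""
    for attr, child in list(vars(module).items()):
        full = f"{name}.{attr}" if name else attr   # dotted path from the root
        if isinstance(child, Linear) and full in names:
            stub = QuantStub(bits, groupsize, child.infeatures, child.outfeatures)
            setattr(module, attr, stub)
        elif hasattr(child, "__dict__"):
            replace_named(child, names, bits, groupsize, full)


toy = ToyModel()
replace_named(toy, names=["attn.q_proj"], bits=4, groupsize=128)
```

Only layers whose full dotted path appears in names are swapped, which is why the names passed to make_quant should be written relative to the module that is passed in.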

Related Pages

Environment:Hpcaitech_ColossalAI_CUDA_GPU_Environment
