Implementation:Hpcaitech ColossalAI GPTQ Quantizer
| Knowledge Sources | |
|---|---|
| Domains | Model Quantization, Model Optimization, CUDA |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
GPTQ quantization implementation for LLaMA models, providing weight quantization and CUDA-accelerated quantized linear layers.
Description
This module implements GPTQ, a post-training quantization method for generative pre-trained transformers, for LLaMA models. The code is copied from the GPTQ-for-LLaMa project. It contains three main components:
- quantize: a function that applies scale-and-zero-point quantization.
- Quantizer: an nn.Module that computes quantization parameters, with support for per-channel quantization, symmetric and asymmetric modes, and an optional MSE-based grid search.
- QuantLinear: an nn.Module that implements a quantized linear layer for 2-, 3-, 4-, and 8-bit weights, with CUDA-accelerated matrix multiplication.
The make_quant helper function recursively replaces the named standard linear layers in a model with QuantLinear layers.
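The scale-and-zero-point scheme mentioned above can be sketched in plain Python. This is a simplified, dependency-free illustration and not the module's code: the helper name minmax_params is hypothetical, it works on a flat list of floats instead of tensors, and the handling of symmetric ranges is an assumption that may differ in detail from Quantizer.find_params.

```python
def minmax_params(values, bits, sym=True):
    """Return (scale, zero, maxq) from the min/max range of a list of floats."""
    maxq = 2 ** bits - 1                  # largest integer code, e.g. 15 for 4 bits
    xmin = min(min(values), 0.0)          # the range is made to include 0
    xmax = max(max(values), 0.0)
    if sym:
        # Symmetric mode (simplified): use the same magnitude on both sides of 0.
        xmax = max(abs(xmin), xmax)
        xmin = -xmax
    if xmin == 0.0 and xmax == 0.0:
        xmin, xmax = -1.0, 1.0            # avoid a zero scale for all-zero input
    scale = (xmax - xmin) / maxq
    zero = (maxq + 1) / 2 if sym else round(-xmin / scale)
    return scale, zero, maxq


def quantize(x, scale, zero, maxq):
    """Round x onto the integer grid [0, maxq], then map it back to a float."""
    q = min(max(round(x / scale) + zero, 0), maxq)
    return scale * (q - zero)


weights = [-1.0, -0.2, 0.3, 0.9]
scale, zero, maxq = minmax_params(weights, bits=4, sym=True)
dequantized = [quantize(w, scale, zero, maxq) for w in weights]
```

Each value in dequantized lies on a grid of 16 levels spaced scale apart, so away from the clamped ends the round-trip error is at most half a step.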
Usage
Use this module to quantize LLaMA model weights for reduced memory footprint and faster inference. It requires the quant_cuda CUDA extension for the forward pass of quantized layers.
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/coati/quant/llama_gptq/quant.py
- Lines: 1-283
Signature
def quantize(x, scale, zero, maxq):
class Quantizer(nn.Module):
    def __init__(self, shape=1):
    def configure(self, bits, perchannel=False, sym=True, mse=False, norm=2.4, grid=100, maxshrink=0.8):
    def find_params(self, x, weight=False):
    def quantize(self, x):
    def enabled(self):
    def ready(self):
class QuantLinear(nn.Module):
    def __init__(self, bits, groupsize, infeatures, outfeatures):
    def pack(self, linear, scales, zeros):
    def forward(self, x):
def make_quant(module, names, bits, groupsize, name=""):
Import
from coati.quant.llama_gptq.quant import Quantizer, QuantLinear, make_quant, quantize
I/O Contract
Inputs (Quantizer.configure)
| Name | Type | Required | Description |
|---|---|---|---|
| bits | int | Yes | Number of quantization bits (2, 3, 4, or 8) |
| perchannel | bool | No | Whether to use per-channel quantization, defaults to False |
| sym | bool | No | Whether to use symmetric quantization, defaults to True |
| mse | bool | No | Whether to refine the min/max range with a grid search that minimizes quantization error, defaults to False |
| norm | float | No | Exponent applied to the absolute quantization error in the grid search, defaults to 2.4 |
| grid | int | No | Resolution of the grid search (the shrink factor changes in steps of 1/grid), defaults to 100 |
| maxshrink | float | No | Maximum fraction by which the range may be shrunk during the grid search, defaults to 0.8 |
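The way mse, norm, grid, and maxshrink interact can be sketched as a small range search. This is a plain-Python illustration under the assumption that the search follows the upstream GPTQ scheme (shrink the min/max range by a factor p = 1 - i/grid and keep the candidate with the lowest summed error raised to the power norm). The function name search_range is hypothetical, and only the asymmetric case is shown.

```python
def quantize(x, scale, zero, maxq):
    """Quantize-dequantize a single float with the given parameters."""
    q = min(max(round(x / scale) + zero, 0), maxq)
    return scale * (q - zero)


def search_range(values, bits, norm=2.4, grid=100, maxshrink=0.8):
    """Return (error, scale, zero) for the best shrunken min/max range."""
    maxq = 2 ** bits - 1
    xmin = min(min(values), 0.0)
    xmax = max(max(values), 0.0)
    if xmin == xmax:
        xmin, xmax = -1.0, 1.0            # degenerate all-zero input

    def error_for(lo, hi):
        scale = (hi - lo) / maxq
        zero = round(-lo / scale)
        err = sum(abs(quantize(v, scale, zero, maxq) - v) ** norm for v in values)
        return err, scale, zero

    best = error_for(xmin, xmax)          # plain min/max baseline
    for i in range(int(maxshrink * grid)):
        p = 1 - i / grid                  # shrink factor, from 1.0 downwards
        candidate = error_for(p * xmin, p * xmax)
        if candidate[0] < best[0]:
            best = candidate
    return best


# Many small weights plus one outlier: clipping the outlier can lower total error.
weights = [0.01 * i for i in range(-20, 21)] + [5.0]
err, scale, zero = search_range(weights, bits=3)
```

With the defaults, the loop evaluates 80 candidates, so the narrowest range tried is 21% of the original. The search can never do worse than the plain min/max range, because that range is the first candidate.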
Inputs (QuantLinear.__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| bits | int | Yes | Number of quantization bits (2, 3, 4, or 8) |
| groupsize | int | Yes | Group size for quantization; -1 uses infeatures as group size |
| infeatures | int | Yes | Number of input features |
| outfeatures | int | Yes | Number of output features |
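The bits and groupsize parameters determine how compactly the weights are stored. The sketch below shows the general idea of packing low-bit integer codes into 32-bit words, together with the groupsize rule from the table. It is an illustration only: the helper names are hypothetical, it covers only widths that divide 32 (2, 4, and 8 bits), and the bit layout actually used by QuantLinear.pack and the quant_cuda kernels may differ.

```python
def resolve_groupsize(groupsize, infeatures):
    """A groupsize of -1 means one group spanning all input features."""
    return infeatures if groupsize == -1 else groupsize


def pack_words(values, bits):
    """Pack unsigned integers (< 2**bits) into 32-bit words, lowest bits first."""
    if 32 % bits != 0:
        raise ValueError("this sketch only handles widths that divide 32")
    per_word = 32 // bits
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for slot, v in enumerate(values[start:start + per_word]):
            if not 0 <= v < 2 ** bits:
                raise ValueError(f"value {v} does not fit in {bits} bits")
            word |= v << (bits * slot)
        words.append(word)
    return words


def unpack_words(words, bits, count):
    """Inverse of pack_words; count is the number of original values."""
    per_word = 32 // bits
    mask = 2 ** bits - 1
    out = []
    for word in words:
        for slot in range(per_word):
            out.append((word >> (bits * slot)) & mask)
    return out[:count]


codes = list(range(16))            # sixteen 4-bit codes
packed = pack_words(codes, bits=4) # eight codes fit in each 32-bit word
restored = unpack_words(packed, bits=4, count=len(codes))
```

At 4 bits, eight codes share one word, which is where most of the memory saving over 16-bit weights comes from. The per-group scales and zero points add a small overhead that grows as groupsize shrinks.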
Outputs (QuantLinear.forward)
| Name | Type | Description |
|---|---|---|
| return | torch.Tensor | Output tensor with shape matching input except last dim is outfeatures |
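For intuition about what the forward pass computes, here is a reference calculation in plain Python for a single input vector. It is a sketch of the arithmetic and not the CUDA kernel: the function name dequant_matvec is hypothetical, the integer codes are assumed to be already unpacked into an infeatures-by-outfeatures table, and the exact zero-point convention in the real kernels may differ.

```python
def dequant_matvec(x, qweight, scales, zeros, groupsize, bias=None):
    """Compute y[j] = sum_i x[i] * scales[g][j] * (qweight[i][j] - zeros[g][j]) + bias[j].

    g = i // groupsize selects the group that input feature i belongs to.
    """
    infeatures = len(qweight)
    outfeatures = len(qweight[0])
    y = list(bias) if bias is not None else [0.0] * outfeatures
    for i in range(infeatures):
        g = i // groupsize
        for j in range(outfeatures):
            w = scales[g][j] * (qweight[i][j] - zeros[g][j])  # dequantized weight
            y[j] += x[i] * w
    return y


x = [1.0, 2.0]                 # one input vector with infeatures = 2
qweight = [[3, 0], [1, 2]]     # integer codes, [infeatures][outfeatures]
scales = [[0.5, 0.5]]          # one group, one scale per output feature
zeros = [[1, 1]]               # one group, one zero point per output feature
y = dequant_matvec(x, qweight, scales, zeros, groupsize=2)
```

The result has one entry per output feature. For a batched input, the same calculation applies to every vector along the last dimension, which gives the output shape described in the table.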
Usage Examples
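import torch
from coati.quant.llama_gptq.quant import Quantizer, QuantLinear, make_quant

# Placeholder weight matrix for illustration; in practice use a layer's weight
weight_tensor = torch.randn(16, 32)

# Configure the quantizer and find quantization parameters per output channel
quantizer = Quantizer()
quantizer.configure(bits=4, perchannel=True, sym=True)
quantizer.find_params(weight_tensor, weight=True)
# Quantize-dequantize: the result is a float tensor restricted to the 4-bit grid
quantized_weight = quantizer.quantize(weight_tensor)

# Replace linear layers with quantized versions. `model` is assumed to be an
# already loaded LLaMA model; names are dotted paths relative to `model`.
names = ["model.layers.0.self_attn.q_proj", "model.layers.0.self_attn.k_proj"]
make_quant(model, names, bits=4, groupsize=128)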
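How the dotted names select layers can be illustrated with a toy module tree. The classes below are hypothetical stand-ins written in plain Python and are not the PyTorch or coati classes. The sketch shows only the recursive name matching, under the assumption that each name is the attribute path from the module passed in.

```python
class ToyLinear:
    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features


class ToyQuantLinear:
    def __init__(self, bits, groupsize, infeatures, outfeatures):
        self.bits = bits
        self.groupsize = infeatures if groupsize == -1 else groupsize
        self.infeatures = infeatures
        self.outfeatures = outfeatures


class ToyBlock:
    """Container whose attributes are child layers or nested blocks."""
    def __init__(self, **children):
        self.__dict__.update(children)


def replace_named(module, names, bits, groupsize, prefix=""):
    """Recursively swap ToyLinear children whose dotted path is in names."""
    for attr, child in list(vars(module).items()):
        path = f"{prefix}.{attr}" if prefix else attr
        if path in names and isinstance(child, ToyLinear):
            setattr(module, attr,
                    ToyQuantLinear(bits, groupsize, child.in_features, child.out_features))
        elif isinstance(child, ToyBlock):
            replace_named(child, names, bits, groupsize, path)


model = ToyBlock(
    attn=ToyBlock(q_proj=ToyLinear(8, 8), k_proj=ToyLinear(8, 8)),
    head=ToyLinear(8, 4),
)
replace_named(model, {"attn.q_proj"}, bits=4, groupsize=-1)
```

Only attn.q_proj is replaced; attn.k_proj and head keep their original type because their paths are not listed. A name that does not match any path is silently ignored in this sketch, so a typo in names leaves the corresponding layer unquantized.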