Implementation: Hugging Face Optimum GPTQ Fasterquant
Overview
This page documents the sequential block quantization loop in GPTQQuantizer.quantize_model(), which creates GPTQ solver instances, accumulates Hessian statistics via forward hooks, and calls fasterquant() to solve for optimal quantized weights. This is a Wrapper Doc: the core GPTQ class and the fasterquant() implementation live in the external gptqmodel package; this page documents how optimum uses them.
Source
File: optimum/gptq/quantizer.py
- Block loop: Lines 538-635
- GPTQ solver usage: Lines 578-612
External: gptqmodel.quantization.GPTQ
APIs Used (from gptqmodel)
| API | Signature | Description |
|---|---|---|
| `GPTQ(layer, qcfg)` | Constructor | Creates a GPTQ solver for a single linear layer with the given quantization config. |
| `GPTQ.add_batch(input, output)` | Method | Accumulates a batch of input/output data into the Hessian matrix. |
| `GPTQ.fasterquant(percdamp, group_size, actorder)` | Method | Solves for optimal quantized weights. Returns `(scale, zero, g_idx, ...)`. |
| `GPTQ.free()` | Method | Releases memory held by the solver. |
Block Quantization Loop
The main quantization loop iterates over all transformer blocks:
```python
quantizers = {}
for i, block in enumerate(tqdm(blocks, desc=f"Quantizing {self.block_name_to_quantize} blocks ")):
    # Move the block to GPU if needed
    if (not has_device_map or get_device(block) == torch.device("cpu")) and has_device_more_than_cpu():
        block = block.to(0)
    layers = get_layers(block)
    # ...
```
Layer Subset Processing
Within each block, layers are grouped for quantization based on the true_sequential and modules_in_block_to_quantize settings:
| Setting | Behavior |
|---|---|
| `true_sequential=True`, no custom modules | Each layer is quantized individually: `[[layer1], [layer2], ...]` |
| `true_sequential=False`, no custom modules | All layers are quantized together: `[[layer1, layer2, ...]]` |
| `true_sequential=True`, custom modules | User-defined layer groups are processed sequentially. |
| `true_sequential=False`, custom modules | All custom module groups are flattened into one batch. |
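The four cases can be sketched as a small helper; `layer_subsets` is a hypothetical name, and the real logic is inline in quantize_model():

```python
def layer_subsets(layer_names, true_sequential, custom_groups=None):
    """Sketch of how a block's linear layers are grouped for GPTQ,
    reproducing the four cases from the table above."""
    if custom_groups is not None:
        if true_sequential:
            # User-defined groups, processed in order
            return custom_groups
        # All custom groups flattened into one batch
        return [[name for group in custom_groups for name in group]]
    if true_sequential:
        # One layer per subset
        return [[name] for name in layer_names]
    # All layers quantized together
    return [list(layer_names)]

names = ["q_proj", "k_proj", "v_proj"]
layer_subsets(names, True)   # [['q_proj'], ['k_proj'], ['v_proj']]
layer_subsets(names, False)  # [['q_proj', 'k_proj', 'v_proj']]
```

Sequential processing matters because each subset sees activations produced by the already-quantized layers before it, which is what `true_sequential` trades against speed.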
GPTQ Solver Usage Pattern
For each subset of layers:
```python
gptq = {}     # per-layer GPTQ solvers
handles = []  # forward-hook handles

# Step 1: Create GPTQ solvers and register hooks
for name in subset_layers:
    gptq[name] = GPTQ(subset_layers[name], qcfg=self.quantize_config)
    gptq[name].quantizer.configure(bits=self.bits, sym=self.sym, perchannel=True)

    def add_batch(name):
        def tmp(_, input, output):
            gptq[name].add_batch(input[0].data, output.data)
        return tmp

    handles.append(subset_layers[name].register_forward_hook(add_batch(name)))

# Step 2: Run calibration data through the block to accumulate the Hessian
for j in range(len(dataset)):
    layer_inputs[j] = nested_move_to(layer_inputs[j], block_device)
    for k, v in layer_input_kwargs[j].items():
        layer_input_kwargs[j][k] = nested_move_to(v, block_device)
    block(*layer_inputs[j], **layer_input_kwargs[j])

# Step 3: Remove hooks
for h in handles:
    h.remove()

# Step 4: Solve for quantized weights
for name in subset_layers:
    quant_outputs = gptq[name].fasterquant(
        percdamp=self.damp_percent,
        group_size=self.group_size,
        actorder=self.desc_act,
    )
    scale, zero, g_idx = quant_outputs[0], quant_outputs[1], quant_outputs[2]
    quantizers[f"{self.block_name_to_quantize}.{i}.{name}"] = (
        gptq[name].quantizer, scale, zero, g_idx,
    )
    gptq[name].free()
```
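The solve in step 4 can be sketched under simplifying assumptions: no group_size grouping, no actorder, and a naive per-channel min/max quantizer standing in for the configured one (`fasterquant_sketch` and `quantize_scalar` are hypothetical names; the real gptqmodel kernel is blocked and far more careful). The core is the column-wise GPTQ update from the paper: dampen the Hessian, take the upper Cholesky factor of its inverse, then quantize columns left to right while feeding each column's error back into the not-yet-quantized columns.

```python
import numpy as np

def quantize_scalar(w, scale, zero, maxq):
    """Naive asymmetric round-to-grid, returned in dequantized form."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)

def fasterquant_sketch(W, H, bits=4, percdamp=0.01):
    """Sketch of the GPTQ solve: W is (out_features, in_features),
    H is the accumulated (in_features, in_features) Hessian."""
    W = W.astype(np.float64).copy()
    rows, cols = W.shape
    maxq = 2**bits - 1
    # Stand-in per-channel (per-row) min/max scale and zero point
    scale = (W.max(axis=1) - W.min(axis=1)) / maxq
    zero = np.round(-W.min(axis=1) / scale)
    # Dampen the diagonal by percdamp * mean(diag(H)) for stability
    damp = percdamp * np.mean(np.diag(H))
    Hd = H + damp * np.eye(cols)
    # Upper Cholesky factor of the inverse drives error propagation
    Hinv = np.linalg.cholesky(np.linalg.inv(Hd)).T
    Q = np.zeros_like(W)
    for i in range(cols):
        w = W[:, i]
        q = quantize_scalar(w, scale, zero, maxq)
        Q[:, i] = q
        # Push this column's quantization error onto later columns
        err = (w - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q, scale, zero
```

The damping term is what `percdamp` (optimum's `damp_percent`) controls; `actorder` would additionally reorder columns by decreasing diagonal of H before this loop.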
Output Propagation
After quantizing a block, the updated block output is captured and used as input for the next block:
```python
if self.cache_block_outputs:
    for j in range(len(dataset)):
        layer_output = block(*layer_inputs[j], **layer_input_kwargs[j])
        primary = layer_output[0] if isinstance(layer_output, tuple) else layer_output
        primary = nested_move_to(primary, device=cur_layer_device)
        layer_outputs.append([primary])
    # Swap: outputs become inputs for the next block
    layer_inputs, layer_outputs = layer_outputs, []
```
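The ping-pong buffer pattern is easiest to see with plain callables standing in for transformer blocks (`propagate` is a hypothetical helper, not optimum code):

```python
def propagate(blocks, inputs):
    """Toy sketch of the output-propagation loop: each block's outputs
    over the calibration set become the next block's inputs."""
    layer_inputs = inputs
    for block in blocks:
        layer_outputs = []
        for inp in layer_inputs:
            layer_outputs.append(block(inp))
        # Swap: outputs become inputs for the next block
        layer_inputs, layer_outputs = layer_outputs, []
    return layer_inputs

# Two toy "blocks" applied to three calibration samples
result = propagate([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
# result == [4, 6, 8]
```

Because each block is quantized before its outputs are recomputed, later blocks calibrate against activations that already reflect earlier quantization error.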
External Dependencies
| Package | Import | Usage |
|---|---|---|
| `gptqmodel` | `from gptqmodel.quantization import GPTQ` | Core GPTQ solver: Hessian accumulation and the fasterquant algorithm. |
| `gptqmodel` | `from gptqmodel import QuantizeConfig` | Configuration passed to the `GPTQ` constructor. |