Implementation: Hugging Face Optimum GPTQ Fasterquant
Overview
This page documents the sequential block quantization loop in GPTQQuantizer.quantize_model(), which creates GPTQ solver instances, accumulates Hessian statistics via forward hooks, and calls fasterquant() to solve for optimal quantized weights. This is a Wrapper Doc: the core GPTQ class and the fasterquant() implementation live in the external gptqmodel package; this page documents how optimum uses them.
Source
File: optimum/gptq/quantizer.py
- Block loop: Lines 538-635
- GPTQ solver usage: Lines 578-612
External: gptqmodel.quantization.GPTQ
APIs Used (from gptqmodel)
| API | Signature | Description |
|---|---|---|
| `GPTQ(layer, qcfg)` | Constructor | Creates a GPTQ solver for a single linear layer with the given quantization config. |
| `GPTQ.add_batch(input, output)` | Method | Accumulates a batch of input/output data into the Hessian matrix. |
| `GPTQ.fasterquant(percdamp, group_size, actorder)` | Method | Solves for optimal quantized weights. Returns `(scale, zero, g_idx, ...)`. |
| `GPTQ.free()` | Method | Releases memory held by the solver. |
Block Quantization Loop
The main quantization loop iterates over all transformer blocks:
```python
quantizers = {}
for i, block in enumerate(tqdm(blocks, desc=f"Quantizing {self.block_name_to_quantize} blocks ")):
    # Move the block to GPU if needed
    if (not has_device_map or get_device(block) == torch.device("cpu")) and has_device_more_than_cpu():
        block = block.to(0)
    layers = get_layers(block)
    # ...
```
Layer Subset Processing
Within each block, layers are grouped for quantization based on the true_sequential and modules_in_block_to_quantize settings:
| Setting | Behavior |
|---|---|
| `true_sequential=True`, no custom modules | Each layer is quantized individually: `[[layer1], [layer2], ...]` |
| `true_sequential=False`, no custom modules | All layers are quantized together: `[[layer1, layer2, ...]]` |
| `true_sequential=True`, custom modules | User-defined layer groups are processed sequentially. |
| `true_sequential=False`, custom modules | All custom module groups are flattened into one batch. |
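The four cases can be sketched as a small helper; `layer_subsets` is a hypothetical name, and the real logic is inline in quantize_model():

```python
def layer_subsets(layer_names, true_sequential, custom_groups=None):
    """Sketch of how a block's linear layers are grouped for GPTQ,
    reproducing the four cases from the table above."""
    if custom_groups is not None:
        if true_sequential:
            # User-defined groups, processed in order
            return custom_groups
        # All custom groups flattened into one batch
        return [[name for group in custom_groups for name in group]]
    if true_sequential:
        # One layer per subset
        return [[name] for name in layer_names]
    # All layers quantized together
    return [list(layer_names)]

names = ["q_proj", "k_proj", "v_proj"]
layer_subsets(names, True)   # [['q_proj'], ['k_proj'], ['v_proj']]
layer_subsets(names, False)  # [['q_proj', 'k_proj', 'v_proj']]
```

Sequential processing matters because each subset sees activations produced by the already-quantized layers before it, which is what `true_sequential` trades against speed.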
GPTQ Solver Usage Pattern
For each subset of layers:
```python
gptq = {}     # per-layer GPTQ solvers
handles = []  # forward-hook handles

# Step 1: Create GPTQ solvers and register hooks
for name in subset_layers:
    gptq[name] = GPTQ(subset_layers[name], qcfg=self.quantize_config)
    gptq[name].quantizer.configure(bits=self.bits, sym=self.sym, perchannel=True)

    def add_batch(name):
        def tmp(_, input, output):
            gptq[name].add_batch(input[0].data, output.data)
        return tmp

    handles.append(subset_layers[name].register_forward_hook(add_batch(name)))

# Step 2: Run calibration data through the block to accumulate the Hessian
for j in range(len(dataset)):
    layer_inputs[j] = nested_move_to(layer_inputs[j], block_device)
    for k, v in layer_input_kwargs[j].items():
        layer_input_kwargs[j][k] = nested_move_to(v, block_device)
    block(*layer_inputs[j], **layer_input_kwargs[j])

# Step 3: Remove hooks
for h in handles:
    h.remove()

# Step 4: Solve for quantized weights
for name in subset_layers:
    quant_outputs = gptq[name].fasterquant(
        percdamp=self.damp_percent,
        group_size=self.group_size,
        actorder=self.desc_act,
    )
    scale, zero, g_idx = quant_outputs[0], quant_outputs[1], quant_outputs[2]
    quantizers[f"{self.block_name_to_quantize}.{i}.{name}"] = (
        gptq[name].quantizer, scale, zero, g_idx,
    )
    gptq[name].free()
```
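The solve in step 4 can be sketched under simplifying assumptions: no group_size grouping, no actorder, and a naive per-channel min/max quantizer standing in for the configured one (`fasterquant_sketch` and `quantize_scalar` are hypothetical names; the real gptqmodel kernel is blocked and far more careful). The core is the column-wise GPTQ update from the paper: dampen the Hessian, take the upper Cholesky factor of its inverse, then quantize columns left to right while feeding each column's error back into the not-yet-quantized columns.

```python
import numpy as np

def quantize_scalar(w, scale, zero, maxq):
    """Naive asymmetric round-to-grid, returned in dequantized form."""
    q = np.clip(np.round(w / scale) + zero, 0, maxq)
    return scale * (q - zero)

def fasterquant_sketch(W, H, bits=4, percdamp=0.01):
    """Sketch of the GPTQ solve: W is (out_features, in_features),
    H is the accumulated (in_features, in_features) Hessian."""
    W = W.astype(np.float64).copy()
    rows, cols = W.shape
    maxq = 2**bits - 1
    # Stand-in per-channel (per-row) min/max scale and zero point
    scale = (W.max(axis=1) - W.min(axis=1)) / maxq
    zero = np.round(-W.min(axis=1) / scale)
    # Dampen the diagonal by percdamp * mean(diag(H)) for stability
    damp = percdamp * np.mean(np.diag(H))
    Hd = H + damp * np.eye(cols)
    # Upper Cholesky factor of the inverse drives error propagation
    Hinv = np.linalg.cholesky(np.linalg.inv(Hd)).T
    Q = np.zeros_like(W)
    for i in range(cols):
        w = W[:, i]
        q = quantize_scalar(w, scale, zero, maxq)
        Q[:, i] = q
        # Push this column's quantization error onto later columns
        err = (w - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q, scale, zero
```

The damping term is what `percdamp` (optimum's `damp_percent`) controls; `actorder` would additionally reorder columns by decreasing diagonal of H before this loop.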
Output Propagation
After quantizing a block, the updated block output is captured and used as input for the next block:
```python
if self.cache_block_outputs:
    for j in range(len(dataset)):
        layer_output = block(*layer_inputs[j], **layer_input_kwargs[j])
        primary = layer_output[0] if isinstance(layer_output, tuple) else layer_output
        primary = nested_move_to(primary, device=cur_layer_device)
        layer_outputs.append([primary])
    # Swap: outputs become inputs for the next block
    layer_inputs, layer_outputs = layer_outputs, []
```
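The ping-pong buffer pattern is easiest to see with plain callables standing in for transformer blocks (`propagate` is a hypothetical helper, not optimum code):

```python
def propagate(blocks, inputs):
    """Toy sketch of the output-propagation loop: each block's outputs
    over the calibration set become the next block's inputs."""
    layer_inputs = inputs
    for block in blocks:
        layer_outputs = []
        for inp in layer_inputs:
            layer_outputs.append(block(inp))
        # Swap: outputs become inputs for the next block
        layer_inputs, layer_outputs = layer_outputs, []
    return layer_inputs

# Two toy "blocks" applied to three calibration samples
result = propagate([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
# result == [4, 6, 8]
```

Because each block is quantized before its outputs are recomputed, later blocks calibrate against activations that already reflect earlier quantization error.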
External Dependencies
| Package | Import | Usage |
|---|---|---|
| `gptqmodel` | `from gptqmodel.quantization import GPTQ` | Core GPTQ solver: Hessian accumulation and the fasterquant algorithm. |
| `gptqmodel` | `from gptqmodel import QuantizeConfig` | Configuration passed to the `GPTQ` constructor. |