Implementation:Huggingface Optimum GPTQ Fasterquant

From Leeroopedia

Overview

This page documents the sequential block quantization loop in GPTQQuantizer.quantize_model(), which creates GPTQ solver instances, accumulates Hessian statistics via forward hooks, and calls fasterquant() to solve for the optimal quantized weights. This is a Wrapper Doc: the core GPTQ class and the fasterquant() implementation live in the external gptqmodel package; this page documents how optimum uses them.

Source

File: optimum/gptq/quantizer.py

  • Block loop: Lines 538-635
  • GPTQ solver usage: Lines 578-612

External: gptqmodel.quantization.GPTQ

APIs Used (from gptqmodel)

  • GPTQ(layer, qcfg) (constructor): creates a GPTQ solver for a single linear layer with the given quantization config.
  • GPTQ.add_batch(input, output) (method): accumulates a batch of input/output data into the Hessian matrix.
  • GPTQ.fasterquant(percdamp, group_size, actorder) (method): solves for the optimal quantized weights; returns (scale, zero, g_idx, ...).
  • GPTQ.free() (method): releases memory held by the solver.
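The division of labor between add_batch() and fasterquant() can be illustrated with a toy, pure-Python stand-in (this is NOT the gptqmodel implementation; ToyGPTQ and its method names are invented for illustration). add_batch maintains a running Hessian estimate H ≈ (2/n) Σ x xᵀ over calibration inputs, and fasterquant then dampens the diagonal by percdamp before inverting:

```python
# Toy stand-in for the solver's bookkeeping (NOT the gptqmodel
# implementation): add_batch keeps a running Hessian estimate
# H ~= (2/n) * sum_k x_k x_k^T, and dampen() mirrors fasterquant's
# percdamp stabilization of the diagonal before inversion.

class ToyGPTQ:
    def __init__(self, in_features):
        self.n = in_features
        self.nsamples = 0
        self.H = [[0.0] * in_features for _ in range(in_features)]

    def add_batch(self, xs):
        """xs: list of input rows, each a list of `in_features` floats."""
        for x in xs:
            self.nsamples += 1
            k = self.nsamples
            for i in range(self.n):
                for j in range(self.n):
                    # Running mean: H <- H * (k-1)/k + (2/k) * x x^T
                    self.H[i][j] = self.H[i][j] * (k - 1) / k + 2.0 * x[i] * x[j] / k

    def dampen(self, percdamp=0.01):
        """Add percdamp * mean(diag(H)) to the diagonal, keeping the
        subsequent factorization of H numerically stable."""
        damp = percdamp * sum(self.H[i][i] for i in range(self.n)) / self.n
        for i in range(self.n):
            self.H[i][i] += damp
        return damp

solver = ToyGPTQ(2)
solver.add_batch([[1.0, 0.0], [0.0, 1.0]])
# solver.H is now the 2x2 identity: [[1.0, 0.0], [0.0, 1.0]]
```

The real solver works on tensors and additionally solves the per-column weight update; this sketch only shows why calibration data must flow through the layer (Step 2 below) before fasterquant() can run.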

Block Quantization Loop

The main quantization loop iterates over all transformer blocks:

quantizers = {}
for i, block in enumerate(tqdm(blocks, desc=f"Quantizing {self.block_name_to_quantize} blocks ")):
    # Move block to GPU if needed
    if (not has_device_map or get_device(block) == torch.device("cpu")) and has_device_more_than_cpu():
        block = block.to(0)
    layers = get_layers(block)
    # ...

Layer Subset Processing

Within each block, layers are grouped for quantization based on the true_sequential and modules_in_block_to_quantize settings:

  • true_sequential=True, no custom modules: each layer is quantized individually, as [[layer1], [layer2], ...].
  • true_sequential=False, no custom modules: all layers are quantized together, as [[layer1, layer2, ...]].
  • true_sequential=True, custom modules: the user-defined layer groups are processed sequentially.
  • true_sequential=False, custom modules: all custom module groups are flattened into one batch.
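These four cases can be sketched in plain Python (an illustrative reconstruction of the grouping rules, not the literal optimum code; `custom_groups` stands in for modules_in_block_to_quantize):

```python
def build_layer_subsets(layer_names, true_sequential, custom_groups=None):
    """Group layer names into quantization subsets.

    Illustrative reconstruction of the grouping rules, not the literal
    optimum code. `custom_groups` plays the role of
    `modules_in_block_to_quantize`.
    """
    if custom_groups is None:
        # Default discovery: one subset per layer, or one big subset.
        return [[n] for n in layer_names] if true_sequential else [list(layer_names)]
    # Custom groups: process them sequentially, or flatten into one batch.
    return custom_groups if true_sequential else [[n for g in custom_groups for n in g]]

names = ["q_proj", "k_proj", "v_proj"]
build_layer_subsets(names, True)   # [["q_proj"], ["k_proj"], ["v_proj"]]
build_layer_subsets(names, False)  # [["q_proj", "k_proj", "v_proj"]]
```

Sequential processing is slower (one calibration pass per subset) but more faithful, since later layers see the already-quantized outputs of earlier ones.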

GPTQ Solver Usage Pattern

For each subset of layers:

# Step 1: Create GPTQ solvers and register hooks
for name in subset_layers:
    gptq[name] = GPTQ(subset_layers[name], qcfg=self.quantize_config)
    gptq[name].quantizer.configure(bits=self.bits, sym=self.sym, perchannel=True)

    def add_batch(name):
        def tmp(_, input, output):
            gptq[name].add_batch(input[0].data, output.data)
        return tmp

    handles.append(subset_layers[name].register_forward_hook(add_batch(name)))

# Step 2: Run calibration data through the block to accumulate Hessian
for j in range(len(dataset)):
    layer_inputs[j] = nested_move_to(layer_inputs[j], block_device)
    for k, v in layer_input_kwargs[j].items():
        layer_input_kwargs[j][k] = nested_move_to(v, block_device)
    block(*layer_inputs[j], **layer_input_kwargs[j])

# Step 3: Remove hooks
for h in handles:
    h.remove()

# Step 4: Solve for quantized weights
for name in subset_name_list:  # subset_name_list = list(subset_layers.keys())
    quant_outputs = gptq[name].fasterquant(
        percdamp=self.damp_percent,
        group_size=self.group_size,
        actorder=self.desc_act,
    )
    scale, zero, g_idx = quant_outputs[0], quant_outputs[1], quant_outputs[2]
    quantizers[f"{self.block_name_to_quantize}.{i}.{name}"] = (
        gptq[name].quantizer, scale, zero, g_idx,
    )
    gptq[name].free()
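One subtlety in Step 1 deserves a note: the hook is built by a factory function, `add_batch(name)`, rather than defined inline. Python closures capture variables, not values, so an inline hook would look up `name` only when it runs, after the loop has finished, and every hook would report the last layer. A standalone illustration of the pitfall and the fix:

```python
# Why Step 1 wraps the hook in a factory: closures capture variables,
# not values, so a hook defined directly in the loop body would see
# only the final value of `name`.

hooks_buggy, hooks_fixed = {}, {}

for name in ["q_proj", "k_proj"]:
    # Buggy: `name` is resolved when the hook runs, after the loop ended.
    hooks_buggy[name] = lambda: name

    # Fixed: the factory freezes `name` as a local of make_hook.
    def make_hook(name):
        def hook():
            return name
        return hook
    hooks_fixed[name] = make_hook(name)

print(hooks_buggy["q_proj"]())  # "k_proj" -- wrong layer
print(hooks_fixed["q_proj"]())  # "q_proj"
```

The same fix could be spelled as a default argument (`lambda name=name: name`); optimum's factory style is equivalent.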

Output Propagation

After quantizing a block, the updated block output is captured and used as input for the next block:

if self.cache_block_outputs:
    for j in range(len(dataset)):
        layer_output = block(*layer_inputs[j], **layer_input_kwargs[j])
        primary = layer_output[0] if isinstance(layer_output, tuple) else layer_output
        primary = nested_move_to(primary, device=cur_layer_device)
        layer_outputs.append([primary])
    # Swap: outputs become inputs for the next block
    layer_inputs, layer_outputs = layer_outputs, []
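The final swap is a ping-pong buffer: this block's outputs become the next block's inputs without copying. A toy version, with plain functions standing in for transformer blocks:

```python
# Toy output propagation: each "block" transforms its inputs, and the
# swap makes its outputs the next block's inputs.
blocks = [lambda x: x + 1, lambda x: x * 2]  # stand-ins for transformer blocks
layer_inputs = [[1], [10]]                   # one entry per calibration sample

for block in blocks:
    layer_outputs = []
    for sample in layer_inputs:
        layer_outputs.append([block(sample[0])])
    # Swap: outputs become inputs for the next block
    layer_inputs, layer_outputs = layer_outputs, []

print(layer_inputs)  # [[4], [22]]
```

Because the forward pass here runs through the already-quantized block, the next block is calibrated against quantized activations rather than the original full-precision ones, which is what makes the procedure sequential.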

External Dependencies

  • gptqmodel: from gptqmodel.quantization import GPTQ (core GPTQ solver: Hessian accumulation and the fasterquant algorithm).
  • gptqmodel: from gptqmodel import QuantizeConfig (configuration object passed to the GPTQ constructor).
