Implementation: Hugging Face Optimum GPTQQuantizer Pack Model
Overview
Packs quantized weights into compact QuantLinear layers by replacing the original linear layers with their quantized, packed equivalents.
Source
File: optimum/gptq/quantizer.py Lines: 676-714
Signature
```python
def pack_model(
    self,
    model: nn.Module,
    quantizers: Dict[str, Tuple],
) -> None:
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `model` | `nn.Module` | The model to pack. Must have already been quantized (i.e., the `quantizers` dict populated by the quantization loop). |
| `quantizers` | `Dict[str, Tuple]` | Mapping of layer names to quantization data tuples: `(quantizer, scale, zero, g_idx)`. |
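To make the tuple layout concrete, here is a hypothetical `quantizers` entry for a single layer. The layer name, tensor shapes, and group size are illustrative assumptions (4-bit quantization, group size 128, a 4096×4096 linear layer), not values taken from optimum, and the quantizer object is stubbed:

```python
import torch

# Illustrative shapes only (assumed, not from the optimum source):
# out_features=4096, in_features=4096, group_size=128
out_features, in_features, group_size = 4096, 4096, 128
n_groups = in_features // group_size  # 32 quantization groups

quantizer = object()  # stand-in for the per-layer GPTQ quantizer object
scale = torch.rand(n_groups, out_features)       # per-group dequantization scales
zero = torch.zeros(n_groups, out_features)       # per-group zero points
g_idx = torch.arange(in_features) // group_size  # maps each input column to its group

quantizers = {
    "model.layers.0.self_attn.q_proj": (quantizer, scale, zero, g_idx),
}
```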
Behavior
The packing process follows these steps:
- **Collect layers** — Calls `get_layers(model)` to find all linear layers, then filters to only those present in the `quantizers` dict.
- **Determine device map** — If the model has an `hf_device_map` attribute (multi-device deployment), uses that; otherwise, creates a single-device map from the model's first parameter device.
- **Select pack-capable QuantLinear** — Calls `self.select_quant_linear(device_map=device_map, pack=True)` to get the `QuantLinear` class appropriate for packing.
- **Replace layers** — Calls `self._replace_by_quant_layers(model, quantizers)` to swap in the new `QuantLinear` instances.
- **Pack each layer** — For each quantized layer:
  - Saves the layer's original device.
  - Moves the `QuantLinear` layer and all quantization data (original layer, scale, zero, g_idx) to CPU.
  - Calls `qlayers[name].pack(layers[name], scale, zero, g_idx)` to pack the quantized weights.
  - Moves the packed layer back to the original device.
```python
def pack_model(self, model: nn.Module, quantizers: Dict[str, Tuple]) -> None:
    logger.info("Packing model...")
    layers = get_layers(model)
    layers = {n: layers[n] for n in quantizers}
    if hasattr(model, "hf_device_map"):
        device_map = model.hf_device_map
    else:
        device_map = {"": next(model.parameters()).device}
    self.select_quant_linear(device_map=device_map, pack=True)
    self._replace_by_quant_layers(model, quantizers)
    qlayers = get_layers(model, [self.quant_linear])
    for name in qlayers:
        # Unpack the 4-tuple; quantizers[name] is rebound to the quantizer object
        quantizers[name], scale, zero, g_idx = quantizers[name]
        # Packing is only supported on CPU, so move everything there first
        layer_device = qlayers[name].device
        qlayers[name].to("cpu")
        layers[name], scale, zero, g_idx = (
            layers[name].to("cpu"), scale.to("cpu"), zero.to("cpu"), g_idx.to("cpu")
        )
        qlayers[name].pack(layers[name], scale, zero, g_idx)
        # Restore the packed layer to its original device
        qlayers[name].to(layer_device)
    logger.info("Model packed.")
```
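The listing relies on the `get_layers` helper, whose implementation is not reproduced on this page. A simplified stand-in, assuming it recursively maps dotted module names to `nn.Linear` layers, might look like the following sketch (the function name `collect_linear_layers` is invented here; optimum's real helper also matches other layer types via its second argument):

```python
import torch.nn as nn

def collect_linear_layers(module, prefix=""):
    """Recursively map dotted layer names to nn.Linear modules
    (simplified, hypothetical stand-in for optimum's get_layers)."""
    found = {}
    for name, child in module.named_children():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(child, nn.Linear):
            found[full_name] = child
        else:
            found.update(collect_linear_layers(child, full_name))
    return found

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))
layers = collect_linear_layers(model)        # names "0" and "2" map to the Linear layers
quantizers = {"2": None}                     # pretend only layer "2" was quantized
layers = {n: layers[n] for n in quantizers}  # keep only quantized layers, as pack_model does
```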
Key Implementation Details
- **CPU packing** — All packing operations are performed on CPU regardless of the target device, because the `pack()` method involves integer bit manipulation that is most reliably performed on CPU.
- **Device restoration** — After packing, each layer is moved back to its original device to maintain the model's device placement.
- **QuantLinear selection** — The `pack=True` flag passed to `select_quant_linear()` ensures the correct packing implementation is chosen for the target backend.
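To illustrate the kind of integer bit manipulation that packing involves, here is a minimal sketch (not the optimum implementation) of the underlying storage trick: fitting eight unsigned 4-bit values into a single int32 word, which is how `QuantLinear`-style layers keep their quantized weight buffers compact:

```python
import torch

def pack_int4(values: torch.Tensor) -> torch.Tensor:
    """Pack groups of 8 unsigned 4-bit ints (0..15) into int32 words."""
    assert values.numel() % 8 == 0
    v = values.to(torch.int32).reshape(-1, 8)
    packed = torch.zeros(v.shape[0], dtype=torch.int32)
    for i in range(8):
        packed |= v[:, i] << (4 * i)  # each value occupies its own 4-bit slot
    return packed

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse: recover the 8 nibbles stored in each int32 word."""
    out = torch.stack([(packed >> (4 * i)) & 0xF for i in range(8)], dim=1)
    return out.reshape(-1)

vals = torch.tensor([3, 15, 0, 7, 1, 9, 12, 5])
assert torch.equal(unpack_int4(pack_int4(vals)), vals)  # lossless round trip
```

Shift-based packing like this is cheap and exact on CPU integer tensors, which is consistent with the implementation's choice to move layers to CPU before calling `pack()`.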