Implementation:Huggingface Optimum GPTQQuantizer Pack Model

From Leeroopedia

Overview

Packs quantized weights into compact QuantLinear layers by replacing the original linear layers with their quantized, packed equivalents.

Source

File: optimum/gptq/quantizer.py Lines: 676-714

Signature

def pack_model(
    self,
    model: nn.Module,
    quantizers: Dict[str, Tuple],
) -> None:

Parameters

  • model (nn.Module) — The model to pack. It must already have been quantized, i.e. the quantizers dict was populated by the quantization loop.
  • quantizers (Dict[str, Tuple]) — Mapping of layer names to quantization data tuples: (quantizer, scale, zero, g_idx).
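For illustration, a minimal sketch of what one entry in the quantizers mapping might look like. The layer name, shapes, and the placeholder quantizer object are assumptions for illustration, not values from a real quantization run:

```python
from types import SimpleNamespace

# Stand-in for the per-layer GPTQ quantizer state (assumption, not optimum's class).
quantizer_obj = SimpleNamespace(bits=4, group_size=128)

# Illustrative quantization data for one layer; shapes are made up.
scale = [[0.02, 0.03]]   # per-group scales
zero = [[8, 8]]          # per-group zero points
g_idx = [0, 0, 1, 1]     # group index for each input column

quantizers = {
    "model.layers.0.self_attn.q_proj": (quantizer_obj, scale, zero, g_idx),
}

# Each value is the 4-tuple that pack_model unpacks per layer.
q, s, z, g = quantizers["model.layers.0.self_attn.q_proj"]
```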

Behavior

The packing process follows these steps:

  1. Collect layers — Calls get_layers(model) to find all linear layers, then filters to only those present in the quantizers dict.
  2. Determine device map — If the model has an hf_device_map attribute (multi-device deployment), uses that; otherwise, creates a single-device map from the model's first parameter device.
  3. Select pack-capable QuantLinear — Calls self.select_quant_linear(device_map=device_map, pack=True) to get the QuantLinear class appropriate for packing.
  4. Replace layers — Calls self._replace_by_quant_layers(model, quantizers) to swap in the new QuantLinear instances, then retrieves them with get_layers(model, [self.quant_linear]).
  5. Pack each layer — For each quantized layer:
    • Saves the layer's original device.
    • Moves the QuantLinear layer and all quantization data (original layer, scale, zero, g_idx) to CPU.
    • Calls qlayers[name].pack(layers[name], scale, zero, g_idx) to pack the quantized weights.
    • Moves the packed layer back to the original device.
The corresponding implementation:

def pack_model(self, model: nn.Module, quantizers: Dict[str, Tuple]) -> None:
    logger.info("Packing model...")
    # Collect the linear layers that were quantized.
    layers = get_layers(model)
    layers = {n: layers[n] for n in quantizers}

    # Prefer the accelerate-style device map; fall back to the device
    # of the first parameter for single-device models.
    if hasattr(model, "hf_device_map"):
        device_map = model.hf_device_map
    else:
        device_map = {"": next(model.parameters()).device}

    # Choose a pack-capable QuantLinear class and swap it into the model.
    self.select_quant_linear(device_map=device_map, pack=True)
    self._replace_by_quant_layers(model, quantizers)
    qlayers = get_layers(model, [self.quant_linear])

    for name in qlayers:
        # Unpack the (quantizer, scale, zero, g_idx) tuple for this layer.
        quantizers[name], scale, zero, g_idx = quantizers[name]
        layer_device = qlayers[name].device
        # Packing is only supported on CPU, so move everything there.
        qlayers[name].to("cpu")
        layers[name], scale, zero, g_idx = (
            layers[name].to("cpu"), scale.to("cpu"), zero.to("cpu"), g_idx.to("cpu")
        )
        qlayers[name].pack(layers[name], scale, zero, g_idx)
        # Restore the layer's original device placement.
        qlayers[name].to(layer_device)

    logger.info("Model packed.")
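The device-map fallback in step 2 can be sketched with stand-in objects. This is a pure-Python assumption for illustration; the real code inspects an actual nn.Module:

```python
# Minimal stand-ins for a parameter and a model (assumptions for illustration).
class FakeParam:
    def __init__(self, device):
        self.device = device

class FakeModel:
    def __init__(self, device, hf_device_map=None):
        self._params = [FakeParam(device)]
        if hf_device_map is not None:
            self.hf_device_map = hf_device_map

    def parameters(self):
        return iter(self._params)

def resolve_device_map(model):
    # Mirrors pack_model: prefer hf_device_map, else first-parameter device.
    if hasattr(model, "hf_device_map"):
        return model.hf_device_map
    return {"": next(model.parameters()).device}

sharded = FakeModel("cuda:0", hf_device_map={"embed": 0, "layers": 1})
single = FakeModel("cpu")
```

With a multi-device deployment the existing hf_device_map wins; otherwise the single-entry map keyed by "" covers the whole model.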

Key Implementation Details

  • CPU packing — All packing operations are performed on CPU regardless of the target device. This is because the pack() method involves integer bit manipulation that is most reliably performed on CPU.
  • Device restoration — After packing, each layer is moved back to its original device to maintain the model's device placement.
  • QuantLinear selection — The pack=True flag passed to select_quant_linear() ensures the correct packing implementation is chosen for the target backend.
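As a rough illustration of the integer bit manipulation that pack() performs, the sketch below packs eight 4-bit quantized values into one 32-bit word. This shows the general idea only; it is not optimum's actual memory layout, and the real pack() also consumes scales, zeros, and g_idx:

```python
def pack_int4(values):
    """Pack a list of 4-bit integers (0..15) into 32-bit words."""
    assert len(values) % 8 == 0
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= (v & 0xF) << (4 * j)  # each value occupies 4 bits
        words.append(word)
    return words

def unpack_int4(words, n):
    """Inverse of pack_int4: recover n 4-bit values."""
    values = []
    for word in words:
        for j in range(8):
            values.append((word >> (4 * j)) & 0xF)
    return values[:n]

vals = [3, 15, 0, 7, 8, 1, 12, 5]
packed = pack_int4(vals)          # 8 values -> 1 word
assert unpack_int4(packed, 8) == vals
```

Shifting and masking like this is cheap and exact on CPU integers, which is why the packing loop moves each layer to CPU before calling pack().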
