Implementation: Hugging Face Optimum GPTQQuantizer Pack Model
Overview
Packs quantized weights into compact QuantLinear layers by replacing the original linear layers with their quantized, packed equivalents.
Source
File: optimum/gptq/quantizer.py Lines: 676-714
Signature
```python
def pack_model(
    self,
    model: nn.Module,
    quantizers: Dict[str, Tuple],
) -> None:
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `model` | `nn.Module` | The model to pack. Must have already been quantized (i.e., the `quantizers` dict populated by the quantization loop). |
| `quantizers` | `Dict[str, Tuple]` | Mapping of layer names to quantization data tuples: `(quantizer, scale, zero, g_idx)`. |
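To make the tuple layout concrete, here is a hypothetical `quantizers` entry for a single layer. The layer name, tensor shapes, and group size are illustrative assumptions (4-bit quantization, group size 128, a 4096×4096 linear layer), not values taken from optimum, and the quantizer object is stubbed:

```python
import torch

# Illustrative shapes only (assumed, not from the optimum source):
# out_features=4096, in_features=4096, group_size=128
out_features, in_features, group_size = 4096, 4096, 128
n_groups = in_features // group_size  # 32 quantization groups

quantizer = object()  # stand-in for the per-layer GPTQ quantizer object
scale = torch.rand(n_groups, out_features)       # per-group dequantization scales
zero = torch.zeros(n_groups, out_features)       # per-group zero points
g_idx = torch.arange(in_features) // group_size  # maps each input column to its group

quantizers = {
    "model.layers.0.self_attn.q_proj": (quantizer, scale, zero, g_idx),
}
```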
Behavior
The packing process follows these steps:
- **Collect layers** — Calls `get_layers(model)` to find all linear layers, then filters to only those present in the `quantizers` dict.
- **Determine device map** — If the model has an `hf_device_map` attribute (multi-device deployment), uses that; otherwise, creates a single-device map from the model's first parameter device.
- **Select pack-capable QuantLinear** — Calls `self.select_quant_linear(device_map=device_map, pack=True)` to get the `QuantLinear` class appropriate for packing.
- **Replace layers** — Calls `self._replace_by_quant_layers(model, quantizers)` to swap in the new `QuantLinear` instances.
- **Pack each layer** — For each quantized layer:
  - Saves the layer's original device.
  - Moves the `QuantLinear` layer and all quantization data (original layer, scale, zero, g_idx) to CPU.
  - Calls `qlayers[name].pack(layers[name], scale, zero, g_idx)` to pack the quantized weights.
  - Moves the packed layer back to the original device.
```python
def pack_model(self, model: nn.Module, quantizers: Dict[str, Tuple]) -> None:
    logger.info("Packing model...")
    layers = get_layers(model)
    layers = {n: layers[n] for n in quantizers}
    if hasattr(model, "hf_device_map"):
        device_map = model.hf_device_map
    else:
        device_map = {"": next(model.parameters()).device}
    self.select_quant_linear(device_map=device_map, pack=True)
    self._replace_by_quant_layers(model, quantizers)
    qlayers = get_layers(model, [self.quant_linear])
    for name in qlayers:
        # Unpack the 4-tuple; quantizers[name] is rebound to the quantizer object
        quantizers[name], scale, zero, g_idx = quantizers[name]
        # Packing is only supported on CPU, so move everything there first
        layer_device = qlayers[name].device
        qlayers[name].to("cpu")
        layers[name], scale, zero, g_idx = (
            layers[name].to("cpu"), scale.to("cpu"), zero.to("cpu"), g_idx.to("cpu")
        )
        qlayers[name].pack(layers[name], scale, zero, g_idx)
        # Restore the packed layer to its original device
        qlayers[name].to(layer_device)
    logger.info("Model packed.")
```
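The listing relies on the `get_layers` helper, whose implementation is not reproduced on this page. A simplified stand-in, assuming it recursively maps dotted module names to `nn.Linear` layers, might look like the following sketch (the function name `collect_linear_layers` is invented here; optimum's real helper also matches other layer types via its second argument):

```python
import torch.nn as nn

def collect_linear_layers(module, prefix=""):
    """Recursively map dotted layer names to nn.Linear modules
    (simplified, hypothetical stand-in for optimum's get_layers)."""
    found = {}
    for name, child in module.named_children():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(child, nn.Linear):
            found[full_name] = child
        else:
            found.update(collect_linear_layers(child, full_name))
    return found

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))
layers = collect_linear_layers(model)        # names "0" and "2" map to the Linear layers
quantizers = {"2": None}                     # pretend only layer "2" was quantized
layers = {n: layers[n] for n in quantizers}  # keep only quantized layers, as pack_model does
```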
Key Implementation Details
- **CPU packing** — All packing operations are performed on CPU regardless of the target device, because the `pack()` method involves integer bit manipulation that is most reliably performed on CPU.
- **Device restoration** — After packing, each layer is moved back to its original device to maintain the model's device placement.
- **QuantLinear selection** — The `pack=True` flag passed to `select_quant_linear()` ensures the correct packing implementation is chosen for the target backend.
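To illustrate the kind of integer bit manipulation that packing involves, here is a minimal sketch (not the optimum implementation) of the underlying storage trick: fitting eight unsigned 4-bit values into a single int32 word, which is how `QuantLinear`-style layers keep their quantized weight buffers compact:

```python
import torch

def pack_int4(values: torch.Tensor) -> torch.Tensor:
    """Pack groups of 8 unsigned 4-bit ints (0..15) into int32 words."""
    assert values.numel() % 8 == 0
    v = values.to(torch.int32).reshape(-1, 8)
    packed = torch.zeros(v.shape[0], dtype=torch.int32)
    for i in range(8):
        packed |= v[:, i] << (4 * i)  # each value occupies its own 4-bit slot
    return packed

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse: recover the 8 nibbles stored in each int32 word."""
    out = torch.stack([(packed >> (4 * i)) & 0xF for i in range(8)], dim=1)
    return out.reshape(-1)

vals = torch.tensor([3, 15, 0, 7, 1, 9, 12, 5])
assert torch.equal(unpack_int4(pack_int4(vals)), vals)  # lossless round trip
```

Shift-based packing like this is cheap and exact on CPU integer tensors, which is consistent with the implementation's choice to move layers to CPU before calling `pack()`.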