
Principle:Huggingface Optimum Quantized Weight Packing

From Leeroopedia

Overview

The process of packing quantized weights into compact integer representations, stored alongside the scale and zero-point parameters needed for dequantization.

Description

After quantization, weights need to be packed into an efficient storage format. Multiple low-bit weights are packed into larger integer types (e.g., eight 4-bit weights into one int32). Each group of weights shares scale and zero-point values for dequantization. The packing replaces original nn.Linear layers with QuantLinear layers that store:

  • Packed weights (qweight) — Multiple quantized values packed into wider integers.
  • Scales (scales) — Per-group scale factors for dequantization.
  • Zeros (qzeros) — Per-group zero-point values, also packed into wider integers.
  • Activation order indices (g_idx) — Optional permutation indices when desc_act=True.
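The bit-level layout of qweight can be sketched in plain NumPy. The exact layout differs per backend kernel; this shows only the generic scheme of packing eight 4-bit values into each int32 word, low bits first (function names are illustrative, not the library's API):

```python
import numpy as np

def pack_int4(q):
    """Pack eight 4-bit values (0..15) into each int32 word, low bits first."""
    q = np.asarray(q, dtype=np.uint32).reshape(-1, 8)
    packed = np.zeros(q.shape[0], dtype=np.uint32)
    for i in range(8):
        packed |= q[:, i] << np.uint32(4 * i)
    return packed.astype(np.int32)  # stored as a signed int32 tensor

def unpack_int4(packed):
    """Recover the eight 4-bit values from each int32 word."""
    packed = np.asarray(packed).astype(np.uint32)
    nibbles = [(packed >> np.uint32(4 * i)) & np.uint32(0xF) for i in range(8)]
    return np.stack(nibbles, axis=1).reshape(-1)

vals = np.array([3, 7, 0, 15, 1, 8, 2, 9], dtype=np.uint32)
word = pack_int4(vals)          # eight weights collapse into one int32
assert np.array_equal(unpack_int4(word), vals)
```

qzeros is packed the same way, which is why both fields appear as int32 tensors in the saved checkpoint.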

The packing process involves:

  1. Selecting the appropriate QuantLinear class based on the quantization config and device map. Different backends (ExLlama, Marlin, Triton) use different packed formats.
  2. Replacing placeholder layers with the pack-capable QuantLinear variant.
  3. Packing weights by moving layers to CPU, calling qlayer.pack(original_layer, scale, zero, g_idx), and then moving back to the original device.
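Step 1 can be pictured as a dispatch on the quantization config and target device. The rule below is an illustrative heuristic only; the function name, conditions, and fallback are assumptions, not the actual dispatch logic in Optimum:

```python
def select_quantlinear(bits, desc_act, device):
    """Illustrative backend choice; the real dispatch lives in the library."""
    if device == "cuda" and bits == 4 and not desc_act:
        return "marlin"    # fast fused 4-bit kernel, no act-order support
    if device == "cuda" and bits == 4:
        return "exllama"   # handles desc_act via the g_idx permutation
    if device == "cuda":
        return "triton"    # general low-bit kernels
    return "cpu"           # plain dequantize-then-matmul fallback
```

The point of the dispatch is that each backend expects its own packed layout, so the QuantLinear class must be fixed before any weights are packed.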

Usage

Used after sequential block quantization to convert the quantized parameters into their final packed format. GPTQQuantizer.quantize_model() calls it automatically as its fourth step.

Theoretical Basis

Linear quantization maps floating-point weights to integers:

q = round(w / scale) + zero_point

Dequantization recovers an approximation of the original weight:

w_approx = (q - zero_point) * scale
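A minimal round trip through these two formulas, with illustrative scale and zero-point values for a 4-bit range (0..15); the reconstruction error is bounded by half a quantization step:

```python
import numpy as np

def quantize(w, scale, zero_point, bits=4):
    """q = round(w / scale) + zero_point, clamped to the b-bit range."""
    q = np.round(w / scale) + zero_point
    return np.clip(q, 0, 2**bits - 1).astype(np.int32)

def dequantize(q, scale, zero_point):
    """w_approx = (q - zero_point) * scale."""
    return (q - zero_point) * scale

w = np.array([-0.41, -0.1, 0.0, 0.23, 0.38])
scale, zero = 0.06, 8           # per-group parameters (illustrative values)
q = quantize(w, scale, zero)
w_hat = dequantize(q, scale, zero)
assert np.max(np.abs(w - w_hat)) <= scale / 2  # half-step error bound
```

In the packed format, each group of weights shares one (scale, zero) pair, so the per-weight storage cost is almost entirely the low-bit integer q.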

Packing multiple low-bit values into wider integers reduces memory footprint and enables efficient GPU kernels:

Bit Width    Values per int32     Compression Ratio vs FP16
2-bit        16                   8x
3-bit        10 (with padding)    ~5.3x
4-bit        8                    4x
8-bit        4                    2x
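The table's figures follow from simple arithmetic (the small per-group scale/zero overhead is ignored, and the 3-bit ratio is quoted against 3 bits per weight even though the 2 padding bits per word reduce it slightly in practice):

```python
def values_per_int32(bits):
    """How many b-bit values fit in one 32-bit word (remainder is padding)."""
    return 32 // bits

def compression_vs_fp16(bits):
    """FP16 spends 16 bits per weight; packed storage spends `bits`."""
    return 16 / bits

for b in (2, 3, 4, 8):
    print(b, values_per_int32(b), round(compression_vs_fp16(b), 1))
```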

Efficient inference kernels (e.g., ExLlama, Marlin, Triton) operate directly on packed representations, fusing the dequantization step into the matrix multiplication to minimize memory bandwidth requirements.
