Principle: Huggingface Optimum Quantized Weight Packing
Overview
Process of compressing quantized weights into compact integer representations with associated scale and zero-point parameters.
Description
After quantization, weights need to be packed into an efficient storage format. Multiple low-bit weights are packed into larger integer types (e.g., eight 4-bit weights into one int32). Each group of weights shares scale and zero-point values for dequantization. The packing replaces original `nn.Linear` layers with `QuantLinear` layers that store:
- Packed weights (`qweight`) — multiple quantized values packed into wider integers.
- Scales (`scales`) — per-group scale factors for dequantization.
- Zeros (`qzeros`) — per-group zero-point values, also packed into wider integers.
- Activation order indices (`g_idx`) — optional permutation indices when `desc_act=True`.
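To make the storage layout concrete, the following sketch shows plausible tensor shapes for a 4-bit layer with group size 128. The shapes and dtypes here are illustrative assumptions, not a verified dump of the optimum implementation:

```python
import numpy as np

# Illustrative shapes for a 4-bit quantized linear layer with
# in_features=4096, out_features=4096, group_size=128 (assumed values).
bits, in_features, out_features, group_size = 4, 4096, 4096, 128
pack_factor = 32 // bits            # 8 four-bit values fit in one int32
n_groups = in_features // group_size

# Packed weights: each int32 row holds `pack_factor` quantized values.
qweight = np.zeros((in_features // pack_factor, out_features), dtype=np.int32)
# One scale per (group, output channel).
scales = np.zeros((n_groups, out_features), dtype=np.float16)
# Zero-points are also packed into int32s along the output dimension.
qzeros = np.zeros((n_groups, out_features // pack_factor), dtype=np.int32)
# g_idx maps each input feature to its quantization group.
g_idx = np.arange(in_features, dtype=np.int32) // group_size
```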
The packing process involves:
- Selecting the appropriate `QuantLinear` class based on the quantization config and device map. Different backends (ExLlama, Marlin, Triton) use different packed formats.
- Replacing placeholder layers with the pack-capable `QuantLinear` variant.
- Packing weights by moving layers to CPU, calling `qlayer.pack(original_layer, scale, zero, g_idx)`, and then moving back to the original device.
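The steps above can be sketched with a toy stand-in. `ToyQuantLinear` and `pack_model` are illustrative assumptions, not the real optimum classes, and the CPU round-trip is elided:

```python
import numpy as np

class ToyQuantLinear:
    """Minimal stand-in for a pack-capable QuantLinear (illustrative only)."""
    def __init__(self, bits=4):
        self.bits = bits
        self.qweight = None  # integer codes (real kernels bit-pack these)
        self.scale = None
        self.zero = None

    def pack(self, weight, scale, zero):
        # Quantize the float weights and store the integer codes.
        qmax = 2 ** self.bits - 1
        q = np.clip(np.round(weight / scale) + zero, 0, qmax)
        self.qweight, self.scale, self.zero = q.astype(np.int32), scale, zero

def pack_model(model, quantizers):
    # model: dict of name -> float weight matrix
    # quantizers: dict of name -> (scale, zero) from the quantization pass
    for name, (scale, zero) in quantizers.items():
        qlayer = ToyQuantLinear()              # steps 1-2: create the variant
        qlayer.pack(model[name], scale, zero)  # step 3: pack the weights
        model[name] = qlayer                   # swap in for the float layer
    return model
```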
Usage
Use after sequential block quantization to convert quantized parameters into the final packed format. This is called automatically by `GPTQQuantizer.quantize_model()` as Step 4.
Theoretical Basis
Linear quantization maps floating-point weights to integers:
q = round(w / scale) + zero_point
Dequantization recovers an approximation of the original weight:
w_approx = (q - zero_point) * scale
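A small round-trip makes the two formulas concrete. The weight values and group here are made up for illustration; the scale and zero-point are derived per group exactly as in the equations above:

```python
import numpy as np

# One group of float weights to quantize to 4-bit unsigned integers.
w = np.array([0.12, -0.30, 0.45, 0.01], dtype=np.float32)
bits = 4
qmax = 2 ** bits - 1                   # 15 for 4-bit

# One scale and zero-point shared by the whole group.
scale = (w.max() - w.min()) / qmax
zero_point = round(-w.min() / scale)

# Quantize: q = round(w / scale) + zero_point, clamped to the int range.
q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.int32)

# Dequantize: w_approx = (q - zero_point) * scale
w_approx = (q - zero_point) * scale
```

The reconstruction error of each weight is bounded by half a quantization step (`scale / 2`), which is why per-group scales matter: a smaller group adapts the scale to the local weight range.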
Packing multiple low-bit values into wider integers reduces memory footprint and enables efficient GPU kernels:
| Bit Width | Values per int32 | Compression Ratio vs FP16 |
|---|---|---|
| 2-bit | 16 | 8x |
| 3-bit | 10 (with padding) | ~5.3x |
| 4-bit | 8 | 4x |
| 8-bit | 4 | 2x |
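The 4-bit row of the table can be demonstrated with plain bit-shifting. The layout below (value `i` occupying bits `4*i` to `4*i+3`) is one common convention, chosen here for illustration; actual backends may order bits differently:

```python
def pack_int4(vals):
    """Pack eight 4-bit values (0..15) into a single 32-bit word."""
    assert len(vals) == 8 and all(0 <= v < 16 for v in vals)
    packed = 0
    for i, v in enumerate(vals):
        packed |= v << (4 * i)   # value i lands in bits [4*i, 4*i+4)
    return packed

def unpack_int4(packed):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(packed >> (4 * i)) & 0xF for i in range(8)]
```

Eight 4-bit values in 32 bits versus eight FP16 values in 128 bits gives the 4x compression ratio listed above.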
Efficient inference kernels (e.g., ExLlama, Marlin, Triton) operate directly on packed representations, fusing the dequantization step into the matrix multiplication to minimize memory bandwidth requirements.
Related
- implemented_by → Implementation:Huggingface_Optimum_GPTQQuantizer_Pack_Model