
Principle:Huggingface Optimum Quantized Weight Packing

From Leeroopedia

Overview

The process of packing quantized weights into compact integer representations, stored alongside the scale and zero-point parameters needed for dequantization.

Description

After quantization, weights need to be packed into an efficient storage format. Multiple low-bit weights are packed into larger integer types (e.g., eight 4-bit weights into one int32). Each group of weights shares scale and zero-point values for dequantization. The packing replaces original nn.Linear layers with QuantLinear layers that store:

  • Packed weights (qweight) — Multiple quantized values packed into wider integers.
  • Scales (scales) — Per-group scale factors for dequantization.
  • Zeros (qzeros) — Per-group zero-point values, also packed into wider integers.
  • Activation order indices (g_idx) — Optional permutation indices when desc_act=True.
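The bit-level layout of qweight can be sketched in plain NumPy. The exact layout differs per backend kernel; this shows only the generic scheme of packing eight 4-bit values into each int32 word, low bits first (function names are illustrative, not the library's API):

```python
import numpy as np

def pack_int4(q):
    """Pack eight 4-bit values (0..15) into each int32 word, low bits first."""
    q = np.asarray(q, dtype=np.uint32).reshape(-1, 8)
    packed = np.zeros(q.shape[0], dtype=np.uint32)
    for i in range(8):
        packed |= q[:, i] << np.uint32(4 * i)
    return packed.astype(np.int32)  # stored as a signed int32 tensor

def unpack_int4(packed):
    """Recover the eight 4-bit values from each int32 word."""
    packed = np.asarray(packed).astype(np.uint32)
    nibbles = [(packed >> np.uint32(4 * i)) & np.uint32(0xF) for i in range(8)]
    return np.stack(nibbles, axis=1).reshape(-1)

vals = np.array([3, 7, 0, 15, 1, 8, 2, 9], dtype=np.uint32)
word = pack_int4(vals)          # eight weights collapse into one int32
assert np.array_equal(unpack_int4(word), vals)
```

qzeros is packed the same way, which is why both fields appear as int32 tensors in the saved checkpoint.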

The packing process involves:

  1. Selecting the appropriate QuantLinear class based on the quantization config and device map. Different backends (ExLlama, Marlin, Triton) use different packed formats.
  2. Replacing placeholder layers with the pack-capable QuantLinear variant.
  3. Packing weights by moving layers to CPU, calling qlayer.pack(original_layer, scale, zero, g_idx), and then moving back to the original device.
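Step 1 can be pictured as a dispatch on the quantization config and target device. The rule below is an illustrative heuristic only; the function name, conditions, and fallback are assumptions, not the actual dispatch logic in Optimum:

```python
def select_quantlinear(bits, desc_act, device):
    """Illustrative backend choice; the real dispatch lives in the library."""
    if device == "cuda" and bits == 4 and not desc_act:
        return "marlin"    # fast fused 4-bit kernel, no act-order support
    if device == "cuda" and bits == 4:
        return "exllama"   # handles desc_act via the g_idx permutation
    if device == "cuda":
        return "triton"    # general low-bit kernels
    return "cpu"           # plain dequantize-then-matmul fallback
```

The point of the dispatch is that each backend expects its own packed layout, so the QuantLinear class must be fixed before any weights are packed.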

Usage

Used after sequential block quantization to convert the quantized parameters into their final packed format. GPTQQuantizer.quantize_model() calls it automatically as its fourth step.

Theoretical Basis

Linear quantization maps floating-point weights to integers:

q = round(w / scale) + zero_point

Dequantization recovers an approximation of the original weight:

w_approx = (q - zero_point) * scale
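A minimal round trip through these two formulas, with illustrative scale and zero-point values for a 4-bit range (0..15); the reconstruction error is bounded by half a quantization step:

```python
import numpy as np

def quantize(w, scale, zero_point, bits=4):
    """q = round(w / scale) + zero_point, clamped to the b-bit range."""
    q = np.round(w / scale) + zero_point
    return np.clip(q, 0, 2**bits - 1).astype(np.int32)

def dequantize(q, scale, zero_point):
    """w_approx = (q - zero_point) * scale."""
    return (q - zero_point) * scale

w = np.array([-0.41, -0.1, 0.0, 0.23, 0.38])
scale, zero = 0.06, 8           # per-group parameters (illustrative values)
q = quantize(w, scale, zero)
w_hat = dequantize(q, scale, zero)
assert np.max(np.abs(w - w_hat)) <= scale / 2  # half-step error bound
```

In the packed format, each group of weights shares one (scale, zero) pair, so the per-weight storage cost is almost entirely the low-bit integer q.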

Packing multiple low-bit values into wider integers reduces memory footprint and enables efficient GPU kernels:

Bit Width    Values per int32     Compression Ratio vs FP16
2-bit        16                   8x
3-bit        10 (with padding)    ~5.3x
4-bit        8                    4x
8-bit        4                    2x
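The table's figures follow from simple arithmetic (the small per-group scale/zero overhead is ignored, and the 3-bit ratio is quoted against 3 bits per weight even though the 2 padding bits per word reduce it slightly in practice):

```python
def values_per_int32(bits):
    """How many b-bit values fit in one 32-bit word (remainder is padding)."""
    return 32 // bits

def compression_vs_fp16(bits):
    """FP16 spends 16 bits per weight; packed storage spends `bits`."""
    return 16 / bits

for b in (2, 3, 4, 8):
    print(b, values_per_int32(b), round(compression_vs_fp16(b), 1))
```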

Efficient inference kernels (e.g., ExLlama, Marlin, Triton) operate directly on packed representations, fusing the dequantization step into the matrix multiplication to minimize memory bandwidth requirements.
