Heuristic: Hugging Face Optimum GPTQ Quantization Defaults
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Optimization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Recommended default values for GPTQ quantization: `group_size=128`, `damp_percent=0.1`, `batch_size=1`, `sym=True`, with automatic format upgrade to `gptq_v2` when using `gptqmodel`.
Description
The `GPTQQuantizer` class defines several important default parameters that represent best-practice values for GPTQ quantization. These defaults balance quantization quality, memory usage, and compatibility. Understanding these defaults is critical because changing them can significantly affect model quality, quantization speed, and compatibility with inference backends.
Usage
Apply this heuristic when configuring GPTQ quantization parameters. The defaults are optimized for the common case (symmetric 4-bit quantization with group size 128). Deviate from these defaults only when you have specific requirements (e.g., asymmetric quantization for better accuracy, or per-column quantization with group_size=-1).
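As a dependency-free illustration of these defaults, the sketch below mirrors the parameter names from `GPTQQuantizer`'s signature in a plain dict; the helper `make_gptq_kwargs` is hypothetical, not part of the Optimum API:

```python
# Recommended defaults from this heuristic; names mirror the GPTQQuantizer
# signature, but this dict is purely illustrative.
GPTQ_DEFAULTS = {
    "bits": 4,            # 4-bit is the common case this heuristic targets
    "group_size": 128,    # per-group quantization; -1 = per-column
    "damp_percent": 0.1,  # 10% Hessian-diagonal dampening
    "desc_act": False,
    "sym": True,          # symmetric: works with both auto-gptq and gptqmodel
    "true_sequential": True,
    "batch_size": 1,      # conservative, to avoid OOM during calibration
}

def make_gptq_kwargs(**overrides):
    """Merge user overrides onto the recommended defaults (hypothetical helper)."""
    cfg = dict(GPTQ_DEFAULTS)
    cfg.update(overrides)
    return cfg
```

Deviating from a default is then an explicit override, e.g. `make_gptq_kwargs(sym=False)` for asymmetric quantization.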
The Insight (Rule of Thumb)
- Action: Use the default `group_size=128` for GPTQ quantization.
  - Value: 128 (per-group quantization). Use `-1` for per-column quantization.
  - Trade-off: Smaller group sizes improve accuracy but increase the quantization metadata size.
- Action: Use the default `damp_percent=0.1` (10% dampening on the Hessian diagonal).
  - Value: 0.1.
  - Trade-off: Higher values improve numerical stability but may slightly reduce quantization accuracy.
- Action: Use `batch_size=1` during quantization for safety.
  - Value: 1 (default).
  - Trade-off: `batch_size=1` is slower but safer; higher values speed up quantization but increase VRAM usage, so raise it only if VRAM allows.
- Action: Use `sym=True` (symmetric quantization) for maximum compatibility.
  - Value: True (default). Asymmetric (`sym=False`) requires `gptqmodel` and is not compatible with `auto-gptq`.
  - Trade-off: Asymmetric quantization can be more accurate but limits backend compatibility.
- Action: Let `gptqmodel` auto-upgrade the format to `gptq_v2`.
  - Value: When `gptqmodel` is available, the format is automatically set to `gptq_v2` internally (even if the user specifies `gptq`); the output is converted back to v1 for compatibility.
  - Trade-off: The v2 format supports asymmetric quantization; v1 is more widely compatible.
- Action: Set `model.config.use_cache = False` during quantization.
  - Value: Always disabled during quantization and restored afterward.
  - Trade-off: None. The KV cache must be disabled during quantization so it does not interfere with activation capture.
Reasoning
The defaults were chosen based on the original GPTQ paper and practical experience:
- group_size=128 provides a good balance between quantization granularity and overhead. Per-column quantization (group_size=-1) loses fine-grained grouping but has zero overhead.
- damp_percent=0.1 adds 10% of the mean Hessian diagonal value to each diagonal entry to stabilize the inverse computation, preventing numerical issues with near-singular matrices.
- batch_size=1 is conservative to avoid OOM errors during quantization, since the quantizer needs to store intermediate activations for all calibration samples in the batch.
- sym=True is the default because it is supported by both `auto-gptq` and `gptqmodel`. Asymmetric quantization was only added with `gptqmodel` support.
- Format auto-upgrade: The code at `quantizer.py:403-405` automatically upgrades to gptq_v2 internally because gptqmodel's internal representation uses v2 for asymmetric support. The output is converted back to v1 at `quantizer.py:739-741` for maximum compatibility.
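The dampening step can be sketched numerically. This is a dependency-free illustration; the GPTQ reference implementation does the equivalent on a torch tensor Hessian (roughly `damp = damp_percent * torch.mean(torch.diag(H))`):

```python
def damp_hessian_diagonal(h_diag, damp_percent=0.1):
    """Add damp_percent * mean(diag(H)) to each diagonal entry of H.

    Dependency-free sketch of GPTQ's Hessian dampening; real
    implementations operate on the full torch Hessian matrix.
    """
    damp = damp_percent * sum(h_diag) / len(h_diag)
    return [d + damp for d in h_diag]
```

For example, a diagonal of `[1.0, 3.0]` has mean 2.0, so each entry is shifted by 0.2, keeping near-zero entries safely away from singularity.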
Code evidence from `optimum/gptq/quantizer.py:69-87`:
```python
def __init__(
    self,
    bits: int,
    dataset: Optional[Union[List[str], str]] = None,
    group_size: int = 128,
    damp_percent: float = 0.1,
    desc_act: bool = False,
    sym: bool = True,
    true_sequential: bool = True,
    batch_size: int = 1,
    ...
):
```
Format auto-upgrade from `optimum/gptq/quantizer.py:403-405`:
```python
# gptqmodel internal is gptq_v2 for asym support, gptq(v1) can only support sym=True
if is_gptqmodel_available() and self.format != "gptq_v2":
    self.format = "gptq_v2"
```