Principle:FMInference FlexLLMGen Group Quantization Configuration
| Field | Value |
|---|---|
| Sources | Paper: FlexGen, Paper: GPTQ |
| Domains | Quantization, Memory_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A compression technique that reduces tensor storage by representing values with fewer bits (e.g., 4-bit) using group-wise asymmetric quantization with per-group scale and minimum values.
Description
Group-wise quantization divides a tensor along a specified dimension into groups of fixed size, then quantizes each group independently using asymmetric min-max scaling. This yields roughly 4x memory reduction (FP16 to 4-bit, before per-group metadata overhead) with controlled accuracy loss. FlexLLMGen applies this to both model weights and KV cache tensors.
The key characteristics of this approach are:
- Group-wise granularity -- Rather than quantizing an entire tensor with a single scale factor, the tensor is split into small groups (e.g., 64 elements). Each group has its own scale and zero-point, preserving local value distributions.
- Asymmetric quantization -- Uses per-group minimum and maximum values (rather than symmetric zero-centered ranges), which better captures the actual distribution of weights and activations.
- Configurable bit-width -- The number of quantization bits is configurable, with 4-bit being the default for aggressive compression.
- Dimension-aware grouping -- The grouping dimension differs by tensor type: dimension 0 for weights, dimension 2 for KV cache tensors.
- Dual application -- The same quantization scheme can be applied independently to weights and KV cache, each with its own configuration.
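The scheme above can be sketched in a few lines of NumPy. This is a minimal illustration of group-wise asymmetric min-max quantization, not FlexLLMGen's actual implementation; the function names and the requirement that the group size evenly divide the grouped dimension are simplifying assumptions.

```python
import numpy as np

def group_quantize(x, num_bits=4, group_size=64, group_dim=0):
    """Asymmetric min-max quantization with one scale/minimum per group.

    Groups are formed along `group_dim`; in this simplified sketch the
    group size must evenly divide the total number of elements.
    """
    levels = (1 << num_bits) - 1           # 15 for 4-bit
    x = np.moveaxis(x, group_dim, -1)      # group along the last axis
    shape = x.shape
    g = x.reshape(-1, group_size)
    mn = g.min(axis=1, keepdims=True)      # per-group zero-point (minimum)
    mx = g.max(axis=1, keepdims=True)
    scale = np.where(mx > mn, (mx - mn) / levels, 1.0)  # avoid div-by-zero
    q = np.clip(np.round((g - mn) / scale), 0, levels).astype(np.uint8)
    return q, mn, scale, shape, group_dim

def group_dequantize(q, mn, scale, shape, group_dim):
    """Recover the approximate tensor: x_approx = q * scale + mn."""
    x = q.astype(np.float32) * scale + mn
    return np.moveaxis(x.reshape(shape), -1, group_dim)
```

A round trip through these two functions reconstructs each value to within half a quantization step of its group, since rounding to the nearest of 2^n levels introduces at most scale/2 of error.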
Usage
Enable group quantization when GPU memory is insufficient even with CPU/disk offloading, or to reduce I/O bandwidth requirements during offloaded inference. The compression is especially effective for offloaded tensors because it reduces the volume of data that must be transferred between tiers.
Common use cases include:
- Compressing model weights stored on CPU or disk to reduce load times.
- Compressing KV cache to fit longer sequences in available memory.
- Reducing PCIe and NVMe bandwidth requirements during offloaded inference.
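The dual-application setup can be expressed as two instances of one configuration object. The field names below are illustrative, modeled on the parameters described in this document (bit-width, group size, grouping dimension, symmetry); FlexLLMGen's own configuration class may differ.

```python
from dataclasses import dataclass

@dataclass
class GroupQuantConfig:
    """Illustrative group-quantization settings (names are assumptions)."""
    num_bits: int = 4        # quantized bit-width (4-bit default)
    group_size: int = 64     # elements per group
    group_dim: int = 0       # grouping dimension
    symmetric: bool = False  # asymmetric min-max by default

# Weights and KV cache are quantized independently, with the
# grouping dimension chosen per tensor type as described above.
weight_cfg = GroupQuantConfig(group_dim=0)  # weights: dimension 0
cache_cfg = GroupQuantConfig(group_dim=2)   # KV cache: dimension 2
```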
Theoretical Basis
For a group of values x in [x_min, x_max], quantization maps to n-bit integers:
q = round((x - x_min) / (x_max - x_min) * (2^n - 1))
Dequantization recovers approximate values:
x_approx = q * (x_max - x_min) / (2^n - 1) + x_min
The group size controls the granularity of the approximation. Smaller groups produce more accurate results (each group tracks its own min/max) but require more storage for the per-group metadata (scale and zero-point values). A group size of 64 provides a good balance between compression ratio and accuracy for typical LLM weight distributions.
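The group-size tradeoff can be demonstrated numerically. The sketch below (hypothetical helper, random data) measures worst-case round-trip error at several group sizes; since each small group is contained in a larger one, the per-group range, and hence the half-step error bound (x_max - x_min) / (2 * (2^n - 1)), can only shrink as groups get smaller.

```python
import numpy as np

def max_roundtrip_error(x, group_size, num_bits=4):
    """Quantize/dequantize a 1-D array in groups; return worst abs error."""
    levels = (1 << num_bits) - 1
    g = x.reshape(-1, group_size)
    mn = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - mn) / levels
    scale = np.where(scale > 0, scale, 1.0)
    q = np.round((g - mn) / scale)
    return float(np.max(np.abs(q * scale + mn - g)))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
# Rounding contributes at most half a quantization step per group,
# so smaller groups (tighter min/max ranges) bound the error tighter.
errors = {g: max_roundtrip_error(x, g) for g in (16, 64, 1024)}
```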
For 4-bit quantization with group size 64:
- Each group of 64 FP16 values (128 bytes) is compressed to 64 x 0.5 bytes = 32 bytes plus 4 bytes of metadata (an FP16 scale and an FP16 minimum).
- Effective compression ratio: approximately 3.6x.
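This storage arithmetic checks out directly:

```python
# Per-group storage for 4-bit quantization with group size 64.
fp16_bytes = 64 * 2                # 128 bytes per group, uncompressed
packed_bytes = 64 * 4 // 8         # 32 bytes of packed 4-bit codes
metadata_bytes = 2 + 2             # FP16 scale + FP16 minimum
ratio = fp16_bytes / (packed_bytes + metadata_bytes)
print(f"{ratio:.2f}x")             # prints "3.56x", i.e. roughly 3.6x
```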