Principle:FMInference FlexLLMGen Group Quantization Configuration
| Field | Value |
|---|---|
| Sources | Paper: FlexGen, Paper: GPTQ |
| Domains | Quantization, Memory_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A compression technique that reduces tensor storage by representing values with fewer bits (e.g., 4-bit) using group-wise asymmetric quantization with per-group scale and minimum values.
Description
Group-wise quantization divides a tensor along a specified dimension into groups of fixed size, then quantizes each group independently using asymmetric min-max scaling. This yields roughly 4x memory reduction (FP16 to 4-bit, before per-group metadata overhead) with controlled accuracy loss. FlexLLMGen applies this to both model weights and KV cache tensors.
The key characteristics of this approach are:
- Group-wise granularity -- Rather than quantizing an entire tensor with a single scale factor, the tensor is split into small groups (e.g., 64 elements). Each group has its own scale and zero-point, preserving local value distributions.
- Asymmetric quantization -- Uses per-group minimum and maximum values (rather than symmetric zero-centered ranges), which better captures the actual distribution of weights and activations.
- Configurable bit-width -- The number of quantization bits is configurable, with 4-bit being the default for aggressive compression.
- Dimension-aware grouping -- The grouping dimension differs by tensor type: dimension 0 for weights, dimension 2 for KV cache tensors.
- Dual application -- The same quantization scheme can be applied independently to weights and KV cache, each with its own configuration.
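The scheme above can be sketched in a few lines of NumPy. This is a minimal illustration of group-wise asymmetric min-max quantization, not FlexLLMGen's actual implementation; the function names and the requirement that the group size evenly divide the grouped dimension are simplifying assumptions.

```python
import numpy as np

def group_quantize(x, num_bits=4, group_size=64, group_dim=0):
    """Asymmetric min-max quantization with one scale/minimum per group.

    Groups are formed along `group_dim`; in this simplified sketch the
    group size must evenly divide the total number of elements.
    """
    levels = (1 << num_bits) - 1           # 15 for 4-bit
    x = np.moveaxis(x, group_dim, -1)      # group along the last axis
    shape = x.shape
    g = x.reshape(-1, group_size)
    mn = g.min(axis=1, keepdims=True)      # per-group zero-point (minimum)
    mx = g.max(axis=1, keepdims=True)
    scale = np.where(mx > mn, (mx - mn) / levels, 1.0)  # avoid div-by-zero
    q = np.clip(np.round((g - mn) / scale), 0, levels).astype(np.uint8)
    return q, mn, scale, shape, group_dim

def group_dequantize(q, mn, scale, shape, group_dim):
    """Recover the approximate tensor: x_approx = q * scale + mn."""
    x = q.astype(np.float32) * scale + mn
    return np.moveaxis(x.reshape(shape), -1, group_dim)
```

A round trip through these two functions reconstructs each value to within half a quantization step of its group, since rounding to the nearest of 2^n levels introduces at most scale/2 of error.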
Usage
Enable group quantization when GPU memory is insufficient even with CPU/disk offloading, or to reduce I/O bandwidth requirements during offloaded inference. The compression is especially effective for offloaded tensors because it reduces the volume of data that must be transferred between tiers.
Common use cases include:
- Compressing model weights stored on CPU or disk to reduce load times.
- Compressing KV cache to fit longer sequences in available memory.
- Reducing PCIe and NVMe bandwidth requirements during offloaded inference.
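The dual-application setup can be expressed as two instances of one configuration object. The field names below are illustrative, modeled on the parameters described in this document (bit-width, group size, grouping dimension, symmetry); FlexLLMGen's own configuration class may differ.

```python
from dataclasses import dataclass

@dataclass
class GroupQuantConfig:
    """Illustrative group-quantization settings (names are assumptions)."""
    num_bits: int = 4        # quantized bit-width (4-bit default)
    group_size: int = 64     # elements per group
    group_dim: int = 0       # grouping dimension
    symmetric: bool = False  # asymmetric min-max by default

# Weights and KV cache are quantized independently, with the
# grouping dimension chosen per tensor type as described above.
weight_cfg = GroupQuantConfig(group_dim=0)  # weights: dimension 0
cache_cfg = GroupQuantConfig(group_dim=2)   # KV cache: dimension 2
```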
Theoretical Basis
For a group of values x in [x_min, x_max], quantization maps to n-bit integers:
q = round((x - x_min) / (x_max - x_min) * (2^n - 1))
Dequantization recovers approximate values:
x_approx = q * (x_max - x_min) / (2^n - 1) + x_min
The group size controls the granularity of the approximation. Smaller groups produce more accurate results (each group tracks its own min/max) but require more storage for the per-group metadata (scale and zero-point values). A group size of 64 provides a good balance between compression ratio and accuracy for typical LLM weight distributions.
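The group-size tradeoff can be demonstrated numerically. The sketch below (hypothetical helper, random data) measures worst-case round-trip error at several group sizes; since each small group is contained in a larger one, the per-group range, and hence the half-step error bound (x_max - x_min) / (2 * (2^n - 1)), can only shrink as groups get smaller.

```python
import numpy as np

def max_roundtrip_error(x, group_size, num_bits=4):
    """Quantize/dequantize a 1-D array in groups; return worst abs error."""
    levels = (1 << num_bits) - 1
    g = x.reshape(-1, group_size)
    mn = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - mn) / levels
    scale = np.where(scale > 0, scale, 1.0)
    q = np.round((g - mn) / scale)
    return float(np.max(np.abs(q * scale + mn - g)))

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
# Rounding contributes at most half a quantization step per group,
# so smaller groups (tighter min/max ranges) bound the error tighter.
errors = {g: max_roundtrip_error(x, g) for g in (16, 64, 1024)}
```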
For 4-bit quantization with group size 64:
- Each group of 64 FP16 values (128 bytes) is compressed to 64 x 0.5 bytes = 32 bytes plus 4 bytes of metadata (an FP16 scale and an FP16 minimum).
- Effective compression ratio: approximately 3.6x.
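This storage arithmetic checks out directly:

```python
# Per-group storage for 4-bit quantization with group size 64.
fp16_bytes = 64 * 2                # 128 bytes per group, uncompressed
packed_bytes = 64 * 4 // 8         # 32 bytes of packed 4-bit codes
metadata_bytes = 2 + 2             # FP16 scale + FP16 minimum
ratio = fp16_bytes / (packed_bytes + metadata_bytes)
print(f"{ratio:.2f}x")             # prints "3.56x", i.e. roughly 3.6x
```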