
Principle:FMInference FlexLLMGen Group Quantization Configuration

From Leeroopedia


Sources: Paper: FlexGen; Paper: GPTQ
Domains: Quantization, Memory_Optimization
Last Updated: 2026-02-09 00:00 GMT

Overview

A compression technique that reduces tensor storage by representing values with fewer bits (e.g., 4-bit) using group-wise asymmetric quantization with per-group scale and minimum values.

Description

Group-wise quantization divides a tensor along a specified dimension into groups of fixed size, then quantizes each group independently using asymmetric min-max scaling. This yields a nominal 4x memory reduction (FP16 to 4-bit) with controlled accuracy loss; the effective ratio is slightly lower once per-group metadata is counted. FlexLLMGen applies this to both model weights and KV cache tensors.

The key characteristics of this approach are:

  • Group-wise granularity -- Rather than quantizing an entire tensor with a single scale factor, the tensor is split into small groups (e.g., 64 elements). Each group has its own scale and zero-point, preserving local value distributions.
  • Asymmetric quantization -- Uses per-group minimum and maximum values (rather than symmetric zero-centered ranges), which better captures the actual distribution of weights and activations.
  • Configurable bit-width -- The number of quantization bits is configurable, with 4-bit being the default for aggressive compression.
  • Dimension-aware grouping -- The grouping dimension differs by tensor type: dimension 0 for weights, dimension 2 for KV cache tensors.
  • Dual application -- The same quantization scheme can be applied independently to weights and KV cache, each with its own configuration.
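The characteristics above can be collected into a single configuration object. The sketch below is illustrative: the class and field names are loosely modeled on FlexGen's compression settings, not a verbatim copy of its API.

```python
from dataclasses import dataclass

@dataclass
class GroupQuantConfig:
    # Illustrative config; field names are assumptions, not FlexGen's exact API.
    num_bits: int = 4        # configurable bit-width; 4-bit is the aggressive default
    group_size: int = 64     # elements per group sharing one scale/min pair
    group_dim: int = 0       # dim 0 for weights; dim 2 for KV cache tensors
    symmetric: bool = False  # asymmetric (min-max) quantization

# Dual application: weights and KV cache each carry their own config.
weight_config = GroupQuantConfig(group_dim=0)
cache_config = GroupQuantConfig(group_dim=2)
```

Keeping the two configs independent is what lets weights and KV cache be compressed (or left uncompressed) separately.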

Usage

Enable group quantization when GPU memory is insufficient even with CPU/disk offloading, or to reduce I/O bandwidth requirements during offloaded inference. The compression is especially effective for offloaded tensors because it reduces the volume of data that must be transferred between tiers.

Common use cases include:

  • Compressing model weights stored on CPU or disk to reduce load times.
  • Compressing KV cache to fit longer sequences in available memory.
  • Reducing PCIe and NVMe bandwidth requirements during offloaded inference.
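In the FlexGen codebase these use cases are typically enabled via command-line flags. The invocation below assumes the flag names found in the FlexGen repository (`--compress-weight`, `--compress-cache`); verify them against your installed version.

```shell
# Offloaded OPT-30B inference with both weight and KV-cache compression.
# Flag names assumed from the FlexGen repository; check your local version.
python -m flexgen.flex_opt \
  --model facebook/opt-30b \
  --percent 0 100 100 0 100 0 \
  --compress-weight \
  --compress-cache
```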

Theoretical Basis

For a group of values x in [x_min, x_max], quantization maps to n-bit integers:

q = round((x - x_min) / (x_max - x_min) * (2^n - 1))

Dequantization recovers approximate values:

x_approx = q * (x_max - x_min) / (2^n - 1) + x_min
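The two formulas above can be exercised directly. This NumPy sketch (function names are my own) quantizes a 1-D tensor group by group and then reconstructs it:

```python
import numpy as np

def group_quantize(x, num_bits=4, group_size=64):
    # Split into fixed-size groups, then apply asymmetric
    # min-max quantization independently per group.
    groups = x.astype(np.float32).reshape(-1, group_size)
    x_min = groups.min(axis=1, keepdims=True)
    x_max = groups.max(axis=1, keepdims=True)
    span = np.maximum(x_max - x_min, 1e-8)  # guard against constant groups
    levels = 2**num_bits - 1
    q = np.round((groups - x_min) / span * levels).astype(np.uint8)
    return q, x_min, x_max

def group_dequantize(q, x_min, x_max, num_bits=4):
    levels = 2**num_bits - 1
    span = np.maximum(x_max - x_min, 1e-8)
    return q.astype(np.float32) * span / levels + x_min

x = np.random.randn(256).astype(np.float32)
q, lo, hi = group_quantize(x)
x_hat = group_dequantize(q, lo, hi)
# Per-element error is bounded by half a quantization step of its group.
```

Because rounding is to the nearest of 2^n levels, the reconstruction error of each element is at most half a quantization step, (x_max - x_min) / (2(2^n - 1)), for its group.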

The group size controls the granularity of the approximation. Smaller groups produce more accurate results (each group tracks its own min/max) but require more storage for the per-group metadata (scale and zero-point values). A group size of 64 provides a good balance between compression ratio and accuracy for typical LLM weight distributions.

For 4-bit quantization with group size 64:

  • Each group of 64 FP16 values (128 bytes) is compressed to 64 x 0.5 bytes = 32 bytes plus 4 bytes of metadata.
  • Effective compression ratio: approximately 3.6x.
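The arithmetic behind that ratio, assuming the 4 bytes of metadata are an FP16 scale plus an FP16 minimum per group (an assumption about the metadata layout):

```python
group_size = 64
fp16_bytes = group_size * 2          # 128 bytes uncompressed (FP16)
packed_bytes = group_size * 4 // 8   # 32 bytes of 4-bit codes
meta_bytes = 2 + 2                   # FP16 scale + FP16 minimum (assumed layout)
ratio = fp16_bytes / (packed_bytes + meta_bytes)
print(round(ratio, 2))  # prints 3.56
```

Shrinking the group size raises the metadata share and lowers this ratio, which is the accuracy/compression trade-off described above.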
