# Heuristic: FMInference FlexLLMGen Weight Compression 4bit
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Inference, Quantization |
| Last Updated | 2026-02-09 12:00 GMT |
## Overview
Enable 4-bit group-wise asymmetric quantization with group_size=64 to reduce weight and KV cache memory by ~70% with negligible accuracy loss.
## Description
FlexLLMGen includes a built-in group-wise quantization engine that compresses float16 tensors to 4 bits per element. The compressed data is stored in packed uint8 format (2 elements per byte) with per-group scale and zero-point metadata. Weights and KV cache can be compressed independently via the `--compress-weight` and `--compress-cache` flags. The default configuration uses `group_size=64`, asymmetric quantization, with grouping along dimension 0 for weights and dimension 2 for the KV cache.
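The scheme described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not FlexLLMGen's actual kernel: the function names `quantize_4bit`/`dequantize_4bit` are hypothetical, and for simplicity it flattens the tensor into consecutive groups rather than grouping along a configurable `group_dim`:

```python
import numpy as np

def quantize_4bit(x, group_size=64):
    """Illustrative group-wise asymmetric 4-bit quantization
    (hypothetical sketch, not FlexLLMGen's actual code)."""
    groups = x.astype(np.float32).reshape(-1, group_size)
    mn = groups.min(axis=1, keepdims=True)        # per-group zero point
    mx = groups.max(axis=1, keepdims=True)
    scale = (mx - mn) / 15.0                      # 16 levels for 4 bits
    scale[scale == 0] = 1.0                       # guard constant groups
    q = np.clip(np.round((groups - mn) / scale), 0, 15).astype(np.uint8)
    packed = (q[:, 0::2] << 4) | q[:, 1::2]       # 2 elements per byte
    return packed, scale.astype(np.float16), mn.astype(np.float16)

def dequantize_4bit(packed, scale, mn, group_size=64):
    q = np.empty((packed.shape[0], group_size), dtype=np.uint8)
    q[:, 0::2] = packed >> 4
    q[:, 1::2] = packed & 0x0F
    return q * scale.astype(np.float32) + mn.astype(np.float32)

w = np.random.default_rng(0).standard_normal((4, 64)).astype(np.float16)
packed, scale, mn = quantize_4bit(w)
w_hat = dequantize_4bit(packed, scale, mn)  # error <= scale/2 per group
```

Each row of `packed` holds 32 bytes for 64 elements, plus one float16 scale and one float16 zero point, matching the per-group layout described under Reasoning below.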
## Usage
Use this heuristic when GPU or CPU memory is the bottleneck and you need to fit a larger model or increase batch size. Weight compression alone reduces weight memory by ~70% (from 2 bytes/param to ~0.6 bytes/param including scale metadata). KV cache compression similarly reduces cache memory, enabling larger effective batch sizes. FlexLLMGen with compression achieves the highest throughput in benchmarks (e.g., 29.12 tok/s for OPT-6.7B vs 25.26 without compression).
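A back-of-envelope estimate makes the memory savings concrete. The parameter count (6.7e9 for OPT-6.7B) and the 36-bytes-per-64-elements layout (described under Reasoning below) are the only inputs:

```python
# Rough weight-memory estimate for OPT-6.7B (param count is an assumption;
# 36 bytes per 64 elements follows the group-wise layout described below).
params = 6.7e9
fp16_gb = params * 2 / 1e9                # 2 bytes/param   -> 13.4 GB
compressed_gb = params * (36 / 64) / 1e9  # ~0.56 bytes/param -> ~3.8 GB
saving = 1 - compressed_gb / fp16_gb      # ~72% reduction
```

On a 16 GB GPU, this is roughly the difference between the weights not fitting at all and fitting with room left over for the KV cache.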
## The Insight (Rule of Thumb)
- Action (weights): Add `--compress-weight` to enable 4-bit weight quantization.
- Action (cache): Add `--compress-cache` to enable 4-bit KV cache quantization.
- Configuration defaults:
  - `num_bits=4`: 4-bit quantization (the only supported value)
  - `group_size=64`: elements per quantization group
  - `group_dim=0` for weights (output dimension), `group_dim=2` for cache (sequence dimension)
  - `symmetric=False`: asymmetric quantization (scale + zero-point)
- Trade-off: ~70% memory reduction for weights. Throughput actually increases in benchmarks because larger batch sizes become possible, outweighing decompression overhead. Negligible accuracy loss.
## Reasoning
The 4-bit group-wise quantization stores each group of 64 elements as:
- Data: 64 elements packed into 32 bytes (uint8, 2 elements per byte)
- Scale metadata: 2 float16 values per group (scale and zero-point = 4 bytes)
Total per group: 32 + 4 = 36 bytes for 64 elements, vs. 128 bytes uncompressed in float16. That is roughly a 3.6x compression ratio, i.e. about a 72% reduction, consistent with the ~70% figure above.
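The per-group arithmetic can be checked directly:

```python
group_size = 64
data_bytes = group_size // 2               # 4 bits/elem, packed 2 per byte
meta_bytes = 2 * 2                         # scale + zero point, float16 each
compressed = data_bytes + meta_bytes       # 36 bytes per group
uncompressed = group_size * 2              # 128 bytes in float16
ratio = uncompressed / compressed          # ~3.6x
reduction = 1 - compressed / uncompressed  # ~72%
```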
Asymmetric quantization (`symmetric=False`) is used because it better handles the non-zero-centered distributions common in transformer weights and attention values. The `group_dim` differs between weights and cache because:
- Weights (`group_dim=0`): grouping along the output dimension preserves row-level statistics, which aligns with how matrix multiplication accesses weights.
- Cache (`group_dim=2`): grouping along the sequence dimension allows incremental compression as new tokens are generated.
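The effect of `group_dim` can be illustrated with a small helper that computes per-group statistics along a chosen axis. The helper and the tensor shapes here are hypothetical, not FlexLLMGen's exact layout:

```python
import numpy as np

def group_minmax(x, group_size=64, group_dim=0):
    """Per-group min/max along a chosen axis (illustrative only)."""
    moved = np.moveaxis(x, group_dim, -1)              # grouping axis last
    g = moved.reshape(*moved.shape[:-1], -1, group_size)
    return g.min(axis=-1), g.max(axis=-1)

# Weights: group along the output dimension (group_dim=0).
w = np.zeros((128, 16))                    # (out_features, in_features)
mn_w, mx_w = group_minmax(w, group_dim=0)  # one (min, max) per 64 rows

# KV cache: group along the sequence dimension (group_dim=2), so each new
# block of 64 tokens can be quantized incrementally as it is generated.
kv = np.zeros((2, 4, 128, 8))              # (layers, heads, seq, head_dim)
mn_kv, mx_kv = group_minmax(kv, group_dim=2)
```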
Benchmark results show compression improves end-to-end throughput:
- OPT-6.7B: 25.26 -> 29.12 tok/s (+15%)
- OPT-30B: 7.32 -> 8.38 tok/s (+14%)
- OPT-175B: 0.69 -> 1.12 tok/s (+62%)
## Code Evidence
`CompressionConfig` dataclass from `flexllmgen/compression.py:11-18`:

```python
@dataclasses.dataclass
class CompressionConfig:
    """Group-wise quantization."""
    num_bits: int
    group_size: int
    group_dim: int
    symmetric: bool
    enabled: bool = True
```
Assertion that only 4-bit quantization of float16 inputs is supported, in `flexllmgen/compression.py:34`:

```python
assert comp_config.num_bits == 4 and dtype == np.float16
```
Default compression configuration in `flexllmgen/apps/completion.py:35-41`:

```python
comp_weight_config=CompressionConfig(
    num_bits=4, group_size=64,
    group_dim=0, symmetric=False),
comp_cache_config=CompressionConfig(
    num_bits=4, group_size=64,
    group_dim=2, symmetric=False))
```