# Heuristic: FMInference FlexLLMGen Weight Compression 4bit
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Inference, Quantization |
| Last Updated | 2026-02-09 12:00 GMT |
## Overview
Enable 4-bit group-wise asymmetric quantization with group_size=64 to reduce weight and KV cache memory by ~70% with negligible accuracy loss.
## Description
FlexLLMGen includes a built-in group-wise quantization engine that compresses float16 tensors to 4 bits per element. The compressed data is stored in packed uint8 format (2 elements per byte) with per-group scale and zero-point metadata. Weights and KV cache can be compressed independently via the `--compress-weight` and `--compress-cache` flags. The default configuration uses `group_size=64`, asymmetric quantization, with grouping along dimension 0 for weights and dimension 2 for the KV cache.
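The scheme described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not FlexLLMGen's actual kernel: the function names `quantize_4bit`/`dequantize_4bit` are hypothetical, and for simplicity it flattens the tensor into consecutive groups rather than grouping along a configurable `group_dim`:

```python
import numpy as np

def quantize_4bit(x, group_size=64):
    """Illustrative group-wise asymmetric 4-bit quantization
    (hypothetical sketch, not FlexLLMGen's actual code)."""
    groups = x.astype(np.float32).reshape(-1, group_size)
    mn = groups.min(axis=1, keepdims=True)        # per-group zero point
    mx = groups.max(axis=1, keepdims=True)
    scale = (mx - mn) / 15.0                      # 16 levels for 4 bits
    scale[scale == 0] = 1.0                       # guard constant groups
    q = np.clip(np.round((groups - mn) / scale), 0, 15).astype(np.uint8)
    packed = (q[:, 0::2] << 4) | q[:, 1::2]       # 2 elements per byte
    return packed, scale.astype(np.float16), mn.astype(np.float16)

def dequantize_4bit(packed, scale, mn, group_size=64):
    q = np.empty((packed.shape[0], group_size), dtype=np.uint8)
    q[:, 0::2] = packed >> 4
    q[:, 1::2] = packed & 0x0F
    return q * scale.astype(np.float32) + mn.astype(np.float32)

w = np.random.default_rng(0).standard_normal((4, 64)).astype(np.float16)
packed, scale, mn = quantize_4bit(w)
w_hat = dequantize_4bit(packed, scale, mn)  # error <= scale/2 per group
```

Each row of `packed` holds 32 bytes for 64 elements, plus one float16 scale and one float16 zero point, matching the per-group layout described under Reasoning below.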
## Usage
Use this heuristic when GPU or CPU memory is the bottleneck and you need to fit a larger model or increase batch size. Weight compression alone reduces weight memory by ~70% (from 2 bytes/param to ~0.6 bytes/param including scale metadata). KV cache compression similarly reduces cache memory, enabling larger effective batch sizes. FlexLLMGen with compression achieves the highest throughput in benchmarks (e.g., 29.12 tok/s for OPT-6.7B vs 25.26 without compression).
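A back-of-envelope estimate makes the memory savings concrete. The parameter count (6.7e9 for OPT-6.7B) and the 36-bytes-per-64-elements layout (described under Reasoning below) are the only inputs:

```python
# Rough weight-memory estimate for OPT-6.7B (param count is an assumption;
# 36 bytes per 64 elements follows the group-wise layout described below).
params = 6.7e9
fp16_gb = params * 2 / 1e9                # 2 bytes/param   -> 13.4 GB
compressed_gb = params * (36 / 64) / 1e9  # ~0.56 bytes/param -> ~3.8 GB
saving = 1 - compressed_gb / fp16_gb      # ~72% reduction
```

On a 16 GB GPU, this is roughly the difference between the weights not fitting at all and fitting with room left over for the KV cache.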
## The Insight (Rule of Thumb)
- Action (weights): Add `--compress-weight` to enable 4-bit weight quantization.
- Action (cache): Add `--compress-cache` to enable 4-bit KV cache quantization.
- Configuration defaults:
  - `num_bits=4`: 4-bit quantization (the only supported value)
  - `group_size=64`: elements per quantization group
  - `group_dim=0` for weights (output dimension), `group_dim=2` for cache (sequence dimension)
  - `symmetric=False`: asymmetric quantization (scale + zero-point)
- Trade-off: ~70% memory reduction for weights. Throughput actually increases in benchmarks because larger batch sizes become possible, outweighing decompression overhead. Negligible accuracy loss.
## Reasoning
The 4-bit group-wise quantization stores each group of 64 elements as:
- Data: 64 elements packed into 32 bytes (uint8, 2 elements per byte)
- Scale metadata: 2 float16 values per group (scale and zero-point = 4 bytes)
Total per group: 32 + 4 = 36 bytes for 64 elements, vs. 128 bytes uncompressed in float16. That is roughly a 3.6x compression ratio, i.e. about a 72% reduction, consistent with the ~70% figure above.
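The per-group arithmetic can be checked directly:

```python
group_size = 64
data_bytes = group_size // 2               # 4 bits/elem, packed 2 per byte
meta_bytes = 2 * 2                         # scale + zero point, float16 each
compressed = data_bytes + meta_bytes       # 36 bytes per group
uncompressed = group_size * 2              # 128 bytes in float16
ratio = uncompressed / compressed          # ~3.6x
reduction = 1 - compressed / uncompressed  # ~72%
```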
Asymmetric quantization (`symmetric=False`) is used because it better handles the non-zero-centered distributions common in transformer weights and attention values. The `group_dim` differs between weights and cache because:
- Weights (`group_dim=0`): grouping along the output dimension preserves row-level statistics, which aligns with how matrix multiplication accesses weights.
- Cache (`group_dim=2`): grouping along the sequence dimension allows incremental compression as new tokens are generated.
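The effect of `group_dim` can be illustrated with a small helper that computes per-group statistics along a chosen axis. The helper and the tensor shapes here are hypothetical, not FlexLLMGen's exact layout:

```python
import numpy as np

def group_minmax(x, group_size=64, group_dim=0):
    """Per-group min/max along a chosen axis (illustrative only)."""
    moved = np.moveaxis(x, group_dim, -1)              # grouping axis last
    g = moved.reshape(*moved.shape[:-1], -1, group_size)
    return g.min(axis=-1), g.max(axis=-1)

# Weights: group along the output dimension (group_dim=0).
w = np.zeros((128, 16))                    # (out_features, in_features)
mn_w, mx_w = group_minmax(w, group_dim=0)  # one (min, max) per 64 rows

# KV cache: group along the sequence dimension (group_dim=2), so each new
# block of 64 tokens can be quantized incrementally as it is generated.
kv = np.zeros((2, 4, 128, 8))              # (layers, heads, seq, head_dim)
mn_kv, mx_kv = group_minmax(kv, group_dim=2)
```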
Benchmark results show compression improves end-to-end throughput:
- OPT-6.7B: 25.26 -> 29.12 tok/s (+15%)
- OPT-30B: 7.32 -> 8.38 tok/s (+14%)
- OPT-175B: 0.69 -> 1.12 tok/s (+62%)
## Code Evidence
`CompressionConfig` dataclass from `flexllmgen/compression.py:11-18`:

```python
@dataclasses.dataclass
class CompressionConfig:
    """Group-wise quantization."""
    num_bits: int
    group_size: int
    group_dim: int
    symmetric: bool
    enabled: bool = True
```
Assertion that only 4-bit quantization of float16 inputs is supported, in `flexllmgen/compression.py:34`:

```python
assert comp_config.num_bits == 4 and dtype == np.float16
```
Default compression configuration in `flexllmgen/apps/completion.py:35-41`:

```python
comp_weight_config=CompressionConfig(
    num_bits=4, group_size=64,
    group_dim=0, symmetric=False),
comp_cache_config=CompressionConfig(
    num_bits=4, group_size=64,
    group_dim=2, symmetric=False))
```