Heuristic: Predibase LoRAX Quantization Backend Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs |
| Last Updated | 2026-02-08 02:30 GMT |
Overview
Decision framework for selecting the optimal quantization backend (GPTQ, AWQ, EETQ, BitsAndBytes, HQQ, FP8) based on GPU architecture, model format, and quality-speed trade-offs.
Description
LoRAX supports six quantization methods, each with different trade-offs between inference speed, model quality, and setup complexity. The selection is gated by GPU compute capability and whether the model was pre-quantized or will be quantized at runtime.
The quantization methods fall into two categories:
- Pre-quantized (weight-only): GPTQ and AWQ require pre-quantized model weights. They give the fastest inference but need an upfront quantization step.
- Runtime JIT: EETQ, BitsAndBytes, HQQ, and FP8 quantize weights at model load time. No pre-quantization is needed, but model loading is slower.
Within GPTQ, an additional backend selection occurs based on GPU architecture: ExLLaMA V2 (preferred, SM 8.0+), ExLLaMA V1 (fallback), or generic Triton QuantLinear (oldest GPUs).
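The GPTQ backend hierarchy above can be sketched as a small selection function. The function name, signature, and return values here are illustrative, not LoRAX's actual API; the actual gating logic appears in the code evidence in the Reasoning section below.

```python
def select_gptq_backend(major: int, is_rocm: bool,
                        v2_available: bool, v1_available: bool) -> str:
    """Illustrative sketch of the GPTQ backend hierarchy (not LoRAX code).

    ExLLaMA kernels are only eligible on SM 8.0+ GPUs or under ROCm;
    within that, V2 is preferred over V1, and the generic Triton
    QuantLinear is the universal fallback for older hardware.
    """
    can_exllama = major >= 8 or is_rocm
    if can_exllama and v2_available:
        return "exllama_v2"
    if can_exllama and v1_available:
        return "exllama_v1"
    return "triton_quantlinear"
```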
Usage
This heuristic applies when choosing the `--quantize` flag at server startup. The choice is permanent for the server lifetime. Consider:
- Already have a GPTQ/AWQ quantized model? Use that method.
- Want fastest inference with minimal quality loss? AWQ (4-bit) or EETQ (8-bit).
- Want to quantize any model on-the-fly? EETQ (8-bit, fast) or HQQ (2-4 bit, experimental).
- Have Hopper/Ada GPU? FP8 gives hardware-native 8-bit with minimal quality loss.
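For concreteness, a server launch might look like the following. The launcher binary name and model ID are illustrative assumptions; only the `--quantize` flag itself comes from the text above.

```shell
# Hypothetical launch commands -- binary name and model IDs are placeholders.
# Pre-quantized AWQ checkpoint: fastest 4-bit path.
lorax-launcher --model-id <awq-quantized-model> --quantize awq

# Any FP16 checkpoint, quantized to 8-bit at load time.
lorax-launcher --model-id <fp16-model> --quantize eetq
```

Because the choice is fixed for the server's lifetime, switching methods requires a restart.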
The Insight (Rule of Thumb)
- Action: Choose quantization based on this priority:
  - AWQ if a pre-quantized model is available (fastest 4-bit, best quality)
  - GPTQ if AWQ is not available but a GPTQ model exists (fast 4-bit)
  - EETQ for runtime 8-bit quantization (drop-in replacement for BitsAndBytes, faster)
  - FP8 if on a Hopper/Ada GPU (hardware-native, minimal quality loss)
  - HQQ for aggressive 2-3 bit quantization (experimental)
  - BitsAndBytes only as a last resort (deprecated in LoRAX)
- Value: Relative to FP16 weights, 4-bit methods cut weight memory to roughly one quarter and 8-bit methods to roughly one half; total VRAM savings are somewhat smaller once activations and KV cache are counted.
- Trade-off: Lower precision = less VRAM + faster inference, but increased perplexity. Pre-quantized methods have better quality than runtime methods at same bit-width.
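The priority order above can be encoded as a small decision helper. This is an illustrative sketch, not part of LoRAX; it assumes FP8 is preferred over EETQ whenever the hardware supports it, and omits the experimental HQQ and deprecated BitsAndBytes branches.

```python
from typing import Optional


def choose_quantization(pre_quantized: Optional[str],
                        compute_capability: tuple) -> str:
    """Sketch of the rule-of-thumb priority (illustrative, not LoRAX code)."""
    major, minor = compute_capability
    # Pre-quantized checkpoints win outright: AWQ first, then GPTQ.
    if pre_quantized == "awq":
        return "awq"
    if pre_quantized == "gptq":
        return "gptq"
    # Hardware-native FP8 on Hopper (SM 9.0+) or Ada Lovelace (SM 8.9).
    if major >= 9 or (major == 8 and minor >= 9):
        return "fp8"
    # Otherwise quantize at runtime with EETQ (BitsAndBytes is deprecated).
    return "eetq"
```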
Reasoning
The GPTQ backend selection logic reveals the GPU-gating hierarchy:
- ExLLaMA V2 (SM 8.0+ or ROCm): Fastest GPTQ implementation. Uses optimized CUDA kernels with per-channel quantization.
- ExLLaMA V1: Fallback when V2 import fails. Older kernel implementation.
- Triton QuantLinear: Generic fallback for all GPUs. Uses Triton JIT compilation. Slower but universally compatible.
BitsAndBytes is explicitly deprecated in LoRAX: a `logger.warning` recommends EETQ as a "drop-in replacement with better performance". This is because EETQ ships optimized CUDA kernels, while BitsAndBytes relies on generic CUDA implementations.
FP8 requires SM 8.9+ (Ada Lovelace) or SM 9.0+ (Hopper) because these architectures have hardware FP8 tensor cores. On older GPUs, FP8 would require software emulation and provide no benefit.
Code evidence from `server/lorax_server/layers/gptq/__init__.py:1-45`:

```python
major, _minor = torch.cuda.get_device_capability()
CAN_EXLLAMA = major >= 8 or SYSTEM == "rocm"
V2 = os.getenv("EXLLAMA_VERSION", "2") == "2"

if os.getenv("DISABLE_EXLLAMA") == "True":
    HAS_EXLLAMA = False
elif CAN_EXLLAMA:
    try:
        if V2:
            from lorax_server.layers.gptq.exllamav2 import QuantLinear
            HAS_EXLLAMA = "2"
        else:
            from lorax_server.layers.gptq.exllama import Ex4bitLinear
            HAS_EXLLAMA = "1"
    except ImportError:
        pass
```
BitsAndBytes deprecation from `server/lorax_server/layers/bnb.py:7-12`:

```python
def warn_deprecate_bnb():
    logger.warning(
        "Bitsandbytes 8bit is deprecated, using `eetq` is a drop-in replacement "
        "with better performance"
    )
```
FP8 gating from `server/lorax_server/utils/torch_utils.py:17-22`:

```python
def is_fp8_supported():
    return (
        torch.cuda.is_available()
        and (torch.cuda.get_device_capability()[0] >= 9)
        or (torch.cuda.get_device_capability()[0] == 8
            and torch.cuda.get_device_capability()[1] >= 9)
    )
```
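Note that Python's `and` binds tighter than `or`, so the expression above groups as `(is_available and major >= 9) or (major == 8 and minor >= 9)`: the Ada branch is reached without the `torch.cuda.is_available()` guard. A sketch of the intended gating with explicit parentheses, written as a pure function for clarity (illustrative helper, not LoRAX code):

```python
def is_fp8_supported_explicit(cuda_available: bool, major: int, minor: int) -> bool:
    """FP8 needs CUDA plus SM 9.0+ (Hopper) or SM 8.9 (Ada Lovelace).

    Explicit parentheses make the grouping unambiguous: the compute-
    capability check is only meaningful when CUDA is available at all.
    """
    return cuda_available and (major >= 9 or (major == 8 and minor >= 9))
```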