
Heuristic:Predibase Lorax Quantization Backend Selection

From Leeroopedia




Knowledge Sources
Domains Optimization, LLMs
Last Updated 2026-02-08 02:30 GMT

Overview

Decision framework for selecting the optimal quantization backend (GPTQ, AWQ, EETQ, BitsAndBytes, HQQ, FP8) based on GPU architecture, model format, and quality-speed trade-offs.

Description

LoRAX supports six quantization methods, each with different trade-offs between inference speed, model quality, and setup complexity. The selection is gated by GPU compute capability and whether the model was pre-quantized or will be quantized at runtime.

The quantization methods fall into two categories:

  • Pre-quantized (weight-only): GPTQ and AWQ require pre-quantized model weights. They give the fastest inference but need an upfront quantization step.
  • Runtime JIT: EETQ, BitsAndBytes, HQQ, and FP8 quantize weights at model load time. No pre-quantization is needed, but model loading is slower.

Within GPTQ, an additional backend selection occurs based on GPU architecture: ExLLaMA V2 (preferred, SM 8.0+), ExLLaMA V1 (fallback), or generic Triton QuantLinear (oldest GPUs).
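The split can be summarized in a small lookup table (a sketch for orientation; the category labels are descriptive, not LoRAX API values):

```python
# Category and bit-width of each --quantize option, per the split above.
# "pre-quantized" methods need converted weights; "runtime" methods
# quantize at load time.
QUANT_METHODS = {
    "gptq":         {"category": "pre-quantized", "bits": 4},
    "awq":          {"category": "pre-quantized", "bits": 4},
    "eetq":         {"category": "runtime",       "bits": 8},
    "bitsandbytes": {"category": "runtime",       "bits": 8},
    "hqq":          {"category": "runtime",       "bits": "2-4"},
    "fp8":          {"category": "runtime",       "bits": 8},
}

def runtime_methods():
    """Methods that can quantize an unconverted FP16 checkpoint."""
    return sorted(k for k, v in QUANT_METHODS.items()
                  if v["category"] == "runtime")
```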

Usage

This heuristic applies when choosing the `--quantize` flag at server startup. The choice is fixed for the server's lifetime; switching methods requires a restart. Consider:

  • Already have a GPTQ/AWQ quantized model? Use that method.
  • Want fastest inference with minimal quality loss? AWQ (4-bit) or EETQ (8-bit).
  • Want to quantize any model on-the-fly? EETQ (8-bit, fast) or HQQ (2-4 bit, experimental).
  • Have Hopper/Ada GPU? FP8 gives hardware-native 8-bit with minimal quality loss.
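As a concrete sketch of the scenarios above (the launcher invocation and model IDs are illustrative assumptions, not taken from this page):

```shell
# Pre-quantized AWQ checkpoint: fastest 4-bit path
lorax-launcher --model-id TheBloke/Mistral-7B-v0.1-AWQ --quantize awq

# Unquantized FP16 checkpoint, quantized to 8-bit at load time
lorax-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq
```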

The Insight (Rule of Thumb)

  • Action: Choose quantization based on this priority:
    1. AWQ if pre-quantized model available (fastest 4-bit, best quality)
    2. GPTQ if AWQ not available but GPTQ model exists (fast 4-bit)
    3. EETQ for runtime 8-bit quantization (drop-in replacement for BitsAndBytes, faster)
    4. FP8 if on Hopper/Ada GPU (hardware-native, minimal quality loss)
    5. HQQ for aggressive 2-3 bit quantization (experimental)
    6. BitsAndBytes only as last resort (deprecated in LoRAX)
  • Value: 4-bit methods cut weight VRAM to roughly a quarter of FP16; 8-bit methods roughly halve it.
  • Trade-off: Lower precision = less VRAM + faster inference, but increased perplexity. Pre-quantized methods have better quality than runtime methods at same bit-width.
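The priority order above can be sketched as a selection helper (a non-authoritative illustration; the function and its flags are assumptions, not part of LoRAX):

```python
def select_quantize(has_awq=False, has_gptq=False,
                    compute_capability=(8, 0), aggressive_bits=False):
    """Return a --quantize value following the rule-of-thumb priority."""
    if has_awq:
        return "awq"      # 1. pre-quantized 4-bit, best quality
    if has_gptq:
        return "gptq"     # 2. pre-quantized 4-bit fallback
    if aggressive_bits:
        return "hqq"      # 5. experimental 2-3 bit, only if VRAM demands it
    # 3./4. runtime 8-bit: FP8 where the hardware supports it
    # (SM >= 8.9, i.e. Ada/Hopper), otherwise EETQ.
    if compute_capability >= (8, 9):
        return "fp8"
    return "eetq"

def weight_vram_gb(n_params_billion, bits):
    """Approximate weight memory in GB: bits/8 bytes per parameter."""
    return n_params_billion * bits / 8

# A 7B model: ~14 GB of weights at FP16, ~7 GB at 8-bit, ~3.5 GB at 4-bit.
```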

Reasoning

GPTQ backend selection logic reveals the GPU-gating hierarchy:

  • ExLLaMA V2 (SM 8.0+ or ROCm): Fastest GPTQ implementation. Uses optimized CUDA kernels with per-channel quantization.
  • ExLLaMA V1: Fallback when V2 import fails. Older kernel implementation.
  • Triton QuantLinear: Generic fallback for all GPUs. Uses Triton JIT compilation. Slower but universally compatible.

BitsAndBytes is explicitly deprecated in LoRAX with a logger.warning recommending EETQ as a "drop-in replacement with better performance". This is because EETQ uses optimized CUDA kernels while BitsAndBytes uses generic CUDA implementations.

FP8 requires SM 8.9+ (Ada Lovelace) or SM 9.0+ (Hopper) because these architectures have hardware FP8 tensor cores. On older GPUs, FP8 would require software emulation and provide no benefit.

Code evidence from `server/lorax_server/layers/gptq/__init__.py:1-45`:

major, _minor = torch.cuda.get_device_capability()
CAN_EXLLAMA = major >= 8 or SYSTEM == "rocm"
V2 = os.getenv("EXLLAMA_VERSION", "2") == "2"

if os.getenv("DISABLE_EXLLAMA") == "True":
    HAS_EXLLAMA = False
elif CAN_EXLLAMA:
    try:
        if V2:
            from lorax_server.layers.gptq.exllamav2 import QuantLinear
            HAS_EXLLAMA = "2"
        else:
            from lorax_server.layers.gptq.exllama import Ex4bitLinear
            HAS_EXLLAMA = "1"
    except ImportError:
        pass

BitsAndBytes deprecation from `server/lorax_server/layers/bnb.py:7-12`:

def warn_deprecate_bnb():
    logger.warning(
        "Bitsandbytes 8bit is deprecated, using `eetq` is a drop-in replacement "
        "with better performance"
    )

FP8 gating from `server/lorax_server/utils/torch_utils.py:17-22`:

def is_fp8_supported():
    return (
        torch.cuda.is_available()
        and (torch.cuda.get_device_capability()[0] >= 9)
        or (torch.cuda.get_device_capability()[0] == 8
            and torch.cuda.get_device_capability()[1] >= 9)
    )
