Environment:Hiyouga LLaMA Factory Quantization Dependencies

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Quantization
Last Updated: 2026-02-06 20:00 GMT

Overview

Optional quantization dependencies for loading and training quantized models via BitsAndBytes, GPTQ, AWQ, AQLM, HQQ, and EETQ backends.

Description

LLaMA Factory supports multiple quantization backends, both for loading pre-quantized models (post-training quantization, PTQ) and for quantizing a model on the fly at load time (OTF). Each backend has its own package requirements and hardware constraints. BitsAndBytes is the most common choice for QLoRA training, while GPTQ and AWQ are used for inference with pre-quantized models. Quantization configuration is centralized in src/llamafactory/model/model_utils/quantization.py.

Usage

Use this environment when you need to load quantized models (4-bit or 8-bit) or export models with GPTQ quantization. It is required for any QLoRA fine-tuning workflow and for inference with pre-quantized models.

System Requirements

Category | Requirement | Notes
Hardware | NVIDIA GPU | Required for BitsAndBytes, GPTQ, AWQ, EETQ
VRAM | >= 8GB | 4-bit quantization reduces weight memory by ~75% relative to 16-bit weights
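
The ~75% figure follows from storing each weight in 4 bits instead of 16. The sketch below works through that arithmetic under the assumption that weights dominate memory use; it ignores quantization constants, activations, and optimizer state, so real usage will be higher. The helper names and the 7B parameter count are illustrative and are not part of LLaMA Factory.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def reduction_vs_fp16(bits_per_weight: int) -> float:
    """Fractional saving in weight memory relative to 16-bit storage."""
    return 1 - bits_per_weight / 16


if __name__ == "__main__":
    params = 7e9  # illustrative 7B-parameter model (assumption)
    for bits in (16, 8, 4):
        print(
            f"{bits:>2}-bit: {weight_memory_gb(params, bits):.1f} GB, "
            f"saving {reduction_vs_fp16(bits):.0%}"
        )
```

For the assumed 7B model this gives about 14 GB of weights at 16-bit and about 3.5 GB at 4-bit, which is why a card in the 8GB range can be workable for 4-bit loading.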

Dependencies

BitsAndBytes (4-bit / 8-bit)

  • bitsandbytes >= 0.39.0 (4-bit quantization)
  • bitsandbytes >= 0.37.0 (8-bit quantization)
  • bitsandbytes >= 0.43.0 (FSDP+QLoRA or auto device map)
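
The three minimums above depend on the bit width and on whether sharded training or an automatic device map is in use. A minimal sketch of that selection logic follows, assuming the rules are exactly those listed on this page; `required_bitsandbytes` is a hypothetical helper written for illustration and is not a LLaMA Factory function.

```python
def required_bitsandbytes(quantization_bit: int, sharded_or_auto_map: bool = False) -> str:
    """Return the minimum bitsandbytes requirement string for a setup.

    sharded_or_auto_map should be True when DeepSpeed ZeRO-3, FSDP, or an
    "auto" quantization device map is in use.
    """
    if quantization_bit not in (4, 8):
        raise ValueError("Bitsandbytes only accepts 4-bit or 8-bit quantization")
    if sharded_or_auto_map:
        # Sharded training and auto device maps are limited to 4-bit.
        if quantization_bit != 4:
            raise ValueError(
                "Only 4-bit quantized model can use fsdp+qlora or auto device map."
            )
        return "bitsandbytes>=0.43.0"
    return "bitsandbytes>=0.39.0" if quantization_bit == 4 else "bitsandbytes>=0.37.0"


if __name__ == "__main__":
    print(required_bitsandbytes(8))
    print(required_bitsandbytes(4))
    print(required_bitsandbytes(4, sharded_or_auto_map=True))
```

Installing the highest minimum (0.43.0) satisfies all three cases, which is what the Quick Install command does.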

GPTQ

  • gptqmodel >= 2.0.0
  • optimum >= 1.24.0 (for export)

AWQ

  • autoawq

AQLM (2-bit)

  • aqlm[gpu] >= 1.1.0

HQQ (1-8 bit)

  • hqq

EETQ (8-bit)

  • eetq

Credentials

No additional credentials required beyond the core environment.

Quick Install

# BitsAndBytes (most common for QLoRA)
pip install "bitsandbytes>=0.43.0"

# GPTQ quantization export
pip install "gptqmodel>=2.0.0" "optimum>=1.24.0"

# AWQ
pip install autoawq --no-build-isolation

# AQLM
pip install "aqlm[gpu]>=1.1.0"

# HQQ
pip install hqq

# EETQ
pip install eetq
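
After installing, it can help to confirm which optional backends are present in the active environment. The sketch below uses only the standard library's `importlib.metadata`. It assumes the distribution names match those in the install commands above, and it reports versions without comparing them against the minimums. `installed_versions` is an illustrative helper, not part of LLaMA Factory.

```python
from importlib import metadata

# Distribution names as they appear in the install commands above.
OPTIONAL_BACKENDS = ["bitsandbytes", "gptqmodel", "optimum", "autoawq", "aqlm", "hqq", "eetq"]


def installed_versions(packages=OPTIONAL_BACKENDS) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:<14} {version or 'not installed'}")
```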

Code Evidence

BitsAndBytes version checks from src/llamafactory/model/model_utils/quantization.py:167-193:

if model_args.quantization_method == QuantizationMethod.BNB:
    if model_args.quantization_bit == 8:
        check_version("bitsandbytes>=0.37.0", mandatory=True)
        init_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
    elif model_args.quantization_bit == 4:
        check_version("bitsandbytes>=0.39.0", mandatory=True)
        init_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=model_args.compute_dtype,
            bnb_4bit_use_double_quant=model_args.double_quantization,
            bnb_4bit_quant_type=model_args.quantization_type,
            bnb_4bit_quant_storage=model_args.compute_dtype,  # crucial for fsdp+qlora
        )

FSDP+QLoRA restriction from src/llamafactory/model/model_utils/quantization.py:186-190:

if is_deepspeed_zero3_enabled() or is_fsdp_enabled() or model_args.quantization_device_map == "auto":
    if model_args.quantization_bit != 4:
        raise ValueError("Only 4-bit quantized model can use fsdp+qlora or auto device map.")
    check_version("bitsandbytes>=0.43.0", mandatory=True)

PTQ incompatibility from src/llamafactory/model/model_utils/quantization.py:97-101:

if quant_method not in (QuantizationMethod.MXFP4, QuantizationMethod.FP8) and (
    is_deepspeed_zero3_enabled() or is_fsdp_enabled()
):
    raise ValueError("DeepSpeed ZeRO-3 or FSDP is incompatible with PTQ-quantized models.")

Common Errors

Error Message | Cause | Solution
Bitsandbytes only accepts 4-bit or 8-bit quantization | Invalid quantization_bit value | Set quantization_bit to 4 or 8
Only 4-bit quantized model can use fsdp+qlora or auto device map | 8-bit quantization with FSDP, DeepSpeed ZeRO-3, or quantization_device_map set to auto | Use 4-bit quantization in these setups
DeepSpeed ZeRO-3 or FSDP is incompatible with PTQ-quantized models | Loading a pre-quantized (e.g., GPTQ/AWQ) model with ZeRO-3 or FSDP | Use a non-quantized model, or train without ZeRO-3/FSDP (e.g., on a single GPU)
HQQ quantization is incompatible with DeepSpeed ZeRO-3 or FSDP | HQQ with distributed training | Switch to BitsAndBytes 4-bit for distributed training
EETQ only accepts 8-bit quantization | Wrong bit setting for EETQ | Set quantization_bit to 8 for EETQ
Quantization is only compatible with the LoRA or OFT method | Full fine-tuning with quantization | Set finetuning_type to lora or oft
Cannot resize embedding layers of a quantized model | resize_vocab enabled with quantization | Disable resize_vocab for quantized models
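
The rows of the table above can be read as a set of pre-flight rules. The sketch below encodes them as a standalone check that returns every matching error message for a given setup. It is a simplified illustration that covers only the conditions listed on this page. `QuantSetup`, its field names, the lowercase method strings, and `preflight` are all hypothetical and do not mirror LLaMA Factory's actual argument classes or control flow.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuantSetup:
    """Hypothetical, simplified view of a run's settings (not a LLaMA Factory class)."""
    quantization_bit: int
    method: str = "bnb"               # on-the-fly backend: "bnb", "hqq" or "eetq"
    finetuning_type: str = "lora"
    resize_vocab: bool = False
    sharded: bool = False             # DeepSpeed ZeRO-3 or FSDP enabled
    auto_device_map: bool = False     # quantization_device_map == "auto"
    ptq_method: Optional[str] = None  # e.g. "gptq" or "awq" for a pre-quantized checkpoint


def preflight(cfg: QuantSetup) -> List[str]:
    """Return the error messages from the table that this setup would trigger."""
    problems = []
    if cfg.finetuning_type not in ("lora", "oft"):
        problems.append("Quantization is only compatible with the LoRA or OFT method")
    if cfg.resize_vocab:
        problems.append("Cannot resize embedding layers of a quantized model")

    if cfg.ptq_method is not None:
        # Pre-quantized checkpoint: MXFP4 and FP8 are exempt from the sharding rule.
        if cfg.ptq_method not in ("mxfp4", "fp8") and cfg.sharded:
            problems.append("DeepSpeed ZeRO-3 or FSDP is incompatible with PTQ-quantized models")
        return problems

    if cfg.method == "bnb":
        if cfg.quantization_bit not in (4, 8):
            problems.append("Bitsandbytes only accepts 4-bit or 8-bit quantization")
        elif (cfg.sharded or cfg.auto_device_map) and cfg.quantization_bit != 4:
            problems.append("Only 4-bit quantized model can use fsdp+qlora or auto device map")
    elif cfg.method == "hqq":
        if cfg.sharded:
            problems.append("HQQ quantization is incompatible with DeepSpeed ZeRO-3 or FSDP")
    elif cfg.method == "eetq":
        if cfg.quantization_bit != 8:
            problems.append("EETQ only accepts 8-bit quantization")
    return problems


if __name__ == "__main__":
    for message in preflight(QuantSetup(quantization_bit=8, sharded=True)):
        print(message)
```

Returning a list rather than raising on the first failure is a design choice for this sketch: it lets a configuration be checked once and all problems reported together.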

Compatibility Notes

  • BitsAndBytes 4-bit: Only quantization method compatible with FSDP and DeepSpeed ZeRO-3 (requires bitsandbytes >= 0.43.0). The bnb_4bit_quant_storage parameter is crucial for FSDP+QLoRA.
  • GPTQ: Exllama kernel is disabled by default. Force fp16 compute dtype during export. ChatGLM models not supported.
  • HQQ: Uses ATEN kernel (axis=0) for performance. Incompatible with all distributed training.
  • EETQ: Only supports 8-bit quantization. Incompatible with distributed training.
  • PTQ Models: MXFP4 and FP8 pre-quantized models are dequantized on load, making them compatible with distributed training.
