Environment:Hiyouga LLaMA Factory Quantization Dependencies

Knowledge Sources	LLaMA-Factory bitsandbytes
Domains	Infrastructure, Quantization
Last Updated	2026-02-06 20:00 GMT

Overview

Optional quantization dependencies for loading and training quantized models via BitsAndBytes, GPTQ, AWQ, AQLM, HQQ, and EETQ backends.

Description

LLaMA Factory supports several quantization backends, both for loading pre-quantized checkpoints (post-training quantization, PTQ) and for quantizing a model on the fly at load time (OTF). Each backend has its own package requirements and hardware constraints. BitsAndBytes is the most common choice for QLoRA training, while GPTQ and AWQ are typically used for inference with pre-quantized models. Quantization is configured centrally in src/llamafactory/model/model_utils/quantization.py.

Usage

Use this environment when you need to load quantized models (4-bit or 8-bit) or export models with GPTQ quantization. It is required for any QLoRA fine-tuning workflow and for inference with pre-quantized models.

System Requirements

Category	Requirement	Notes
Hardware	NVIDIA GPU	Required for BitsAndBytes, GPTQ, AWQ, EETQ
VRAM	>= 8GB	4-bit quantization cuts weight memory by ~75% relative to 16-bit weights
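
The ~75% figure follows from the bit widths alone: 4 bits per weight is a quarter of 16. The sketch below works through that arithmetic for a hypothetical 7B-parameter model. The function name and the parameter count are illustrative assumptions, and the estimate covers weights only; it ignores quantization constants, activations, LoRA adapters, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GiB (hypothetical helper)."""
    return num_params * bits_per_weight / 8 / 1024**3


params = 7e9  # assumed model size, for illustration only
fp16 = weight_memory_gib(params, 16)
int8 = weight_memory_gib(params, 8)
int4 = weight_memory_gib(params, 4)

# Fraction of weight memory saved relative to 16-bit storage.
saving_4bit = 1 - int4 / fp16
saving_8bit = 1 - int8 / fp16

print(f"16-bit: {fp16:.1f} GiB, 8-bit: {int8:.1f} GiB, 4-bit: {int4:.1f} GiB")
print(f"4-bit saving vs 16-bit: {saving_4bit:.0%}")
```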

Dependencies

BitsAndBytes (4-bit / 8-bit)

bitsandbytes >= 0.39.0 (4-bit quantization)
bitsandbytes >= 0.37.0 (8-bit quantization)
bitsandbytes >= 0.43.0 (FSDP+QLoRA or auto device map)
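
The three version floors above can be read as a small decision rule. The following is a minimal sketch of that rule, not LLaMA Factory's own code: `required_bnb_version`, `version_tuple`, and `bnb_is_sufficient` are hypothetical helpers, and the version parsing is deliberately naive (it compares only the leading numeric components).

```python
import re


def required_bnb_version(bits: int, sharded_or_auto_map: bool = False) -> str:
    """Return the minimum bitsandbytes version listed above (hypothetical helper)."""
    if bits not in (4, 8):
        raise ValueError("Bitsandbytes only accepts 4-bit or 8-bit quantization.")
    if sharded_or_auto_map:
        # Covers DeepSpeed ZeRO-3, FSDP, or an "auto" quantization device map.
        if bits != 4:
            raise ValueError("Only 4-bit quantized model can use fsdp+qlora or auto device map.")
        return "0.43.0"
    return "0.39.0" if bits == 4 else "0.37.0"


def version_tuple(v: str) -> tuple:
    """Naive parse: leading digits of the first three dot-separated pieces."""
    parts = []
    for piece in v.split(".")[:3]:
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return tuple(parts)


def bnb_is_sufficient(installed: str, bits: int, sharded_or_auto_map: bool = False) -> bool:
    """Compare an installed version string against the required floor."""
    needed = required_bnb_version(bits, sharded_or_auto_map)
    return version_tuple(installed) >= version_tuple(needed)
```

For example, `bnb_is_sufficient("0.42.0", 4, sharded_or_auto_map=True)` is false because the sharded case needs 0.43.0 or later.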

GPTQ

gptqmodel >= 2.0.0
optimum >= 1.24.0 (for export)

AWQ

autoawq

AQLM (2-bit)

aqlm[gpu] >= 1.1.0

HQQ (1-8 bit)

hqq

EETQ (8-bit)

eetq

Credentials

No additional credentials required beyond the core environment.

Quick Install

# BitsAndBytes (most common for QLoRA)
pip install "bitsandbytes>=0.43.0"

# GPTQ quantization export
pip install "gptqmodel>=2.0.0" "optimum>=1.24.0"

# AWQ
pip install autoawq --no-build-isolation

# AQLM
pip install "aqlm[gpu]>=1.1.0"

# HQQ
pip install hqq

# EETQ
pip install eetq
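
After installing, it can help to confirm which backends are actually present. The sketch below uses only the standard library and looks up the distribution names from the install commands above. `BACKEND_PACKAGES` and `installed_versions` are illustrative names, not part of LLaMA Factory, and the check reports only whether a package is installed, not whether its CUDA kernels load.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the install commands above.
BACKEND_PACKAGES = {
    "BitsAndBytes": ["bitsandbytes"],
    "GPTQ": ["gptqmodel", "optimum"],
    "AWQ": ["autoawq"],
    "AQLM": ["aqlm"],
    "HQQ": ["hqq"],
    "EETQ": ["eetq"],
}


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for backend, packages in BACKEND_PACKAGES.items():
    status = ", ".join(
        f"{name}=={ver}" if ver else f"{name} (missing)"
        for name, ver in installed_versions(packages).items()
    )
    print(f"{backend}: {status}")
```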

Code Evidence

BitsAndBytes version checks from src/llamafactory/model/model_utils/quantization.py:167-193:

if model_args.quantization_method == QuantizationMethod.BNB:
    if model_args.quantization_bit == 8:
        check_version("bitsandbytes>=0.37.0", mandatory=True)
        init_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
    elif model_args.quantization_bit == 4:
        check_version("bitsandbytes>=0.39.0", mandatory=True)
        init_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=model_args.compute_dtype,
            bnb_4bit_use_double_quant=model_args.double_quantization,
            bnb_4bit_quant_type=model_args.quantization_type,
            bnb_4bit_quant_storage=model_args.compute_dtype,  # crucial for fsdp+qlora
        )

DeepSpeed ZeRO-3 / FSDP / auto device map restriction from src/llamafactory/model/model_utils/quantization.py:186-190:

if is_deepspeed_zero3_enabled() or is_fsdp_enabled() or model_args.quantization_device_map == "auto":
    if model_args.quantization_bit != 4:
        raise ValueError("Only 4-bit quantized model can use fsdp+qlora or auto device map.")
    check_version("bitsandbytes>=0.43.0", mandatory=True)

PTQ incompatibility from src/llamafactory/model/model_utils/quantization.py:97-101:

if quant_method not in (QuantizationMethod.MXFP4, QuantizationMethod.FP8) and (
    is_deepspeed_zero3_enabled() or is_fsdp_enabled()
):
    raise ValueError("DeepSpeed ZeRO-3 or FSDP is incompatible with PTQ-quantized models.")

Common Errors

Error Message	Cause	Solution
`Bitsandbytes only accepts 4-bit or 8-bit quantization`	Invalid quantization_bit value	Set `quantization_bit` to 4 or 8
`Only 4-bit quantized model can use fsdp+qlora or auto device map`	8-bit BitsAndBytes with DeepSpeed ZeRO-3, FSDP, or `quantization_device_map=auto`	Use 4-bit quantization in these setups
`DeepSpeed ZeRO-3 or FSDP is incompatible with PTQ-quantized models`	Using a GPTQ/AWQ model with ZeRO-3 or FSDP	Use a non-quantized model, or run without ZeRO-3/FSDP (e.g. on a single GPU)
`HQQ quantization is incompatible with DeepSpeed ZeRO-3 or FSDP`	HQQ with distributed training	Switch to BitsAndBytes 4-bit for distributed
`EETQ only accepts 8-bit quantization`	Wrong bit setting for EETQ	Set `quantization_bit=8` for EETQ
`Quantization is only compatible with the LoRA or OFT method`	Full fine-tuning with quantization	Use `finetuning_type=lora` or `oft`
`Cannot resize embedding layers of a quantized model`	resize_vocab with quantization	Disable `resize_vocab` for quantized models
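
Several of the rows above can be caught before launching a run. The following is a rough pre-flight sketch under stated assumptions: `QuantPlan` and `find_conflicts` are hypothetical names, the lowercase method strings are assumed spellings, and the order of checks is arbitrary rather than the order LLaMA Factory uses. It reuses the documented error messages only as labels and skips the BitsAndBytes bit-width rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuantPlan:
    """Hypothetical summary of a quantized run; not a LLaMA Factory class."""
    method: str                    # assumed spellings: "bnb", "gptq", "awq", "hqq", "eetq", ...
    bits: Optional[int] = None
    pre_quantized: bool = False    # True for PTQ checkpoints such as GPTQ/AWQ
    sharded: bool = False          # DeepSpeed ZeRO-3 or FSDP enabled
    finetuning_type: str = "lora"
    resize_vocab: bool = False


def find_conflicts(plan: QuantPlan) -> List[str]:
    """Collect the documented error messages that this plan would likely trigger."""
    problems = []
    if plan.finetuning_type not in ("lora", "oft"):
        problems.append("Quantization is only compatible with the LoRA or OFT method")
    if plan.resize_vocab:
        problems.append("Cannot resize embedding layers of a quantized model")
    # MXFP4 and FP8 checkpoints are exempt from the PTQ restriction.
    if plan.pre_quantized and plan.sharded and plan.method not in ("mxfp4", "fp8"):
        problems.append("DeepSpeed ZeRO-3 or FSDP is incompatible with PTQ-quantized models")
    if plan.method == "hqq" and plan.sharded:
        problems.append("HQQ quantization is incompatible with DeepSpeed ZeRO-3 or FSDP")
    if plan.method == "eetq" and plan.bits != 8:
        problems.append("EETQ only accepts 8-bit quantization")
    return problems
```

For instance, `find_conflicts(QuantPlan("gptq", pre_quantized=True, sharded=True))` returns the single PTQ message, while a plain 4-bit BitsAndBytes LoRA plan returns an empty list.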

Compatibility Notes

BitsAndBytes 4-bit: The only on-the-fly quantization method compatible with FSDP and DeepSpeed ZeRO-3 (requires bitsandbytes >= 0.43.0). The bnb_4bit_quant_storage parameter, set to the compute dtype, is crucial for FSDP+QLoRA.
GPTQ: The Exllama kernel is disabled by default. The compute dtype is forced to fp16 during export. ChatGLM models are not supported.
HQQ: Uses the ATEN kernel (axis=0) for performance. Incompatible with DeepSpeed ZeRO-3 and FSDP.
EETQ: Supports 8-bit quantization only. Incompatible with DeepSpeed ZeRO-3 and FSDP.
PTQ Models: MXFP4 and FP8 pre-quantized models are dequantized on load, making them compatible with distributed training.
