Environment:Hiyouga LLaMA Factory Quantization Dependencies
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Quantization |
| Last Updated | 2026-02-06 20:00 GMT |
Overview
Optional quantization dependencies for loading and training quantized models via BitsAndBytes, GPTQ, AWQ, AQLM, HQQ, and EETQ backends.
Description
LLaMA Factory supports multiple quantization backends, both for loading pre-quantized models (post-training quantization, PTQ) and for quantizing a model on the fly (OTF) at load time. Each backend has its own package requirements and hardware constraints. BitsAndBytes is the most common choice for QLoRA training, while GPTQ and AWQ are used for inference with pre-quantized models. The quantization configuration is centrally managed in model_utils/quantization.py.
Usage
Use this environment when you need to load quantized models (4-bit or 8-bit) or export models with GPTQ quantization. It is required for any QLoRA fine-tuning workflow and for inference with pre-quantized models.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU | Required for BitsAndBytes, GPTQ, AWQ, EETQ |
| VRAM | >= 8GB | 4-bit quantization reduces weight memory by ~75% relative to 16-bit weights |
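The ~75% figure follows from the bit widths alone: 4-bit storage needs a quarter of the space of 16-bit storage. The sketch below works through that arithmetic for a hypothetical 7B-parameter model. It counts weights only and ignores quantization constants, activations, optimizer state, and the KV cache, so real VRAM usage will be higher. The function names are illustrative and not part of LLaMA Factory.

```python
def weight_memory_gib(num_params: float, bits: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits / 8 / 1024**3


def reduction_vs_16bit(bits: int) -> float:
    """Fraction of weight memory saved relative to 16-bit storage."""
    return 1 - bits / 16


params = 7e9  # assumption: a 7B-parameter model, chosen only for illustration
for bits in (16, 8, 4):
    size = weight_memory_gib(params, bits)
    saved = reduction_vs_16bit(bits)
    print(f"{bits:>2}-bit weights: {size:5.2f} GiB ({saved:.0%} smaller than 16-bit)")
```

Under these assumptions the 16-bit weights come to roughly 13 GiB and the 4-bit weights to a little over 3 GiB, which is why a 7B model can fit on an 8GB card once quantized.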
Dependencies
BitsAndBytes (4-bit / 8-bit)
- bitsandbytes >= 0.39.0 (4-bit quantization)
- bitsandbytes >= 0.37.0 (8-bit quantization)
- bitsandbytes >= 0.43.0 (FSDP+QLoRA or auto device map)
GPTQ
- gptqmodel >= 2.0.0
- optimum >= 1.24.0 (for export)
AWQ
- autoawq
AQLM (2-bit)
- aqlm[gpu] >= 1.1.0
HQQ (1-8 bit)
- hqq
EETQ (8-bit)
- eetq
Credentials
No additional credentials required beyond the core environment.
Quick Install
# BitsAndBytes (most common for QLoRA)
pip install "bitsandbytes>=0.43.0"
# GPTQ quantization export
pip install "gptqmodel>=2.0.0" "optimum>=1.24.0"
# AWQ
pip install autoawq --no-build-isolation
# AQLM
pip install "aqlm[gpu]>=1.1.0"
# HQQ
pip install hqq
# EETQ
pip install eetq
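After installing, it can be useful to confirm which backends are present and whether they meet the minimum versions listed under Dependencies. The following is a minimal sketch using only the standard library. The `MINIMUMS` table and helper names are illustrative, the distribution names are assumed to match the pip package names above, and the version parsing is deliberately simplified (it compares only the leading numeric components and ignores pre-release tags). `packaging.version` is the more robust choice if it is available.

```python
import re
from importlib import metadata

# Minimums copied from the Dependencies section; None means no minimum is stated.
MINIMUMS = {
    "bitsandbytes": "0.43.0",
    "gptqmodel": "2.0.0",
    "optimum": "1.24.0",
    "autoawq": None,
    "aqlm": "1.1.0",
    "hqq": None,
    "eetq": None,
}


def version_tuple(version: str) -> tuple:
    """Leading numeric release components, e.g. '0.43.1.post1' -> (0, 43, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed: str, minimum) -> bool:
    """Compare release tuples, padding the shorter one with zeros."""
    if minimum is None:
        return True
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report() -> dict:
    """Map each distribution name to 'ok', 'not installed', or a 'too old' note."""
    status = {}
    for dist, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(dist)
        except metadata.PackageNotFoundError:
            status[dist] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            status[dist] = "ok"
        else:
            status[dist] = f"too old ({installed} < {minimum})"
    return status


for name, state in report().items():
    print(f"{name:<14}{state}")
```

Only the backends you actually use need to show `ok`; the rest can stay uninstalled.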
Code Evidence
BitsAndBytes version checks from src/llamafactory/model/model_utils/quantization.py:167-193:
if model_args.quantization_method == QuantizationMethod.BNB:
    if model_args.quantization_bit == 8:
        check_version("bitsandbytes>=0.37.0", mandatory=True)
        init_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
    elif model_args.quantization_bit == 4:
        check_version("bitsandbytes>=0.39.0", mandatory=True)
        init_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=model_args.compute_dtype,
            bnb_4bit_use_double_quant=model_args.double_quantization,
            bnb_4bit_quant_type=model_args.quantization_type,
            bnb_4bit_quant_storage=model_args.compute_dtype,  # crucial for fsdp+qlora
        )
FSDP+QLoRA restriction from src/llamafactory/model/model_utils/quantization.py:186-190:
if is_deepspeed_zero3_enabled() or is_fsdp_enabled() or model_args.quantization_device_map == "auto":
    if model_args.quantization_bit != 4:
        raise ValueError("Only 4-bit quantized model can use fsdp+qlora or auto device map.")
    check_version("bitsandbytes>=0.43.0", mandatory=True)
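Taken together, the two BitsAndBytes excerpts imply a small decision table: the minimum bitsandbytes version depends on the bit width and on whether sharded training (ZeRO-3 or FSDP) or an auto device map is in use. The sketch below restates that logic as a standalone function. `required_bitsandbytes` and its parameters are hypothetical names for illustration, not functions in LLaMA Factory, and the error strings are copied from this page.

```python
def required_bitsandbytes(bits: int, sharded_or_auto_map: bool) -> str:
    """Return the strictest bitsandbytes requirement implied by the checks above.

    bits: requested quantization width (4 or 8).
    sharded_or_auto_map: True when ZeRO-3, FSDP, or an "auto" device map is used.
    """
    if bits not in (4, 8):
        raise ValueError("Bitsandbytes only accepts 4-bit or 8-bit quantization.")
    if sharded_or_auto_map:
        if bits != 4:
            raise ValueError("Only 4-bit quantized model can use fsdp+qlora or auto device map.")
        return "bitsandbytes>=0.43.0"
    return "bitsandbytes>=0.39.0" if bits == 4 else "bitsandbytes>=0.37.0"


print(required_bitsandbytes(4, sharded_or_auto_map=False))
print(required_bitsandbytes(4, sharded_or_auto_map=True))
print(required_bitsandbytes(8, sharded_or_auto_map=False))
```

Installing 0.43.0 or newer satisfies every row of this table, which is why the Quick Install command pins that version.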
PTQ incompatibility from src/llamafactory/model/model_utils/quantization.py:97-101:
if quant_method not in (QuantizationMethod.MXFP4, QuantizationMethod.FP8) and (
    is_deepspeed_zero3_enabled() or is_fsdp_enabled()
):
    raise ValueError("DeepSpeed ZeRO-3 or FSDP is incompatible with PTQ-quantized models.")
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| Bitsandbytes only accepts 4-bit or 8-bit quantization | Invalid quantization_bit value | Set quantization_bit to 4 or 8 |
| Only 4-bit quantized model can use fsdp+qlora or auto device map | 8-bit with FSDP/DeepSpeed | Use 4-bit quantization with FSDP/DeepSpeed |
| DeepSpeed ZeRO-3 or FSDP is incompatible with PTQ-quantized models | Using GPTQ/AWQ model with ZeRO-3 | Use non-quantized model or single GPU |
| HQQ quantization is incompatible with DeepSpeed ZeRO-3 or FSDP | HQQ with distributed training | Switch to BitsAndBytes 4-bit for distributed |
| EETQ only accepts 8-bit quantization | Wrong bit setting for EETQ | Set quantization_bit=8 for EETQ |
| Quantization is only compatible with the LoRA or OFT method | Full fine-tuning with quantization | Use finetuning_type=lora or oft |
| Cannot resize embedding layers of a quantized model | resize_vocab with quantization | Disable resize_vocab for quantized models |
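Most of these errors can be anticipated before a run starts. The sketch below is a hypothetical preflight check that encodes the rows of the table as plain conditions and returns every message that would apply. `QuantSettings` and `preflight` are illustrative names, not LLaMA Factory classes, the lowercase method strings are assumptions, and the order of checks may differ from the real code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuantSettings:
    """Hypothetical container for the settings the error table refers to."""
    method: str = "bnb"            # assumed labels: "bnb", "hqq", "eetq", "gptq", "awq", "mxfp4", "fp8"
    bits: Optional[int] = 4        # None when loading a pre-quantized (PTQ) checkpoint
    pre_quantized: bool = False    # True for PTQ checkpoints such as GPTQ or AWQ
    sharded: bool = False          # True when DeepSpeed ZeRO-3 or FSDP is enabled
    finetuning_type: str = "lora"
    resize_vocab: bool = False


def preflight(s: QuantSettings) -> List[str]:
    """Return the error messages from the table that these settings would trigger."""
    problems = []
    if s.pre_quantized:
        # MXFP4 and FP8 are exempt, matching the PTQ check quoted under Code Evidence.
        if s.sharded and s.method not in ("mxfp4", "fp8"):
            problems.append("DeepSpeed ZeRO-3 or FSDP is incompatible with PTQ-quantized models")
    elif s.method == "bnb":
        if s.bits not in (4, 8):
            problems.append("Bitsandbytes only accepts 4-bit or 8-bit quantization")
        elif s.sharded and s.bits != 4:
            problems.append("Only 4-bit quantized model can use fsdp+qlora or auto device map")
    elif s.method == "hqq":
        if s.sharded:
            problems.append("HQQ quantization is incompatible with DeepSpeed ZeRO-3 or FSDP")
    elif s.method == "eetq":
        if s.bits != 8:
            problems.append("EETQ only accepts 8-bit quantization")
    if s.finetuning_type not in ("lora", "oft"):
        problems.append("Quantization is only compatible with the LoRA or OFT method")
    if s.resize_vocab:
        problems.append("Cannot resize embedding layers of a quantized model")
    return problems


# Example: 8-bit BitsAndBytes under FSDP with full fine-tuning trips two rows.
for message in preflight(QuantSettings(bits=8, sharded=True, finetuning_type="full")):
    print(message)
```

Returning a list rather than raising on the first failure lets all conflicting settings be fixed in one pass.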
Compatibility Notes
- BitsAndBytes 4-bit: Only on-the-fly quantization method compatible with FSDP and DeepSpeed ZeRO-3 (requires bitsandbytes >= 0.43.0). The bnb_4bit_quant_storage parameter is crucial for FSDP+QLoRA.
- GPTQ: Exllama kernel is disabled by default. Force fp16 compute dtype during export. ChatGLM models are not supported.
- HQQ: Uses ATEN kernel (axis=0) for performance. Incompatible with all distributed training.
- EETQ: Only supports 8-bit quantization. Incompatible with distributed training.
- PTQ Models: MXFP4 and FP8 pre-quantized models are dequantized on load, making them compatible with distributed training.