Heuristic: Hiyouga LLaMA Factory Quantized Training Best Practices
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Configuration_Best_Practice |
| Last Updated | 2026-02-06 20:00 GMT |
Overview
Best practices and restrictions for training with quantized models, including QLoRA configuration, evaluation caveats, and compatibility constraints.
Description
Quantized training in LLaMA Factory (QLoRA) has specific constraints and best practices that are enforced through validation checks and warnings. Only LoRA and OFT fine-tuning methods are compatible with quantization. Several operations are restricted: vocabulary resizing, PiSSA initialization, and creating new adapters on quantized models. For distributed training, only BitsAndBytes 4-bit quantization works with FSDP and DeepSpeed ZeRO-3, requiring bnb_4bit_quant_storage to match the compute dtype.
Usage
Use this heuristic whenever training or evaluating quantized models (4-bit or 8-bit). Follow these rules to avoid common failures and suboptimal results.
The Insight (Rule of Thumb)
- Action 1: Use `finetuning_type=lora` or `finetuning_type=oft` with quantized models. Full and freeze fine-tuning are not supported.
- Action 2: Enable `upcast_layernorm=True` for better training stability with quantized models.
- Action 3: Use only a single adapter checkpoint with quantized models. Merge multiple adapters first.
- Action 4: Do not use `quantization_device_map=auto` during training; it is inference-only.
- Action 5: For FSDP+QLoRA, `bnb_4bit_quant_storage` is automatically set to match `compute_dtype` (crucial for correct operation).
- Evaluation Warning: Evaluating in 4/8-bit mode may produce lower scores than full-precision evaluation.
- Trade-off: Quantization reduces VRAM by 50-75% but may slightly decrease model quality. Double quantization (`double_quantization=True`) further reduces memory at a small accuracy cost.
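Put together, a QLoRA run that satisfies these constraints might use settings like the following sketch. The keys mirror LLaMA Factory hyperparameter names quoted in this document; the concrete values are illustrative examples, not a recommended recipe:

```python
# Illustrative QLoRA settings (keys mirror LLaMA Factory hyperparameter
# names from this document; values are example choices, not defaults).
qlora_args = {
    "quantization_bit": 4,         # BitsAndBytes 4-bit quantization
    "finetuning_type": "lora",     # only lora/oft are compatible with quantization
    "upcast_layernorm": True,      # better training stability when quantized
    "double_quantization": True,   # extra memory savings, small accuracy cost
    "compute_dtype": "bfloat16",   # also mirrored into bnb_4bit_quant_storage
}
```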
Reasoning
Quantization method restriction from src/llamafactory/hparams/parser.py:126-128:
```python
if model_args.quantization_bit is not None:
    if finetuning_args.finetuning_type not in ["lora", "oft"]:
        raise ValueError("Quantization is only compatible with the LoRA or OFT method.")
```
Single adapter restriction from src/llamafactory/hparams/parser.py:139-140:
```python
if model_args.adapter_name_or_path is not None and len(model_args.adapter_name_or_path) != 1:
    raise ValueError("Quantized model only accepts a single adapter. Merge them first.")
```
Evaluation warning from src/llamafactory/hparams/parser.py:386-387:
```python
if (not training_args.do_train) and model_args.quantization_bit is not None:
    logger.warning_rank0("Evaluating model in 4/8-bit mode may cause lower scores.")
```
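The checks above can be condensed into a single standalone validator. This is an illustrative sketch, not LLaMA Factory's actual API; the function name and dict-based argument handling are hypothetical:

```python
def check_quantized_setup(args: dict) -> list[str]:
    """Collect rule violations/warnings for a quantized (QLoRA) run.
    Hypothetical helper; key names mirror LLaMA Factory hyperparameters."""
    problems = []
    if args.get("quantization_bit") is None:
        return problems  # not quantized: nothing to check
    if args.get("finetuning_type") not in ("lora", "oft"):
        problems.append("Quantization is only compatible with the LoRA or OFT method.")
    adapters = args.get("adapter_name_or_path")
    if adapters is not None and len(adapters) != 1:
        problems.append("Quantized model only accepts a single adapter. Merge them first.")
    if not args.get("do_train"):
        problems.append("Evaluating model in 4/8-bit mode may cause lower scores.")
    return problems
```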
FSDP+QLoRA quant storage from src/llamafactory/model/model_utils/quantization.py:173-179:
```python
init_kwargs["quantization_config"] = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=model_args.compute_dtype,
    bnb_4bit_use_double_quant=model_args.double_quantization,
    bnb_4bit_quant_type=model_args.quantization_type,
    bnb_4bit_quant_storage=model_args.compute_dtype,  # crucial for fsdp+qlora
)
```
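The reason the storage dtype must mirror the compute dtype is that FSDP and DeepSpeed ZeRO-3 flatten and shard parameters into common buffers, so the 4-bit storage tensors need the same dtype as the rest of the model. A minimal sketch of that invariant (the helper `make_bnb_kwargs` is hypothetical; the keyword names are the `BitsAndBytesConfig` fields quoted above):

```python
def make_bnb_kwargs(compute_dtype: str, double_quant: bool = True) -> dict:
    """Build BitsAndBytesConfig keyword arguments where the 4-bit storage
    dtype mirrors the compute dtype, as required for FSDP/ZeRO-3 + QLoRA.
    Illustrative helper, not LLaMA Factory code."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_compute_dtype": compute_dtype,
        "bnb_4bit_use_double_quant": double_quant,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_quant_storage": compute_dtype,  # must equal the compute dtype
    }
```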