Heuristic: Hiyouga LLaMA Factory Quantized Training Best Practices
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Configuration_Best_Practice |
| Last Updated | 2026-02-06 20:00 GMT |
Overview
Best practices and restrictions for training with quantized models, including QLoRA configuration, evaluation caveats, and compatibility constraints.
Description
Quantized training in LLaMA Factory (QLoRA) has specific constraints and best practices that are enforced through validation checks and warnings. Only LoRA and OFT fine-tuning methods are compatible with quantization. Several operations are restricted: vocabulary resizing, PiSSA initialization, and creating new adapters on quantized models. For distributed training, only BitsAndBytes 4-bit quantization works with FSDP and DeepSpeed ZeRO-3, requiring bnb_4bit_quant_storage to match the compute dtype.
Usage
Use this heuristic whenever training or evaluating quantized models (4-bit or 8-bit). Follow these rules to avoid common failures and suboptimal results.
The Insight (Rule of Thumb)
- Action 1: Use `finetuning_type=lora` or `finetuning_type=oft` with quantized models. Full and freeze fine-tuning are not supported.
- Action 2: Enable `upcast_layernorm=True` for better training stability with quantized models.
- Action 3: Use only a single adapter checkpoint with quantized models. Merge multiple adapters first.
- Action 4: Do not use `quantization_device_map=auto` during training; it is inference-only.
- Action 5: For FSDP+QLoRA, `bnb_4bit_quant_storage` is automatically set to match `compute_dtype` (crucial for correct operation).
- Evaluation Warning: Evaluating in 4/8-bit mode may produce lower scores than full-precision evaluation.
- Trade-off: Quantization reduces VRAM by 50-75% but may slightly decrease model quality. Double quantization (`double_quantization=True`) further reduces memory at a small accuracy cost.
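Put together, a QLoRA run that satisfies these constraints might use settings like the following sketch. The keys mirror LLaMA Factory hyperparameter names quoted in this document; the concrete values are illustrative examples, not a recommended recipe:

```python
# Illustrative QLoRA settings (keys mirror LLaMA Factory hyperparameter
# names from this document; values are example choices, not defaults).
qlora_args = {
    "quantization_bit": 4,         # BitsAndBytes 4-bit quantization
    "finetuning_type": "lora",     # only lora/oft are compatible with quantization
    "upcast_layernorm": True,      # better training stability when quantized
    "double_quantization": True,   # extra memory savings, small accuracy cost
    "compute_dtype": "bfloat16",   # also mirrored into bnb_4bit_quant_storage
}
```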
Reasoning
Quantization method restriction from src/llamafactory/hparams/parser.py:126-128:
```python
if model_args.quantization_bit is not None:
    if finetuning_args.finetuning_type not in ["lora", "oft"]:
        raise ValueError("Quantization is only compatible with the LoRA or OFT method.")
```
Single adapter restriction from src/llamafactory/hparams/parser.py:139-140:
```python
if model_args.adapter_name_or_path is not None and len(model_args.adapter_name_or_path) != 1:
    raise ValueError("Quantized model only accepts a single adapter. Merge them first.")
```
Evaluation warning from src/llamafactory/hparams/parser.py:386-387:
```python
if (not training_args.do_train) and model_args.quantization_bit is not None:
    logger.warning_rank0("Evaluating model in 4/8-bit mode may cause lower scores.")
```
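The checks above can be condensed into a single standalone validator. This is an illustrative sketch, not LLaMA Factory's actual API; the function name and dict-based argument handling are hypothetical:

```python
def check_quantized_setup(args: dict) -> list[str]:
    """Collect rule violations/warnings for a quantized (QLoRA) run.
    Hypothetical helper; key names mirror LLaMA Factory hyperparameters."""
    problems = []
    if args.get("quantization_bit") is None:
        return problems  # not quantized: nothing to check
    if args.get("finetuning_type") not in ("lora", "oft"):
        problems.append("Quantization is only compatible with the LoRA or OFT method.")
    adapters = args.get("adapter_name_or_path")
    if adapters is not None and len(adapters) != 1:
        problems.append("Quantized model only accepts a single adapter. Merge them first.")
    if not args.get("do_train"):
        problems.append("Evaluating model in 4/8-bit mode may cause lower scores.")
    return problems
```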
FSDP+QLoRA quant storage from src/llamafactory/model/model_utils/quantization.py:173-179:
```python
init_kwargs["quantization_config"] = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=model_args.compute_dtype,
    bnb_4bit_use_double_quant=model_args.double_quantization,
    bnb_4bit_quant_type=model_args.quantization_type,
    bnb_4bit_quant_storage=model_args.compute_dtype,  # crucial for fsdp+qlora
)
```
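The reason the storage dtype must mirror the compute dtype is that FSDP and DeepSpeed ZeRO-3 flatten and shard parameters into common buffers, so the 4-bit storage tensors need the same dtype as the rest of the model. A minimal sketch of that invariant (the helper `make_bnb_kwargs` is hypothetical; the keyword names are the `BitsAndBytesConfig` fields quoted above):

```python
def make_bnb_kwargs(compute_dtype: str, double_quant: bool = True) -> dict:
    """Build BitsAndBytesConfig keyword arguments where the 4-bit storage
    dtype mirrors the compute dtype, as required for FSDP/ZeRO-3 + QLoRA.
    Illustrative helper, not LLaMA Factory code."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_compute_dtype": compute_dtype,
        "bnb_4bit_use_double_quant": double_quant,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_quant_storage": compute_dtype,  # must equal the compute dtype
    }
```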