Heuristic:Intel Ipex llm NF4 Quantization Best Practice
| Knowledge Sources | |
|---|---|
| Domains | LLM_Finetuning, Optimization |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Use NF4 (NormalFloat4) quantization instead of INT4 for QLoRA finetuning, and exclude `lm_head` from quantization to preserve output quality.
Description
When loading a model for QLoRA finetuning, the quantization type significantly affects model quality. The QLoRA paper presents NormalFloat4 (NF4) as an information-theoretically optimal data type for normally distributed weights and reports better model quality than with standard INT4 quantization. The language modeling head (`lm_head`) should also be excluded from low-bit conversion, because quantizing the output prediction layer disproportionately hurts token prediction quality.
Usage
Use this heuristic when configuring `BitsAndBytesConfig` for QLoRA, or when passing `load_in_low_bit` for any 4-bit model loading in IPEX-LLM. Apply it whenever you need to balance model compression against quality preservation.
The Insight (Rule of Thumb)
- Action: Set `bnb_4bit_quant_type="nf4"` in `BitsAndBytesConfig` instead of `"int4"`.
- Action: Set `modules_to_not_convert=["lm_head"]` when loading models.
- Action: Set `bnb_4bit_compute_dtype=torch.bfloat16` for compute precision.
- Action: Set `bnb_4bit_use_double_quant=False` (double quantization disabled by default in IPEX-LLM examples).
- Trade-off: NF4 has the same memory footprint as INT4 (4 bits per weight) but may require slightly more compute during dequantization. Keeping `lm_head` unquantized costs some additional memory.
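The memory side of the trade-off can be estimated with simple arithmetic. The sketch below is illustrative only: the model dimensions (hidden size 4096, vocabulary 32000), the block size of 64, and the 2-byte per-block scale are assumptions made for the example, not values taken from IPEX-LLM.

```python
def low_bit_bytes(num_weights, bits=4, block_size=64, scale_bytes=2):
    """Approximate storage for block-wise low-bit weights.

    block_size and scale_bytes are illustrative assumptions; the real
    layout depends on the quantization backend.
    """
    packed = num_weights * bits // 8          # packed 4-bit payload
    num_blocks = -(-num_weights // block_size)  # ceiling division
    return packed + num_blocks * scale_bytes  # payload plus per-block scales


# Hypothetical lm_head shape: hidden size 4096, vocabulary size 32000.
lm_head_weights = 4096 * 32000

lm_head_4bit = low_bit_bytes(lm_head_weights)  # if lm_head were quantized
lm_head_bf16 = lm_head_weights * 2             # bfloat16 uses 2 bytes per weight
extra_bytes = lm_head_bf16 - lm_head_4bit      # cost of excluding lm_head

print(f"lm_head at 4 bits:  {lm_head_4bit / 2**20:.1f} MiB")
print(f"lm_head at bf16:    {lm_head_bf16 / 2**20:.1f} MiB")
print(f"extra for excluding lm_head: {extra_bytes / 2**20:.1f} MiB")
```

Under these assumptions the extra memory is small compared with a multi-billion-parameter base model, which is why the exclusion is usually an acceptable cost.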
Reasoning
The QLoRA paper (Dettmers et al., 2023) argues that NF4 is information-theoretically optimal for quantizing normally distributed weights. Because pretrained transformer weights are approximately normally distributed, NF4 produces lower quantization error than uniform INT4 at the same bit width. Excluding `lm_head` keeps the final token prediction layer at its original, unquantized precision, which matters because small errors in the output logits can significantly change the predicted token probabilities.
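The quantization-error argument can be checked on synthetic data. The following is a minimal sketch, not the IPEX-LLM or bitsandbytes kernel: it quantizes normally distributed values block by block, once with the NF4 code values published with QLoRA and once with a symmetric absmax INT4 scheme, and compares the mean squared reconstruction error. The block size of 64, the standard deviation of 0.02, and the helper names are assumptions chosen for illustration.

```python
import random

# NF4 code values as published with QLoRA (normalized to [-1, 1]).
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def quantize_block_nf4(block):
    """Round each value to the nearest NF4 level after absmax scaling."""
    absmax = max(abs(w) for w in block) or 1.0
    return [
        absmax * min(NF4_LEVELS, key=lambda level: abs(w / absmax - level))
        for w in block
    ]


def quantize_block_int4(block):
    """Symmetric absmax INT4: integers in [-8, 7] with scale absmax / 7."""
    absmax = max(abs(w) for w in block) or 1.0
    scale = absmax / 7
    return [scale * max(-8, min(7, round(w / scale))) for w in block]


def mean_squared_error(weights, quantizer, block_size=64):
    """Quantize block-wise and return the mean squared reconstruction error."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        total += sum((w - q) ** 2 for w, q in zip(block, quantizer(block)))
    return total / len(weights)


rng = random.Random(0)
# Synthetic stand-in for pretrained weights: zero-mean normal samples.
weights = [rng.gauss(0.0, 0.02) for _ in range(64 * 512)]

nf4_mse = mean_squared_error(weights, quantize_block_nf4)
int4_mse = mean_squared_error(weights, quantize_block_int4)
print(f"NF4 MSE:  {nf4_mse:.3e}")
print(f"INT4 MSE: {int4_mse:.3e}")
```

This only measures reconstruction error on idealized Gaussian data; the effect on downstream model quality is what the QLoRA paper evaluates.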
Code Evidence
NF4 configuration from `alpaca_qlora_finetuning.py:175-182`:
# According to the QLoRA paper, using "nf4" could yield better model quality than "int4"
# use bnb_config for qlora/qalora/relora, which use 4bit for base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
lm_head exclusion from `alpaca_qlora_finetuning.py:171`:
modules_to_not_convert=["lm_head"],
Same pattern in DPO from `dpo_finetuning.py:114-118`:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
lm_head exclusion for reference model in DPO from `dpo_finetuning.py:142`:
modules_to_not_convert=["lm_head"],
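For orientation, the `lm_head` fragment above might sit in a loading call like the one sketched here. This is a hedged illustration rather than a copy of the example scripts: the helper names are hypothetical, the `ipex_llm.transformers` import path and the `torch_dtype` argument are assumptions about the installed IPEX-LLM version, and the model identifier is left to the caller.

```python
def nf4_loading_kwargs():
    """Keyword arguments that encode the heuristic (hypothetical helper)."""
    return {
        "load_in_low_bit": "nf4",              # NF4 rather than INT4
        "modules_to_not_convert": ["lm_head"],  # keep the output layer unquantized
    }


def load_base_model(base_model):
    """Load a base model in NF4 with lm_head excluded (sketch).

    Imports are deferred so that this file can be imported on a machine
    without torch or IPEX-LLM installed.
    """
    import torch
    # Assumed import path for the IPEX-LLM drop-in model class.
    from ipex_llm.transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,  # matches the bfloat16 compute dtype above
        **nf4_loading_kwargs(),
    )
```

Keeping the keyword arguments in one helper makes it easier to apply the same settings to both the policy model and the reference model in a DPO setup.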