Heuristic:Intel Ipex llm NF4 Quantization Best Practice

From Leeroopedia



Knowledge Sources
Domains: LLM_Finetuning, Optimization
Last Updated: 2026-02-09 12:00 GMT

Overview

Use NF4 (NormalFloat4) quantization instead of INT4 for QLoRA finetuning, and exclude `lm_head` from quantization to preserve output quality.

Description

When loading models for QLoRA finetuning, the choice of 4-bit quantization type significantly affects model quality. The QLoRA paper (Dettmers et al., 2023) introduces NormalFloat4 (NF4) as an information-theoretically optimal data type for normally distributed weights and reports better model quality with it than with standard INT4 quantization. Additionally, the language modeling head (`lm_head`) should be excluded from low-bit conversion, because quantizing the output prediction layer disproportionately hurts token prediction quality.
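The quantization-error argument can be illustrated with a small, dependency-free experiment. The sketch below is not IPEX-LLM code: it draws synthetic Gaussian "weights" (the block size of 64, the standard deviation of 0.02, and the helper names are all assumptions made for illustration), quantizes each block with absmax scaling onto the 16 NF4 code values published with QLoRA, and compares the mean squared error against a simplified symmetric round-to-nearest INT4 grid.

```python
import random

# The 16 NF4 code values published with QLoRA / bitsandbytes, normalized to [-1, 1].
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def quantize_nf4(block):
    """Absmax-scale the block, snap each value to the nearest NF4 level, rescale."""
    absmax = max(abs(x) for x in block) or 1.0
    return [absmax * min(NF4_LEVELS, key=lambda q: abs(q - x / absmax)) for x in block]


def quantize_uniform_int4(block):
    """Simplified symmetric INT4: evenly spaced integer codes in [-8, 7]."""
    absmax = max(abs(x) for x in block) or 1.0
    scale = absmax / 7
    return [scale * max(-8, min(7, round(x / scale))) for x in block]


def mse(quantize, blocks):
    """Mean squared reconstruction error over all blocks."""
    total, count = 0.0, 0
    for block in blocks:
        for original, restored in zip(block, quantize(block)):
            total += (original - restored) ** 2
            count += 1
    return total / count


# Synthetic stand-in for pretrained weights (assumption: zero-mean Gaussian).
rng = random.Random(0)
blocks = [[rng.gauss(0.0, 0.02) for _ in range(64)] for _ in range(200)]

nf4_mse = mse(quantize_nf4, blocks)
int4_mse = mse(quantize_uniform_int4, blocks)
print(f"NF4 MSE:  {nf4_mse:.3e}")
print(f"INT4 MSE: {int4_mse:.3e}")
```

Both schemes spend 4 bits per weight here; the only difference is where the 16 levels sit. NF4 places more levels near zero, where most Gaussian samples fall, which is why it tends to reconstruct such data with lower error.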

Usage

Use this heuristic when configuring `BitsAndBytesConfig` for QLoRA, or when passing `load_in_low_bit` for any 4-bit model loading in IPEX-LLM. It applies whenever a model must be compressed to 4 bits with as little quality loss as possible.

The Insight (Rule of Thumb)

  • Action: Set `bnb_4bit_quant_type="nf4"` in `BitsAndBytesConfig` instead of `"int4"`.
  • Action: Set `modules_to_not_convert=["lm_head"]` when loading models.
  • Action: Set `bnb_4bit_compute_dtype=torch.bfloat16` for compute precision.
  • Action: Set `bnb_4bit_use_double_quant=False` (the IPEX-LLM examples cited below disable double quantization).
  • Trade-off: NF4 has identical memory footprint to INT4 (4 bits per weight) but requires slightly more compute during dequantization.
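For the `load_in_low_bit` path mentioned under Usage, the same rule of thumb might look like the sketch below. The helper name `load_nf4_base_model` is hypothetical, and the `optimize_model=False` and `torch_dtype` arguments are assumptions that are not taken from this page. The imports are deferred into the function so the snippet can be pasted into a module without requiring `torch` or `ipex_llm` at import time.

```python
def load_nf4_base_model(base_model: str):
    """Load a base model in NF4 while leaving lm_head unquantized.

    `base_model` is a Hugging Face model ID or a local path (hypothetical
    parameter name chosen for this sketch).
    """
    # Deferred imports: both packages are heavy and must be installed separately.
    import torch
    from ipex_llm.transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_low_bit="nf4",               # NF4 rather than an INT4 variant
        optimize_model=False,                # assumption, not stated on this page
        torch_dtype=torch.bfloat16,          # bf16 for the non-quantized parts
        modules_to_not_convert=["lm_head"],  # keep the output head unquantized
    )
```

A call such as `model = load_nf4_base_model("path/to/base-model")` would then return the quantized base model, ready for LoRA adapters to be attached.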

Reasoning

The QLoRA paper (Dettmers et al., 2023) argues that NF4 is information-theoretically optimal for quantizing normally distributed neural network weights. Since pretrained transformer weights approximately follow a zero-centered normal distribution, NF4 produces lower quantization error than uniform INT4 at the same 4 bits per weight. Excluding `lm_head` keeps the final token prediction layer in its original, unquantized precision, which matters because small errors in the output logits can significantly change predicted token probabilities.
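The sensitivity of the output layer can be shown with a toy softmax calculation. The logits and the perturbation below are made-up values chosen for illustration, not measurements from any model: two candidate tokens have nearly equal logits, and an error of 0.1 on each is enough to swap which token is most likely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for three tokens; the first two are close competitors.
clean_logits = [5.0, 4.9, 1.0]
# Hypothetical error of the kind a quantized output layer might introduce.
perturbation = [-0.1, 0.1, 0.0]
noisy_logits = [l + p for l, p in zip(clean_logits, perturbation)]

clean_probs = softmax(clean_logits)
noisy_probs = softmax(noisy_logits)

print("clean top token:", argmax(clean_probs))
print("noisy top token:", argmax(noisy_probs))
```

Under greedy decoding, the swapped top token would alter the rest of the generated sequence, which is one way to see why the rule of thumb leaves `lm_head` out of low-bit conversion.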

Code Evidence

NF4 configuration from `alpaca_qlora_finetuning.py:175-182`:

# According to the QLoRA paper, using "nf4" could yield better model quality than "int4"
# use bnb_config for qlora/qalora/relora, which use 4bit for base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

`lm_head` exclusion from `alpaca_qlora_finetuning.py:171`:

modules_to_not_convert=["lm_head"],

Same pattern in DPO from `dpo_finetuning.py:114-118`:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

`lm_head` exclusion for the reference model in DPO from `dpo_finetuning.py:142`:

modules_to_not_convert=["lm_head"],
