Heuristic:Intel Ipex llm LoRA Target All Linear Layers

From Leeroopedia






Knowledge Sources
Domains: LLM_Finetuning, Optimization
Last Updated: 2026-02-09 12:00 GMT

Overview

Target all linear layers (q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj) for LoRA adaptation instead of just attention projections, as recommended by the QLoRA paper.

Description

Early LoRA setups typically adapt only the query and value projection matrices (q_proj, v_proj) in the attention layers. The QLoRA paper reports that targeting all linear layers, in both the attention mechanism and the feed-forward network, yields better adaptation quality. The IPEX-LLM examples apply this recommendation by default, targeting 7 linear layers per transformer block in Llama-family models.

Usage

Use this heuristic when configuring LoraConfig for any QLoRA, LoRA, or DPO finetuning run. The default target modules in IPEX-LLM examples already implement this recommendation. Adjust the module names if using non-Llama architectures (e.g., ChatGLM has different projection names).
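For architectures whose projection names differ from Llama's, one way to find candidate `target_modules` is to list the leaf names of the model's linear layers. The helper below is a minimal sketch, not part of IPEX-LLM or PEFT: `linear_leaf_names` and its `exclude` default are hypothetical names chosen for illustration, and the layer classes to match are passed in by the caller.

```python
from typing import Iterable, List, Tuple, Type


def linear_leaf_names(
    named_modules: Iterable[Tuple[str, object]],
    linear_types: Tuple[Type, ...],
    exclude: Tuple[str, ...] = ("lm_head",),
) -> List[str]:
    """Return the sorted, de-duplicated leaf names of all linear-type modules.

    named_modules: (qualified_name, module) pairs, e.g. model.named_modules().
    linear_types:  classes to treat as linear layers, e.g. (torch.nn.Linear,).
    exclude:       leaf names to skip; the output head is usually not adapted
                   (this default is an assumption, adjust as needed).
    """
    names = set()
    for qualified_name, module in named_modules:
        if isinstance(module, linear_types):
            # "model.layers.0.self_attn.q_proj" -> "q_proj"
            leaf = qualified_name.rsplit(".", 1)[-1]
            if leaf not in exclude:
                names.add(leaf)
    return sorted(names)


# Typical use with a PyTorch model (assumes torch is installed):
#   import torch.nn as nn
#   target_modules = linear_leaf_names(model.named_modules(), (nn.Linear,))
```

On a model that has already been quantized, the linear layers may have been replaced by a library-specific class, so that class would need to be passed in `linear_types` instead of `torch.nn.Linear`.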

The Insight (Rule of Thumb)

  • Action: Set `target_modules` to include all 7 linear layers for Llama-family models:
    • Attention: `q_proj`, `k_proj`, `v_proj`, `o_proj`
    • Feed-Forward: `up_proj`, `down_proj`, `gate_proj`
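  • Value: Default `lora_r=8`, `lora_alpha=16`, `lora_dropout=0.05`. Note that the DPO example quoted under Code Evidence sets `r=16` with the same `lora_alpha` and `lora_dropout`.
  • Trade-off: More target modules mean more trainable parameters (still small relative to the full model) and slightly more memory, in exchange for the better adaptation quality reported by the QLoRA paper.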
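As background for the values above, the standard LoRA formulation (taken from the LoRA paper, not restated on this page) shows how `r` and `lora_alpha` interact and where the extra parameters come from. This assumes the usual `lora_alpha / r` scaling; other scaling variants exist.

```latex
W' = W + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}},
\qquad \#\text{params per module} = r\,(d_{\text{in}} + d_{\text{out}})
```

Under that assumption, `lora_r=8` with `lora_alpha=16` gives a scale of 2, while `r=16` with `lora_alpha=16` gives a scale of 1, and each additional target module adds `r * (d_in + d_out)` trainable parameters.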

Reasoning

The QLoRA paper found that adapting all linear layers lets the LoRA adapter capture changes along the full computation path of each transformer block. Adapting only the q/v projections limits the adapter's expressiveness: it can modify attention patterns but not the feed-forward transformation. In Llama-family models, the feed-forward network (up_proj, down_proj, gate_proj) processes the attention output and accounts for most of each block's linear-layer weights (roughly two thirds in a 7B Llama model). With `lora_r=8` across 7 modules, the total trainable parameter count remains under 1% of the full model.
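The "under 1%" figure can be checked with a short calculation. The sketch below assumes Llama-2-7B shapes (hidden size 4096, feed-forward size 11008, 32 blocks, about 6.74B base parameters); these numbers are not stated on this page, and `lora_params` is a hypothetical helper written for illustration.

```python
# Assumed Llama-2-7B shapes (not taken from this page).
HIDDEN, INTERMEDIATE, NUM_LAYERS = 4096, 11008, 32
TOTAL_PARAMS = 6_738_415_616  # approximate base-model parameter count (assumption)

# (in_features, out_features) of each targeted linear layer in one block.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(modules, r):
    """Trainable LoRA parameters: each module adds A (r x in) and B (out x r)."""
    per_block = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in modules)
    return per_block * NUM_LAYERS


all_linear = lora_params(SHAPES, r=8)
qv_only = lora_params(["q_proj", "v_proj"], r=8)
print(f"all linear: {all_linear:,} ({all_linear / TOTAL_PARAMS:.2%})")
print(f"q/v only:   {qv_only:,} ({qv_only / TOTAL_PARAMS:.2%})")
```

Under these assumptions, targeting all 7 modules at `r=8` trains about 20.0M parameters (roughly 0.3% of the model), compared with about 4.2M (roughly 0.06%) for q/v only. At `r=16` the all-linear count doubles to about 40.0M, which is still below 1%.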

Code Evidence

Full target modules from `alpaca_qlora_finetuning.py:87-95`:

lora_target_modules: List[str] = [
    "q_proj",
    "v_proj",
    "k_proj",
    "o_proj",
    "up_proj",
    "down_proj",
    "gate_proj"
],  # according to the QLoRA paper (https://arxiv.org/pdf/2305.14314.pdf), it's suggested to fine tune all linear layers

Same pattern in DPO from `dpo_finetuning.py:105-112`:

peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)

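LoRA config creation from `alpaca_qlora_finetuning.py:212-220` (`training_mode` is not an argument of upstream PEFT's `LoraConfig`, so this call presumably relies on the `LoraConfig` variant shipped with IPEX-LLM):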

config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    target_modules=lora_target_modules,
    lora_dropout=lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
    training_mode=training_mode,
)
