Heuristic:Intel Ipex llm LoRA Target All Linear Layers
| Knowledge Sources | |
|---|---|
| Domains | LLM_Finetuning, Optimization |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Target all linear layers (q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj) for LoRA adaptation instead of just attention projections, as recommended by the QLoRA paper.
Description
Conventional LoRA setups often adapt only the query and value projection matrices (q_proj, v_proj) in the attention layers. The QLoRA paper reports that targeting all linear layers, in both the attention mechanism and the feed-forward network, yields better adaptation quality. The IPEX-LLM examples apply this recommendation by default, targeting 7 linear layers per transformer block in Llama-family models.
Usage
Use this heuristic when configuring LoraConfig for any QLoRA, LoRA, or DPO finetuning run. The default target modules in IPEX-LLM examples already implement this recommendation. Adjust the module names if using non-Llama architectures (e.g., ChatGLM has different projection names).
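For architectures whose projection names differ from Llama's, one way to find candidate `target_modules` is to list the leaf names of every linear-like submodule. The sketch below is illustrative only and is not taken from the IPEX-LLM examples: `find_linear_module_names` is a hypothetical helper, matching on the class name containing "Linear" is an assumed heuristic, and the stand-in classes exist only so the snippet runs without a deep-learning framework.

```python
from typing import Iterable, List, Tuple


def find_linear_module_names(
    named_modules: Iterable[Tuple[str, object]],
    exclude: Tuple[str, ...] = ("lm_head",),
) -> List[str]:
    """Collect leaf names of linear-like submodules for use as target_modules."""
    names = set()
    for full_name, module in named_modules:
        # Assumed heuristic: any class whose name contains "Linear" counts as
        # a linear layer, so quantized linear variants are matched as well.
        if "Linear" not in type(module).__name__:
            continue
        leaf = full_name.split(".")[-1]
        # The output head is usually left out of LoRA adaptation.
        if leaf not in exclude:
            names.add(leaf)
    return sorted(names)


# Stand-in classes (hypothetical) that mimic a Llama-style block layout.
class Linear:
    pass


class RMSNorm:
    pass


demo_modules = [
    ("model.layers.0.self_attn.q_proj", Linear()),
    ("model.layers.0.self_attn.k_proj", Linear()),
    ("model.layers.0.self_attn.v_proj", Linear()),
    ("model.layers.0.self_attn.o_proj", Linear()),
    ("model.layers.0.mlp.gate_proj", Linear()),
    ("model.layers.0.mlp.up_proj", Linear()),
    ("model.layers.0.mlp.down_proj", Linear()),
    ("model.layers.0.input_layernorm", RMSNorm()),
    ("lm_head", Linear()),
]

target_modules = find_linear_module_names(demo_modules)
print(target_modules)
```

With a real PyTorch model, the same function could be called as `find_linear_module_names(model.named_modules())`. Check the resulting list against the architecture before passing it to `LoraConfig`.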
The Insight (Rule of Thumb)
- Action: Set `target_modules` to include all 7 linear layers for Llama-family models:
  - Attention: `q_proj`, `k_proj`, `v_proj`, `o_proj`
  - Feed-Forward: `up_proj`, `down_proj`, `gate_proj`
- Value: Default `lora_r=8`, `lora_alpha=16`, `lora_dropout=0.05`. The DPO example shown under Code Evidence uses `r=16` with the same alpha and dropout.
- Trade-off: More target modules mean more trainable parameters (still small relative to the total) and slightly higher memory use, in exchange for the better adaptation quality reported in the QLoRA paper.
Reasoning
The QLoRA paper found that adapting all linear layers allows the LoRA adapter to capture changes across the full computation path of each transformer block. Adapting only the q/v projections limits the adapter's expressiveness: it can modify attention patterns but not the feed-forward transformation. In Llama-family models, the feed-forward network (up_proj, down_proj, gate_proj) processes the attention output and accounts for a significant portion of the model's computation. With `lora_r=8` across 7 modules, the total trainable parameter count remains under 1% of the full model.
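The parameter budget can be checked with a little arithmetic. A rank-`r` adapter on a linear layer of shape `(d_in, d_out)` adds `r * (d_in + d_out)` trainable weights. The sketch below assumes Llama-2-7B-like dimensions (hidden size 4096, feed-forward size 11008, 32 blocks, about 6.74 billion base parameters, no grouped-query attention). These figures are assumptions for illustration and do not come from the IPEX-LLM examples.

```python
from typing import Dict, Iterable, Tuple

Shape = Tuple[int, int]  # (d_in, d_out) of a linear layer


def lora_params(shapes: Iterable[Shape], r: int) -> int:
    """Trainable weights added by rank-r LoRA: A is (r x d_in), B is (d_out x r)."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)


# Assumed Llama-2-7B-like dimensions (illustrative, not read from a checkpoint).
HIDDEN, INTERMEDIATE, NUM_LAYERS = 4096, 11008, 32
BASE_PARAMS = 6.74e9  # approximate total size of the base model (assumption)
LORA_R = 8

block: Dict[str, Shape] = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# All 7 linear layers versus only the q/v projections, summed over all blocks.
all_linear = NUM_LAYERS * lora_params(block.values(), LORA_R)
qv_only = NUM_LAYERS * lora_params([block["q_proj"], block["v_proj"]], LORA_R)

print(f"all linear layers: {all_linear:,} ({all_linear / BASE_PARAMS:.2%} of base)")
print(f"q_proj/v_proj only: {qv_only:,} ({qv_only / BASE_PARAMS:.2%} of base)")
```

Under these assumptions, targeting all 7 modules gives about 20 million trainable parameters, roughly 0.3% of the base model, compared with about 4 million for q/v only.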
Code Evidence
Full target modules from `alpaca_qlora_finetuning.py:87-95`:
lora_target_modules: List[str] = [
    "q_proj",
    "v_proj",
    "k_proj",
    "o_proj",
    "up_proj",
    "down_proj",
    "gate_proj"
],  # according to the QLoRA paper (https://arxiv.org/pdf/2305.14314.pdf), it's suggested to fine tune all linear layers
Same pattern in DPO from `dpo_finetuning.py:105-112`:
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
LoRA config creation from `alpaca_qlora_finetuning.py:212-220`:
config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    target_modules=lora_target_modules,
    lora_dropout=lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
    training_mode=training_mode,
)