Heuristic:PacktPublishing LLM Engineers Handbook LoRA Finetuning Parameters
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Finetuning, Optimization |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
LoRA configuration for Llama 3.1 8B fine-tuning using rank 32, alpha 32, zero dropout, targeting all attention and MLP projection layers.
Description
This heuristic captures the specific LoRA (Low-Rank Adaptation) hyperparameter choices for fine-tuning Llama 3.1 8B. The configuration uses rank 32 with alpha 32 (giving an effective scaling factor of 1.0), zero dropout (relying on the small dataset size and short training for regularization), and targets all seven projection layers in each transformer block rather than just the attention Q/V projections. This broader targeting trades slightly more trainable parameters for better adaptation quality.
Usage
Use this heuristic when configuring LoRA adapters for Llama-family models in the 7B-8B parameter range. The settings are specifically tuned for the SFT (Supervised Fine-Tuning) phase with the Unsloth training framework. DPO training reuses the same LoRA config but with a different learning rate.
The Insight (Rule of Thumb)
- Action: Set LoRA rank and alpha both to 32, dropout to 0.0, and target all projection layers.
- Value:
  - `lora_rank` = 32
  - `lora_alpha` = 32 (scaling factor = alpha/rank = 1.0)
  - `lora_dropout` = 0.0
  - `target_modules` = `["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"]`
- Trade-off: More trainable parameters than the minimal Q/V-only approach (~2x), but significantly better fine-tuning quality. Zero dropout works because training is short (3 epochs) and the dataset is relatively small.
- Optimizer: `adamw_8bit` with weight decay 0.01 and linear learning rate schedule.
- Batch Size: `per_device_train_batch_size=2` with `gradient_accumulation_steps=8` gives effective batch size of 16.
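The two derived quantities in the list above can be checked with a few lines of arithmetic (a minimal sketch; the variable names mirror the config keys but are illustrative):

```python
# Derived quantities from the insight above (names are illustrative).
lora_rank = 32
lora_alpha = 32
scaling_factor = lora_alpha / lora_rank  # LoRA output is scaled by alpha / rank

per_device_train_batch_size = 2
gradient_accumulation_steps = 8
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

print(scaling_factor)       # 1.0 -> adapter contribution neither amplified nor dampened
print(effective_batch_size)  # 16
```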
Reasoning
The choice of rank 32 balances expressiveness against memory cost. Higher ranks (64, 128) marginally improve quality but double or quadruple adapter memory. Setting alpha equal to rank (both 32) means the LoRA contribution is scaled by 1.0 and is not amplified or dampened. Targeting all projection layers (not just Q/V) is standard practice for instruction fine-tuning where the model needs to adapt its entire generation distribution, not just attention patterns. The `adamw_8bit` optimizer from bitsandbytes reduces optimizer state memory by 75% with negligible quality loss, which is critical when fine-tuning on a single GPU.
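The memory claims above can be made concrete with a back-of-envelope count. Each LoRA adapter on a `d_in x d_out` linear layer adds `r * (d_in + d_out)` trainable weights. The layer shapes below are the published Llama 3.1 8B dimensions (hidden size 4096, GQA key/value width 1024, MLP width 14336, 32 blocks); treat them as assumptions if adapting this to another model.

```python
# Back-of-envelope count of trainable LoRA parameters for Llama 3.1 8B.
# Dimensions assumed from the published Llama 3.1 8B architecture.
r = 32
hidden, kv, mlp, n_layers = 4096, 1024, 14336, 32

# (in_features, out_features) for each targeted projection in one block
projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv),
    "v_proj": (hidden, kv),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

# Each LoRA adapter on a d_in x d_out linear adds r * (d_in + d_out) weights
per_block = sum(r * (d_in + d_out) for d_in, d_out in projections.values())
total = per_block * n_layers
print(f"{total:,} trainable params (~{total / 8e9:.1%} of the 8B base)")
# -> 83,886,080 trainable params (~1.0% of the 8B base)

# AdamW keeps two state tensors per parameter: fp32 states cost 8 bytes/param,
# 8-bit states cost 2 bytes/param -- the 75% saving mentioned above.
print(f"optimizer state: {total * 8 / 2**20:.0f} MB fp32 "
      f"vs {total * 2 / 2**20:.0f} MB 8-bit")
# -> optimizer state: 640 MB fp32 vs 160 MB 8-bit
```

At roughly 1% of the base model's parameters, the adapter and its 8-bit optimizer state fit comfortably alongside a quantized base model on a single consumer GPU.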
Training configuration from `llm_engineering/model/finetuning/finetune.py:126-142`:
```python
learning_rate=learning_rate,  # 3e-4 for SFT
num_train_epochs=num_train_epochs,  # 3
per_device_train_batch_size=per_device_train_batch_size,  # 2
gradient_accumulation_steps=gradient_accumulation_steps,  # 8
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
per_device_eval_batch_size=per_device_train_batch_size,
warmup_steps=10,
```
LoRA configuration from `llm_engineering/model/finetuning/finetune.py:68-71`:
```python
lora_rank: int = 32
lora_alpha: int = 32
lora_dropout: float = 0.0
target_modules: List[str] = ["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"]
```
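For reference, the same settings map onto Hugging Face PEFT's `LoraConfig` (a sketch, assuming `peft` is installed; the handbook itself passes these keywords to Unsloth's `FastLanguageModel.get_peft_model`, where the rank parameter is named `r` rather than `lora_rank`):

```python
# Sketch only: equivalent Hugging Face PEFT configuration.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=32,          # scaling = alpha / r = 1.0
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj",
                    "down_proj", "o_proj", "gate_proj"],
    bias="none",            # assumption: biases left frozen, the common default
    task_type="CAUSAL_LM",
)
```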
Related Pages
- Implementation:PacktPublishing_LLM_Engineers_Handbook_FastLanguageModel_Get_Peft_Model
- Implementation:PacktPublishing_LLM_Engineers_Handbook_SFTTrainer_Train
- Principle:PacktPublishing_LLM_Engineers_Handbook_LoRA_Adapter_Injection
- Principle:PacktPublishing_LLM_Engineers_Handbook_Supervised_Finetuning