Heuristic:Intel Ipex llm QLoRA Training Hyperparameters
| Knowledge Sources | |
|---|---|
| Domains | LLM_Finetuning, Optimization |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Recommended hyperparameter values for stable QLoRA/LoRA finetuning on Intel XPU: learning_rate=3e-5, max_grad_norm=0.3, micro_batch_size=2, bf16=True, cosine scheduler.
Description
QLoRA and LoRA finetuning on Intel XPU hardware needs conservative hyperparameters to avoid training divergence while still reaching good model quality within GPU memory limits. The IPEX-LLM examples encode empirically validated defaults that balance training stability, convergence speed, and memory usage. These defaults are more conservative than values commonly used for LoRA on CUDA (for example, a learning rate of 2e-4 and gradient clipping at 1.0), reflecting the noise of 4-bit quantization together with the memory limits and bf16 compute support of Intel GPUs.
Usage
Use this heuristic when configuring `TrainingArguments` for any QLoRA or LoRA finetuning run on Intel XPU. Treat these defaults as starting points and adjust them for the model size and dataset at hand.
The Insight (Rule of Thumb)
- Learning Rate: `learning_rate=3e-5` to avoid divergence. Higher rates (e.g., 1e-4) can destabilize training on 4-bit quantized models.
- Gradient Clipping: `max_grad_norm=0.3` (aggressive compared to typical 1.0). Tighter clipping compensates for noise introduced by 4-bit quantization.
- Micro Batch Size: `micro_batch_size=2` per device, limited by GPU memory. Use gradient accumulation to reach effective batch size of 128.
- Precision: `bf16=True` ensures training stability. BF16 maintains FP32 dynamic range with reduced memory.
- LR Scheduler: `lr_scheduler_type="cosine"` for smooth decay to near-zero.
- Optimizer: `optim="adamw_torch"` (`paged_adamw_8bit` is not yet supported on Intel platforms).
- Padding: `pad_to_multiple_of=8` for GPU memory alignment and kernel efficiency.
- Eval/Save: Evaluate and save a checkpoint every 100 steps. `save_total_limit=100` caps the number of retained checkpoints so disk usage stays bounded.
- Trade-off: Conservative hyperparameters prioritize stability over speed. Training may be slower but more reliable.
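To make the effect of `max_grad_norm=0.3` concrete, here is a minimal pure-Python sketch of clipping by global L2 norm, the operation that gradient clipping in PyTorch-based trainers performs. The function name `clip_by_global_norm` and the gradient values are made up for illustration; a real run operates on tensors, not lists.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads is a list of flat gradient vectors (one per parameter).
    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale factor is capped at 1.0, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


# Hypothetical gradients for two parameters (illustrative values only).
grads = [[0.6, -0.8], [1.2, 0.0]]

clipped_tight, norm = clip_by_global_norm(grads, max_norm=0.3)
clipped_loose, _ = clip_by_global_norm(grads, max_norm=1.0)

print(f"norm before clipping: {norm:.4f}")
print("clipped at 0.3:", clipped_tight)
print("clipped at 1.0:", clipped_loose)
```

With the same raw gradients, the 0.3 threshold yields an update whose norm is roughly 30% of the one allowed by the usual 1.0 threshold, so a single noisy step can move the weights less.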
Reasoning
Low-bit quantized models introduce noise in gradient computation. The aggressive gradient clipping (0.3 vs standard 1.0) prevents gradient explosions from quantization noise. The low learning rate (3e-5 vs common 2e-4 for LoRA) prevents the model from overshooting due to noisy gradients. BF16 compute maintains the wide dynamic range of FP32 (8 exponent bits) while using half the memory, which is essential for stable training with quantized weights. The micro batch size of 2 is the practical maximum for 7B parameter models with 4-bit quantization on typical Intel GPU VRAM.
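The dynamic-range argument can be checked with a little arithmetic. The sketch below computes the largest finite value of each format from its exponent and mantissa bit counts, using the standard bit layouts (FP32: 8/23, BF16: 8/7, FP16: 5/10). The helper name `max_finite` is introduced here for illustration only.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent code is reserved for inf/NaN, hence the "- 2".
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


# (exponent bits, mantissa bits) for each format
FORMATS = {"fp32": (8, 23), "bf16": (8, 7), "fp16": (5, 10)}

for name, (e_bits, m_bits) in FORMATS.items():
    print(f"{name}: max finite ~ {max_finite(e_bits, m_bits):.4g}, "
          f"relative precision ~ 2^-{m_bits}")
```

BF16 reaches almost the same maximum as FP32 because both use 8 exponent bits, whereas FP16 tops out at 65504. The cost is precision: BF16 keeps only 7 mantissa bits, fewer than FP16's 10.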
Code Evidence
Hyperparameter defaults from `alpaca_qlora_finetuning.py:76-81`:
```python
bf16: bool = True,  # default to bf16
batch_size: int = 128,
micro_batch_size: int = 2,  # default to be 2, limited by GPU memory
num_epochs: int = 3,
learning_rate: float = 3e-5,  # default to be 3e-5 to avoid divergence
cutoff_len: int = 256,
```
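The relationship between `batch_size` and `micro_batch_size` comes down to simple arithmetic. The following is a sketch of how the gradient accumulation step count could be derived from those two defaults; `accumulation_steps` and the `world_size` parameter (number of devices training in parallel) are hypothetical names, and the actual script may compute this differently.

```python
def accumulation_steps(batch_size, micro_batch_size, world_size=1):
    """Number of micro-batches to accumulate before each optimizer step.

    effective batch = micro_batch_size * accumulation steps * world_size
    """
    per_step = micro_batch_size * world_size
    if batch_size % per_step != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by "
            f"micro_batch_size * world_size = {per_step}"
        )
    return batch_size // per_step


# Defaults from the excerpt above, on a single device.
steps_single = accumulation_steps(batch_size=128, micro_batch_size=2)
print("single device:", steps_single)

# Same effective batch spread over 4 devices (hypothetical setup).
steps_multi = accumulation_steps(batch_size=128, micro_batch_size=2, world_size=4)
print("four devices:", steps_multi)
```

On one device the defaults imply 64 accumulated micro-batches per optimizer step, so each weight update still sees 128 examples even though only 2 fit in memory at a time.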
TrainingArguments from `alpaca_qlora_finetuning.py:244-270`:
```python
args=transformers.TrainingArguments(
    per_device_train_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    max_grad_norm=0.3,
    num_train_epochs=num_epochs,
    learning_rate=learning_rate,
    lr_scheduler_type="cosine",
    bf16=True,  # ensure training more stable
    logging_steps=1,
    optim="adamw_torch",
    save_safetensors=False,
)
```
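For intuition about what `lr_scheduler_type="cosine"` does to the learning rate, here is a minimal sketch of a half-cosine decay with optional linear warmup. It follows the commonly used formula rather than reproducing the Trainer's code; `cosine_lr`, `total_steps`, and `warmup_steps` are illustrative names, and the step counts are made up.

```python
import math


def cosine_lr(step, total_steps, base_lr=3e-5, warmup_steps=0):
    """Learning rate at a given step under half-cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Factor goes from 1.0 at progress=0 down to 0.0 at progress=1.
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total = 1000  # hypothetical number of optimizer steps
for s in (0, 250, 500, 750, 1000):
    print(f"step {s:4d}: lr = {cosine_lr(s, total):.3e}")
```

The rate starts at the full 3e-5, passes through half that value at the midpoint, and approaches zero at the final step, which is the smooth decay the rule of thumb refers to.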
Padding alignment from `alpaca_qlora_finetuning.py:272-274`:
```python
data_collator=transformers.DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
)
```
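The effect of `pad_to_multiple_of=8` is easiest to see on sequence lengths. The sketch below rounds the longest sequence in a batch up to the next multiple of 8 and right-pads every sequence to that length. The helpers `padded_length` and `pad_batch` and the `pad_id=0` value are assumptions for illustration; the real collator uses the tokenizer's pad token and padding side and also pads the labels.

```python
def padded_length(seq_len, multiple=8):
    """Round seq_len up to the nearest multiple."""
    return ((seq_len + multiple - 1) // multiple) * multiple


def pad_batch(batch, pad_id=0, multiple=8):
    """Right-pad every token list to a shared length that is a multiple of `multiple`."""
    target = padded_length(max(len(ids) for ids in batch), multiple)
    return [ids + [pad_id] * (target - len(ids)) for ids in batch]


# Two hypothetical tokenized examples of length 3 and 13.
batch = [[101, 102, 103], list(range(1, 14))]
padded = pad_batch(batch)

print([len(ids) for ids in batch], "->", [len(ids) for ids in padded])
print("length 250 pads to", padded_length(250))
```

Lengths that are already a multiple of 8 are left unchanged, so the overhead is at most 7 padding tokens per batch on top of padding to the longest sequence.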
DPO-specific differences from `dpo_finetuning.py:146-163`:
```python
training_args = DPOConfig(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
    max_steps=200,
    optim="adamw_hf",
    # optim="paged_adamw_32bit",  # "paged_adamw_32bit" is not supported yet
)
```
Related Pages
- Implementation:Intel_Ipex_llm_Transformers_Trainer_QLoRA
- Implementation:Intel_Ipex_llm_Transformers_Trainer_LoRA
- Implementation:Intel_Ipex_llm_DPOTrainer_Usage
- Principle:Intel_Ipex_llm_Training_With_HF_Trainer_QLoRA
- Principle:Intel_Ipex_llm_Training_With_HF_Trainer_LoRA
- Principle:Intel_Ipex_llm_DPO_Training