Heuristic:Intel Ipex llm QLoRA Training Hyperparameters

From Leeroopedia




Knowledge Sources
Domains LLM_Finetuning, Optimization
Last Updated 2026-02-09 12:00 GMT

Overview

Recommended hyperparameter values for stable QLoRA/LoRA finetuning on Intel XPU: learning_rate=3e-5, max_grad_norm=0.3, micro_batch_size=2, bf16=True, cosine scheduler.

Description

QLoRA and LoRA finetuning on Intel XPU hardware requires carefully tuned hyperparameters to avoid training divergence while maximizing model quality within GPU memory constraints. The IPEX-LLM examples encode empirically validated defaults that balance training stability, convergence speed, and memory usage. These defaults differ from typical CUDA-based training recommendations due to the specifics of Intel GPU memory architecture and bf16 compute capabilities.

Usage

Use this heuristic when configuring TrainingArguments for any QLoRA or LoRA finetuning run on Intel XPU. Apply these defaults as starting points and adjust based on specific model size and dataset characteristics.

The Insight (Rule of Thumb)

  • Learning Rate: `learning_rate=3e-5`, chosen to avoid divergence. Higher rates (e.g., 1e-4, or the 2e-4 often used for LoRA) can destabilize training on 4-bit quantized models.
  • Gradient Clipping: `max_grad_norm=0.3` (aggressive compared to typical 1.0). Tighter clipping compensates for noise introduced by 4-bit quantization.
  • Micro Batch Size: `micro_batch_size=2` per device, limited by GPU memory. Use gradient accumulation to reach an effective batch size of 128 (`batch_size=128`), which is 64 accumulation steps on a single device.
  • Precision: `bf16=True` for more stable training. BF16 keeps the FP32 dynamic range at half the storage per value.
  • LR Scheduler: `lr_scheduler_type="cosine"` for smooth decay to near-zero.
  • Optimizer: `optim="adamw_torch"` (`paged_adamw_8bit` is not yet supported on Intel platforms).
  • Padding: `pad_to_multiple_of=8` for GPU memory alignment and kernel efficiency.
  • Eval/Save: Evaluate and save a checkpoint every 100 steps. `save_total_limit=100` caps the number of retained checkpoints so disk usage stays bounded.
  • Trade-off: Conservative hyperparameters prioritize stability over speed. Training may be slower but more reliable.
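
The gradient-clipping bullet above can be illustrated with a minimal pure-Python sketch of clipping by global L2 norm, the same idea `max_grad_norm` applies during training. The function name `clip_by_global_norm` and the toy gradient values are hypothetical, chosen only for illustration; this is not the IPEX-LLM or PyTorch implementation.

```python
import math

def clip_by_global_norm(grads, max_norm=0.3, eps=1e-6):
    """Scale every gradient so the combined L2 norm is at most max_norm.

    `grads` is a list of flat gradient vectors (one per parameter tensor).
    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

# Toy example (hypothetical values): global norm is 5.0, far above 0.3.
grads = [[3.0, 4.0], [0.0, 0.0]]
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

With a threshold of 0.3 instead of 1.0, a noisy spike is scaled down more strongly, while the direction of the update is preserved.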

Reasoning

Low-bit quantized models introduce noise into gradient computation. Aggressive gradient clipping (0.3 vs. the usual 1.0) bounds the size of any single update, so a noisy gradient spike is less likely to blow up training. The low learning rate (3e-5 vs. the 2e-4 common for LoRA) keeps the model from overshooting on noisy gradients. BF16 keeps the wide dynamic range of FP32 (8 exponent bits) at half the storage per value, which helps keep training stable with quantized weights. A micro batch size of 2 is the practical maximum for 7B-parameter models with 4-bit quantization on typical Intel GPU VRAM.
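
The dynamic-range argument can be checked with a small standard-library sketch. It emulates bfloat16 by keeping only the top 16 bits of a float32 encoding (1 sign bit, 8 exponent bits, 7 mantissa bits). Truncation is a simplifying assumption made here; real hardware typically rounds to nearest. The helper name `to_bf16` is hypothetical.

```python
import struct

def to_bf16(x: float) -> float:
    """Round-trip x through an emulated bfloat16 (truncating, not rounding)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # float32 bit pattern
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

FP16_MAX = 65504.0  # largest finite IEEE half-precision value

big = 1.0e20
print(big > FP16_MAX)          # True: this magnitude overflows FP16
print(to_bf16(big))            # finite, within about 1% of 1e20
print(to_bf16(1.0 + 2**-10))   # 1.0: BF16 trades mantissa precision for range
```

The sketch shows the trade: BF16 represents magnitudes that FP16 cannot, at the cost of coarser precision near any given value.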

Code Evidence

Hyperparameter defaults from `alpaca_qlora_finetuning.py:76-81`:

bf16: bool = True,  # default to bf16
batch_size: int = 128,
micro_batch_size: int = 2,  # default to be 2, limited by GPU memory
num_epochs: int = 3,
learning_rate: float = 3e-5,  # default to be 3e-5 to avoid divergence
cutoff_len: int = 256,
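
The relationship between `batch_size` and `micro_batch_size` can be sketched as below. The cited lines do not show how `gradient_accumulation_steps` is derived, so the integer division and the division by the number of devices are assumptions based on the common Alpaca-LoRA pattern; the names `accumulation_steps` and `world_size` are hypothetical.

```python
def accumulation_steps(batch_size: int, micro_batch_size: int, world_size: int = 1) -> int:
    """Steps to accumulate so that micro batches add up to the target batch size.

    Assumption: each of `world_size` devices processes `micro_batch_size`
    samples per forward/backward pass.
    """
    steps = batch_size // micro_batch_size
    return max(1, steps // world_size)

single_device = accumulation_steps(128, 2)      # 64
two_devices = accumulation_steps(128, 2, 2)     # 32
print(single_device, two_devices)
```

On one device, 64 accumulated micro batches of 2 samples give the effective batch size of 128.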

TrainingArguments from `alpaca_qlora_finetuning.py:244-270`:

args=transformers.TrainingArguments(
    per_device_train_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    max_grad_norm=0.3,
    num_train_epochs=num_epochs,
    learning_rate=learning_rate,
    lr_scheduler_type="cosine",
    bf16=True,  # ensure training more stable
    logging_steps=1,
    optim="adamw_torch",
    save_safetensors=False,
)
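
To show what `lr_scheduler_type="cosine"` does to the learning rate over a run, here is a minimal sketch of a half-cosine decay with optional linear warmup. The excerpt above does not show a warmup setting, so `warmup_steps` is a hypothetical parameter that defaults to 0, and `total_steps=1000` is an arbitrary illustrative value.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 3e-5, warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few points of a hypothetical 1000-step run.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches approximately zero at the final step.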

Padding alignment from `alpaca_qlora_finetuning.py:272-274`:

data_collator=transformers.DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
)
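
The effect of `pad_to_multiple_of=8` can be sketched in plain Python: the batch is padded to its longest sequence, rounded up to the next multiple of 8. This is an illustration only, not the collator's implementation; `pad_id=0` is a hypothetical placeholder for the tokenizer's pad token, and label padding and padding side are ignored here.

```python
def padded_length(longest: int, multiple: int = 8) -> int:
    """Round `longest` up to the nearest multiple of `multiple`."""
    return ((longest + multiple - 1) // multiple) * multiple

def pad_batch(batch, pad_id=0, multiple=8):
    """Right-pad every sequence to a shared, aligned length."""
    target = padded_length(max(len(seq) for seq in batch), multiple)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]

batch = [[1, 2, 3], [4] * 13]          # longest sequence has 13 tokens
padded = pad_batch(batch)
print([len(seq) for seq in padded])    # [16, 16]
```

A batch whose longest sequence is already a multiple of 8 is left at that length.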

DPO-specific differences from `dpo_finetuning.py:146-163` (larger per-device batch size, higher learning rate, and `adamw_hf` instead of `adamw_torch`):

training_args = DPOConfig(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
    max_steps=200,
    optim="adamw_hf",
    # optim="paged_adamw_32bit", # "paged_adamw_32bit" is not supported yet
)
