Implementation:Intel Ipex llm AutoModelForCausalLM From Pretrained QLoRA
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Quantization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for loading language models with 4-bit NF4 quantization for QLoRA fine-tuning on Intel XPU, provided by IPEX-LLM.
Description
AutoModelForCausalLM.from_pretrained from ipex_llm.transformers is a drop-in replacement for HuggingFace's AutoModelForCausalLM.from_pretrained that adds Intel XPU-optimized quantization. For QLoRA, it accepts a BitsAndBytesConfig with NF4 settings: the base weights are loaded in 4-bit NF4 precision with bfloat16 as the compute dtype, leaving the model ready for LoRA adapter injection.
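To make the "nf4" quantization type less abstract, here is a minimal pure-Python sketch of a blockwise NF4 round trip. It is a conceptual illustration only, not IPEX-LLM's kernel: the helper names (nearest_level_index, quantize_block, dequantize_block) are hypothetical, and the 16 code values are the ones published with the QLoRA reference implementation. The actual block size and storage layout used on Intel XPU may differ.

```python
from bisect import bisect_left

# The 16 NF4 code values (ascending), as published with the QLoRA reference code.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def nearest_level_index(x):
    """Return the index of the NF4 level closest to x (x expected in [-1, 1])."""
    i = bisect_left(NF4_LEVELS, x)
    if i == 0:
        return 0
    if i == len(NF4_LEVELS):
        return len(NF4_LEVELS) - 1
    # Pick whichever neighbor is closer to x.
    return i if NF4_LEVELS[i] - x < x - NF4_LEVELS[i - 1] else i - 1


def quantize_block(weights):
    """Scale a block by its absolute maximum and map each value to a 4-bit code."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [nearest_level_index(w / absmax) for w in weights]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [NF4_LEVELS[c] * absmax for c in codes]


# Hypothetical weights for one small block.
block = [0.5, -0.25, 0.0, 0.125]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
```

Each weight is stored as an index into a fixed 16-entry table plus one shared scale per block, which is why the compute dtype (bfloat16 here) only matters after dequantization.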
Usage
Use this when loading a base model for QLoRA fine-tuning on Intel GPUs. The function takes the standard HuggingFace BitsAndBytesConfig, so the configuration interface stays compatible with the HuggingFace API, while the loaded model is optimized for Intel XPU.
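As a rough guide to when 4-bit loading pays off, the sketch below estimates weight storage for bfloat16 versus NF4. The numbers are assumptions for illustration: a nominal 7-billion-parameter model, one 4-byte scale per block of 64 weights (a common layout when double quantization is off), and no accounting for activations, optimizer state, or layers kept in higher precision. The function name estimate_weight_bytes is hypothetical.

```python
GIB = 1024 ** 3


def estimate_weight_bytes(num_params, bits_per_weight, block_size=None, scale_bytes=0):
    """Approximate bytes needed to store the weights.

    If block_size is given, add one scale of scale_bytes per block of weights.
    """
    total = num_params * bits_per_weight / 8
    if block_size:
        total += (num_params / block_size) * scale_bytes
    return total


# Nominal parameter count (assumption, not an exact figure for any checkpoint).
params = 7_000_000_000

bf16_gib = estimate_weight_bytes(params, 16) / GIB
nf4_gib = estimate_weight_bytes(params, 4, block_size=64, scale_bytes=4) / GIB

print(f"bf16 weights: {bf16_gib:.2f} GiB")
print(f"NF4 weights:  {nf4_gib:.2f} GiB")
```

Under these assumptions the quantized weights take a bit more than a quarter of the bfloat16 footprint, which is the headroom QLoRA relies on for fine-tuning on a single GPU.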
Code Reference
Source Location
- Repository: IPEX-LLM
- File: python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/alpaca_qlora_finetuning.py
- Lines: 177-196
Signature
# BitsAndBytesConfig for NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id: str,
    torch_dtype=torch.bfloat16,
    quantization_config: BitsAndBytesConfig = None,
    trust_remote_code: bool = True
) -> PreTrainedModel
Import
import torch
from transformers import BitsAndBytesConfig
from ipex_llm.transformers import AutoModelForCausalLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_id | str | Yes | HuggingFace model ID or local path (e.g., "meta-llama/Llama-2-7b-hf") |
| quantization_config | BitsAndBytesConfig | Yes (for QLoRA) | 4-bit NF4 quantization configuration; the signature defaults to None, but QLoRA loading needs it set |
| torch_dtype | torch.dtype | No | Compute dtype (torch.bfloat16 in the signature above) |
| trust_remote_code | bool | No | Allow custom model code from HuggingFace Hub |
Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | 4-bit NF4 quantized model ready for LoRA adapter injection |
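To give a sense of what "ready for LoRA adapter injection" implies for training cost, the sketch below counts the parameters a LoRA adapter would add. It is back-of-the-envelope arithmetic, not part of the IPEX-LLM API: the function name is hypothetical, and the rank of 8, the four square 4096 x 4096 target projections per layer, and the 32 layers are illustrative assumptions modeled on a Llama-2-7B-sized network.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative assumptions (not read from any checkpoint or config).
hidden_size = 4096
num_layers = 32
targets_per_layer = 4  # e.g. four attention projections, each hidden x hidden
rank = 8

per_module = lora_trainable_params(hidden_size, hidden_size, rank)
total_trainable = num_layers * targets_per_layer * per_module

print(f"per adapted matrix: {per_module:,}")
print(f"total trainable:    {total_trainable:,}")
```

Under these assumptions only a few million adapter parameters are trained, while the 4-bit base weights stay frozen.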
Usage Examples
import torch
from transformers import BitsAndBytesConfig, AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

# 1. Configure NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 2. Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    trust_remote_code=True
)

# 3. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token