Implementation:Intel Ipex llm AutoModelForCausalLM From Pretrained QLoRA

Knowledge Sources	IPEX-LLM HuggingFace Transformers
Domains	NLP, Model_Quantization
Last Updated	2026-02-09 00:00 GMT

Overview

A concrete tool, provided by IPEX-LLM, for loading causal language models with 4-bit NF4 quantization ahead of QLoRA fine-tuning on Intel XPU devices.

Description

AutoModelForCausalLM from ipex_llm.transformers is a drop-in replacement for the HuggingFace class of the same name, and its from_pretrained method adds quantization optimized for Intel XPU. For QLoRA, pass a BitsAndBytesConfig with NF4 settings through the quantization_config argument. The model is then loaded in 4-bit precision with bfloat16 as the compute dtype, ready for LoRA adapter injection.
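
To make "4-bit NF4" more concrete, here is a minimal pure-Python sketch of block-wise quantization onto 16 fixed levels. It is illustrative only: the level table is rounded from the values published with QLoRA and bitsandbytes, the helper names quantize_block and dequantize_block are hypothetical, and nothing here is taken from IPEX-LLM's actual kernels, whose block size and storage layout may differ.

```python
# Illustrative only: block-wise absmax quantization onto 16 NF4 levels.
# The table is rounded from the values published with QLoRA/bitsandbytes;
# it is NOT read from IPEX-LLM, whose kernels may use a different layout.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Return (absmax scale, list of 4-bit codes) for one block of floats."""
    scale = max(abs(w) for w in weights) or 1.0  # all-zero block: avoid /0
    codes = [
        # Pick the index of the level closest to the normalized weight.
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / scale))
        for w in weights
    ]
    return scale, codes


def dequantize_block(scale, codes):
    """Map 4-bit codes back to approximate float weights."""
    return [NF4_LEVELS[c] * scale for c in codes]


block = [0.12, -0.40, 0.03, 0.25]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
```

Each weight is stored as an index from 0 to 15 plus one shared scale per block, which is why the quantized model needs roughly a quarter of the memory of a 16-bit copy.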

Usage

Use this when loading a base model for QLoRA fine-tuning on Intel GPUs. The call accepts the same BitsAndBytesConfig object as the standard HuggingFace API, while the quantized model itself is handled by IPEX-LLM's Intel XPU path.
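
Before loading, it can help to confirm that an Intel XPU is actually visible to PyTorch. The sketch below is one way to do that, under the assumption that the installed PyTorch exposes a torch.xpu module (recent PyTorch builds do, and older stacks typically register it when intel_extension_for_pytorch is imported). The helper name pick_device is hypothetical.

```python
def pick_device():
    """Return "xpu" when PyTorch reports an Intel XPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to probe

    # Assumption: XPU support, when present, is exposed as torch.xpu.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


device = pick_device()
```

If the result is "cpu" on a machine with an Intel GPU, the driver or the XPU-enabled PyTorch build is the first thing to check.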

Code Reference

Source Location

Repository: IPEX-LLM
File: python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/alpaca_qlora_finetuning.py
Lines: 177-196

Signature

# BitsAndBytesConfig for NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model with quantization (call shape used for QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    model_id,                        # str: Hub model ID or local path
    torch_dtype=torch.bfloat16,      # torch.dtype
    quantization_config=bnb_config,  # BitsAndBytesConfig
    trust_remote_code=True,          # bool
)  # returns a PreTrainedModel

Import

import torch
from transformers import BitsAndBytesConfig
from ipex_llm.transformers import AutoModelForCausalLM

I/O Contract

Inputs

Name	Type	Required	Description
model_id	str	Yes	HuggingFace model ID or local path (e.g., "meta-llama/Llama-2-7b-hf")
quantization_config	BitsAndBytesConfig	Yes	4-bit NF4 quantization configuration (required for QLoRA, although the parameter itself defaults to None)
torch_dtype	torch.dtype	No	Compute dtype (torch.bfloat16 in the reference script)
trust_remote_code	bool	No	Allow custom model code from HuggingFace Hub

Outputs

Name	Type	Description
model	PreTrainedModel	4-bit NF4 quantized model ready for LoRA adapter injection
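
As a rough guide to what the 4-bit output saves, the arithmetic below estimates weight storage for a quantized model against a bfloat16 copy. It is a back-of-the-envelope sketch: the block size of 64 and the 16-bit per-block scale are assumptions rather than values read from IPEX-LLM, the parameter count is a round illustrative figure, and layers that stay unquantized (such as embeddings) are ignored. The function name nf4_weight_bytes is hypothetical.

```python
def nf4_weight_bytes(n_params, block_size=64, scale_bits=16):
    """Estimate bytes for 4-bit weights plus one scale value per block.

    block_size and scale_bits are assumed values for illustration.
    """
    n_blocks = -(-n_params // block_size)  # ceiling division
    total_bits = n_params * 4 + n_blocks * scale_bits
    return total_bits // 8


n_params = 7_000_000_000  # illustrative round figure for a "7B" model
nf4_gib = nf4_weight_bytes(n_params) / 2**30
bf16_gib = n_params * 2 / 2**30  # 2 bytes per bfloat16 weight

print(f"NF4 weights:  {nf4_gib:.2f} GiB")
print(f"bf16 weights: {bf16_gib:.2f} GiB")
```

Under these assumptions the quantized weights take a little over a quarter of the bfloat16 footprint; activations, LoRA parameters, and optimizer state come on top during fine-tuning.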

Usage Examples

import torch
from transformers import BitsAndBytesConfig, AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

# 1. Configure NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 2. Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    trust_remote_code=True
)

# 3. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
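
The loaded model is described as ready for LoRA adapter injection. A minimal sketch of that follow-on step is given below, assuming that ipex_llm.transformers.qlora exposes LoraConfig, get_peft_model, and prepare_model_for_kbit_training, and that the model is moved to the "xpu" device first. The helper names lora_settings and attach_lora are hypothetical, and the hyperparameter defaults are illustrative rather than recommended values.

```python
def lora_settings(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=("q_proj", "k_proj", "v_proj", "o_proj")):
    """Collect LoRA hyperparameters; the defaults are illustrative, not tuned."""
    return {
        "r": r,
        "lora_alpha": lora_alpha,
        "lora_dropout": lora_dropout,
        "target_modules": list(target_modules),
        "bias": "none",
        "task_type": "CAUSAL_LM",
    }


def attach_lora(model, device="xpu", **overrides):
    """Move a quantized model to the device and wrap it with LoRA adapters."""
    # Imported lazily so this file can be loaded without IPEX-LLM installed.
    # Assumption: ipex_llm.transformers.qlora exposes these three names.
    from ipex_llm.transformers.qlora import (
        LoraConfig,
        get_peft_model,
        prepare_model_for_kbit_training,
    )

    model = model.to(device)
    model = prepare_model_for_kbit_training(model)
    config = LoraConfig(**lora_settings(**overrides))
    return get_peft_model(model, config)


settings = lora_settings(r=16)
```

With the model from step 2, the call would look like `model = attach_lora(model, r=16)`. Keeping the hyperparameters in a plain dictionary makes them easy to log or override per experiment.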
