Implementation:Intel Ipex llm AutoModelForCausalLM From Pretrained bf16
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Loading |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for loading language models in bfloat16 precision for LoRA fine-tuning on Intel XPU, provided by IPEX-LLM.
Description
Calling AutoModelForCausalLM.from_pretrained from ipex_llm.transformers with load_in_low_bit="bf16" loads the model weights in bfloat16 precision without quantizing them. Setting optimize_model=False disables the inference-only XPU optimizations, which would otherwise interfere with training. Passing modules_to_not_convert=["lm_head"] keeps the language-model head out of any low-bit conversion.
Usage
Use this when loading a base model for standard LoRA fine-tuning (not QLoRA) on Intel GPUs that have enough memory to hold the full bf16 weights.
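As a rough sizing aid, bf16 stores each parameter in 2 bytes, so the base weights alone need about twice the parameter count in bytes. The sketch below is a back-of-the-envelope estimate only. The function name is hypothetical, and it deliberately ignores activations, LoRA adapter gradients, optimizer state, and runtime overhead, all of which add to the real footprint.

```python
BF16_BYTES_PER_PARAM = 2  # bfloat16 is a 16-bit (2-byte) format


def bf16_weight_gib(num_params: float) -> float:
    """Estimate the memory (GiB) needed just to hold model weights in bf16.

    Hypothetical helper: excludes activations, adapter gradients,
    optimizer state, and runtime overhead.
    """
    return num_params * BF16_BYTES_PER_PARAM / 2**30


if __name__ == "__main__":
    # A nominal 7B-parameter model needs roughly 13 GiB for its weights alone.
    print(f"{bf16_weight_gib(7e9):.1f} GiB")  # prints 13.0 GiB
```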
Code Reference
Source Location
- Repository: IPEX-LLM
- File: python/llm/example/GPU/LLM-Finetuning/LoRA/alpaca_lora_finetuning.py
- Lines: 161-178
Signature
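```python
# Values shown are those used for bf16 LoRA loading, not necessarily library defaults.
model = AutoModelForCausalLM.from_pretrained(
    model_id,                            # str: Hugging Face model ID or local path
    load_in_low_bit="bf16",              # str
    optimize_model=False,                # bool
    torch_dtype=torch.bfloat16,          # torch.dtype
    modules_to_not_convert=["lm_head"],  # List[str]
    trust_remote_code=True,              # bool
)  # returns a transformers PreTrainedModel
```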
Import
```python
from ipex_llm.transformers import AutoModelForCausalLM
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_id | str | Yes | HuggingFace model ID or local path |
| load_in_low_bit | str | Yes | Set to "bf16" for bfloat16 precision |
| optimize_model | bool | Yes | Must be False for training |
| torch_dtype | torch.dtype | No | Dtype of the loaded model (torch.bfloat16) |
| modules_to_not_convert | List[str] | No | Layers to exclude from low-bit conversion (["lm_head"] in this recipe) |
| trust_remote_code | bool | No | Allow execution of custom modeling code shipped with the model |
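The inputs above amount to a fixed set of keyword arguments. The sketch below collects them in one dictionary and checks a candidate set against the training constraints in the table. LORA_BF16_KWARGS and check_lora_kwargs are hypothetical names, not part of IPEX-LLM; torch_dtype is left out so the sketch runs without PyTorch, and the check covers only the constraints stated in the table.

```python
from typing import Any, Dict, List

# Hypothetical bundle of the keyword arguments listed in the Inputs table.
LORA_BF16_KWARGS: Dict[str, Any] = {
    "load_in_low_bit": "bf16",
    "optimize_model": False,
    "modules_to_not_convert": ["lm_head"],
    "trust_remote_code": True,
}


def check_lora_kwargs(kwargs: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the kwargs fit the table."""
    problems = []
    if kwargs.get("load_in_low_bit") != "bf16":
        problems.append('load_in_low_bit should be "bf16"')
    # A missing flag counts as a failure, since the table requires it explicitly.
    if kwargs.get("optimize_model") is not False:
        problems.append("optimize_model must be False for training")
    if "lm_head" not in (kwargs.get("modules_to_not_convert") or []):
        problems.append("lm_head should be listed in modules_to_not_convert")
    return problems
```

With the check passing, the bundle could be unpacked into the real call, for example AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, **LORA_BF16_KWARGS).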
Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | bf16-precision model ready for LoRA adapter injection |
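Because the bf16 base weights stay frozen during LoRA training, only the injected low-rank matrices are trained. For a frozen weight of shape d × k, a rank-r adapter adds r(d + k) parameters (B is d × r, A is r × k). The sketch below applies that formula. The rank (8), the target modules (query and value projections), and the Llama-2-7B-like shapes (hidden size 4096, 32 layers) are illustrative assumptions, not values taken from the IPEX-LLM example script.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one rank-r adapter on a d x k weight: B (d x r) + A (r x k)."""
    return r * (d + k)


def total_lora_params(hidden: int, layers: int, modules_per_layer: int, r: int) -> int:
    """Assumes every targeted module is a square hidden x hidden projection."""
    return layers * modules_per_layer * lora_params(hidden, hidden, r)


if __name__ == "__main__":
    # Assumed shapes: 32 layers, hidden size 4096, two adapted projections per layer, rank 8.
    print(total_lora_params(hidden=4096, layers=32, modules_per_layer=2, r=8))  # prints 4194304
```

Under these assumptions only about 4.2 million parameters are trainable, a small fraction of the frozen bf16 base.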
Usage Examples
```python
import os

import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load the model in bf16 precision (no quantization, training-safe settings)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_low_bit="bf16",
    optimize_model=False,
    torch_dtype=torch.bfloat16,
    modules_to_not_convert=["lm_head"],
    trust_remote_code=True,
)

# Move the model to the XPU assigned to this process (LOCAL_RANK, default 0)
model = model.to(f'xpu:{os.environ.get("LOCAL_RANK", 0)}')

# Load the tokenizer; Llama-2 has no pad token, so reuse EOS for padding
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```
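The .to(...) call in the example places each process on the XPU whose index matches its LOCAL_RANK, the per-node process index that distributed launchers such as torchrun set; a single-process run without that variable falls back to xpu:0. The helper below isolates that string logic so it can be checked without an Intel GPU. xpu_device_for_process is a hypothetical name, and unlike the original one-liner it converts the value with int(), so a malformed LOCAL_RANK fails loudly.

```python
import os
from typing import Mapping, Optional


def xpu_device_for_process(env: Optional[Mapping[str, str]] = None) -> str:
    """Build the XPU device string the example passes to model.to()."""
    env = os.environ if env is None else env
    # LOCAL_RANK is set per process by launchers such as torchrun; default to 0.
    return f"xpu:{int(env.get('LOCAL_RANK', 0))}"
```

In the training script this would replace the inline f-string: model = model.to(xpu_device_for_process()).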