Implementation: Intel IPEX-LLM NPU Save/Load
| Knowledge Sources | |
|---|---|
| Domains | Model_Serialization, NPU, Quantization |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for saving and loading low-bit quantized models for NPU inference using IPEX-LLM's save_low_bit and load_low_bit APIs.
Description
This script demonstrates the model save/load workflow for NPU deployment. It either converts a HuggingFace model to low-bit format and saves it (save_low_bit), or loads a previously saved model (load_low_bit) for fast startup. The script then runs three timed inference iterations to benchmark the loaded model's performance; the quantization scheme is configurable.
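The timed-iteration loop described above can be sketched as follows. This is an illustrative helper, not code from the script: the `benchmark` name and structure are ours, and the real script times `model.generate` calls directly.

```python
import time

def benchmark(generate_fn, iterations=3):
    """Run generate_fn repeatedly and return per-iteration latencies in seconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        generate_fn()  # in the real script: tokenize, model.generate, decode
        latencies.append(time.perf_counter() - start)
    return latencies

# Stand-in workload; the actual script wraps the NPU inference call here.
timings = benchmark(lambda: sum(range(10_000)))
print([f"{t:.4f}s" for t in timings])
```

The per-iteration latencies correspond to the "Timing metrics" output listed in the I/O contract.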
Usage
Use this to pre-convert models for repeated NPU inference, avoiding the quantization overhead on subsequent runs. The save/load pattern is essential for production deployments where startup time matters.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/NPU/HF-Transformers-AutoModels/Save-Load/generate.py
- Lines: 1-106
Signature
# Key API:
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
# Save path:
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit=low_bit, ...)
model.save_low_bit(save_path)
# Load path:
model = AutoModelForCausalLM.load_low_bit(save_path, ...)
Import
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer
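The save-or-load decision shown in the signature can be expressed compactly as below. This is a hedged sketch: `load_or_convert` is our own helper name, not part of the script, and the model class is passed in explicitly so the same dispatch logic applies to `ipex_llm.transformers.npu_model.AutoModelForCausalLM` without hard-coding it.

```python
def load_or_convert(automodel_cls, model_path=None,
                    save_path=None, load_path=None, low_bit="sym_int4"):
    """Load a previously saved low-bit model if load_path is given;
    otherwise convert a HuggingFace checkpoint and optionally save it."""
    if load_path:
        # Fast path: skip quantization entirely and reload converted weights.
        return automodel_cls.load_low_bit(load_path)
    # Slow path: quantize the original checkpoint on the fly.
    model = automodel_cls.from_pretrained(model_path, load_in_low_bit=low_bit)
    if save_path:
        model.save_low_bit(save_path)  # persist for fast startup next time
    return model
```

In the example script this pattern would be driven by the parsed CLI arguments, e.g. `load_or_convert(AutoModelForCausalLM, model_path=args.repo_id_or_model_path, ...)`.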
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo-id-or-model-path | str | Yes | HuggingFace model ID or local path |
| save-path | str | No | Directory to save converted model |
| load-path | str | No | Path to load previously saved model |
| low-bit | str | No | Quantization type (default: sym_int4) |
| prompt | str | No | Input prompt for inference verification |
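The inputs above map directly onto a command-line parser. A minimal sketch, with flag names taken from the table; the default prompt and the absence of `required=True` are assumptions, not confirmed from the script:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Save or load a low-bit quantized model for NPU inference")
    # Listed as required in the I/O contract; the real script may supply a default.
    parser.add_argument("--repo-id-or-model-path", type=str,
                        help="HuggingFace model ID or local path")
    parser.add_argument("--save-path", type=str, default=None,
                        help="Directory to save the converted low-bit model")
    parser.add_argument("--load-path", type=str, default=None,
                        help="Path to a previously saved low-bit model")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type, e.g. sym_int4")
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="Prompt for the verification run (default assumed)")
    return parser
```

Note that argparse exposes hyphenated flags as underscored attributes, e.g. `args.load_path`.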
Outputs
| Name | Type | Description |
|---|---|---|
| Saved model files | Files | Low-bit model in save_path |
| Generated text | Console | Inference output for verification |
| Timing metrics | Console | Per-iteration latency |
Usage Examples
Save Model
python generate.py \
--repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
--save-path "./llama2-npu-saved" \
--low-bit "sym_int4"
Load and Generate
python generate.py \
--load-path "./llama2-npu-saved" \
--prompt "What is deep learning?"