Implementation: Intel IPEX-LLM NPU Save/Load
| Knowledge Sources | |
|---|---|
| Domains | Model_Serialization, NPU, Quantization |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for saving and loading low-bit quantized models for NPU inference using IPEX-LLM's save_low_bit and load_low_bit APIs.
Description
This script demonstrates the model save/load workflow for NPU deployment. It either converts a HuggingFace model to low-bit format and saves it (save_low_bit), or loads a previously saved model (load_low_bit) for fast startup. The script then runs three timed inference iterations to benchmark the loaded model's performance; the quantization scheme is configurable.
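The timed-iteration loop described above can be sketched as follows. This is an illustrative helper, not code from the script: the `benchmark` name and structure are ours, and the real script times `model.generate` calls directly.

```python
import time

def benchmark(generate_fn, iterations=3):
    """Run generate_fn repeatedly and return per-iteration latencies in seconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        generate_fn()  # in the real script: tokenize, model.generate, decode
        latencies.append(time.perf_counter() - start)
    return latencies

# Stand-in workload; the actual script wraps the NPU inference call here.
timings = benchmark(lambda: sum(range(10_000)))
print([f"{t:.4f}s" for t in timings])
```

The per-iteration latencies correspond to the "Timing metrics" output listed in the I/O contract.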
Usage
Use this to pre-convert models for repeated NPU inference, avoiding the quantization overhead on subsequent runs. The save/load pattern is essential for production deployments where startup time matters.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/NPU/HF-Transformers-AutoModels/Save-Load/generate.py
- Lines: 1-106
Signature
# Key API:
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
# Save path:
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit=low_bit, ...)
model.save_low_bit(save_path)
# Load path:
model = AutoModelForCausalLM.load_low_bit(save_path, ...)
Import
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer
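The save-or-load decision shown in the signature can be expressed compactly as below. This is a hedged sketch: `load_or_convert` is our own helper name, not part of the script, and the model class is passed in explicitly so the same dispatch logic applies to `ipex_llm.transformers.npu_model.AutoModelForCausalLM` without hard-coding it.

```python
def load_or_convert(automodel_cls, model_path=None,
                    save_path=None, load_path=None, low_bit="sym_int4"):
    """Load a previously saved low-bit model if load_path is given;
    otherwise convert a HuggingFace checkpoint and optionally save it."""
    if load_path:
        # Fast path: skip quantization entirely and reload converted weights.
        return automodel_cls.load_low_bit(load_path)
    # Slow path: quantize the original checkpoint on the fly.
    model = automodel_cls.from_pretrained(model_path, load_in_low_bit=low_bit)
    if save_path:
        model.save_low_bit(save_path)  # persist for fast startup next time
    return model
```

In the example script this pattern would be driven by the parsed CLI arguments, e.g. `load_or_convert(AutoModelForCausalLM, model_path=args.repo_id_or_model_path, ...)`.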
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo-id-or-model-path | str | Yes | HuggingFace model ID or local path |
| save-path | str | No | Directory to save converted model |
| load-path | str | No | Path to load previously saved model |
| low-bit | str | No | Quantization type (default: sym_int4) |
| prompt | str | No | Input prompt for inference verification |
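The inputs above map directly onto a command-line parser. A minimal sketch, with flag names taken from the table; the default prompt and the absence of `required=True` are assumptions, not confirmed from the script:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Save or load a low-bit quantized model for NPU inference")
    # Listed as required in the I/O contract; the real script may supply a default.
    parser.add_argument("--repo-id-or-model-path", type=str,
                        help="HuggingFace model ID or local path")
    parser.add_argument("--save-path", type=str, default=None,
                        help="Directory to save the converted low-bit model")
    parser.add_argument("--load-path", type=str, default=None,
                        help="Path to a previously saved low-bit model")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type, e.g. sym_int4")
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="Prompt for the verification run (default assumed)")
    return parser
```

Note that argparse exposes hyphenated flags as underscored attributes, e.g. `args.load_path`.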
Outputs
| Name | Type | Description |
|---|---|---|
| Saved model files | Files | Low-bit model in save_path |
| Generated text | Console | Inference output for verification |
| Timing metrics | Console | Per-iteration latency |
Usage Examples
Save Model
python generate.py \
--repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
--save-path "./llama2-npu-saved" \
--low-bit "sym_int4"
Load and Generate
python generate.py \
--load-path "./llama2-npu-saved" \
--prompt "What is deep learning?"