
Implementation:Intel Ipex llm NPU Save Load

From Leeroopedia


Knowledge Sources
Domains Model_Serialization, NPU, Quantization
Last Updated 2026-02-09 04:00 GMT

Overview

Concrete tool for saving and loading low-bit quantized models for NPU inference using IPEX-LLM's save_low_bit and load_low_bit APIs.

Description

This script demonstrates the model save/load workflow for NPU deployment. It either converts a HuggingFace model to low-bit format and saves it with save_low_bit, or loads a previously converted model with load_low_bit for fast startup. After loading, the script runs three timed inference iterations to verify output and benchmark latency; the quantization scheme is configurable via the low-bit argument.

Usage

Use this to pre-convert models for repeated NPU inference, avoiding the quantization overhead on subsequent runs. The save/load pattern is essential for production deployments where startup time matters.

Code Reference

Source Location

Signature

# Key API:
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

# Save path:
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit=low_bit, ...)
model.save_low_bit(save_path)

# Load path:
model = AutoModelForCausalLM.load_low_bit(save_path, ...)

Import

from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer

I/O Contract

Inputs

Name                   Type  Required  Description
repo-id-or-model-path  str   Yes       HuggingFace model ID or local path
save-path              str   No        Directory to save converted model
load-path              str   No        Path to load previously saved model
low-bit                str   No        Quantization type (default: sym_int4)
prompt                 str   No        Input prompt for inference verification

Outputs

Name               Type     Description
Saved model files  Files    Low-bit model in save_path
Generated text     Console  Inference output for verification
Timing metrics     Console  Per-iteration latency
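A command-line interface matching the contract above can be sketched with stdlib argparse. Flag names follow the Inputs table; the defaults for low-bit and prompt, and the fallback model ID, are assumptions (the table marks repo-id-or-model-path as required, but the load example below omits it, so a default is assumed here):

```python
import argparse

def build_parser():
    # Flags mirror the I/O contract table above.
    parser = argparse.ArgumentParser(
        description="Save or load a low-bit quantized model for NPU inference")
    parser.add_argument("--repo-id-or-model-path", type=str,
                        default="meta-llama/Llama-2-7b-chat-hf",
                        help="HuggingFace model ID or local path")
    parser.add_argument("--save-path", type=str, default=None,
                        help="Directory to save converted model")
    parser.add_argument("--load-path", type=str, default=None,
                        help="Path to load previously saved model")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type")
    parser.add_argument("--prompt", type=str,
                        default="What is deep learning?",
                        help="Input prompt for inference verification")
    return parser

# Dashed flag names become underscored attributes after parsing:
args = build_parser().parse_args(["--load-path", "./llama2-npu-saved"])
# args.load_path == "./llama2-npu-saved", args.low_bit == "sym_int4"
```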

Usage Examples

Save Model

python generate.py \
    --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
    --save-path "./llama2-npu-saved" \
    --low-bit "sym_int4"

Load and Generate

python generate.py \
    --load-path "./llama2-npu-saved" \
    --prompt "What is deep learning?"
