# Implementation: Intel IPEX-LLM NPU Model Convert
| Knowledge Sources | |
|---|---|
| Domains | Model_Conversion, NPU, Quantization |
| Last Updated | 2026-02-09 04:00 GMT |
## Overview
Concrete tool for converting HuggingFace causal language models to a low-bit, NPU-optimized format for C++ deployment.
## Description
This script converts a causal language model from HuggingFace format to a low-bit quantized format suitable for NPU inference via the C++ CLI. It uses IPEX-LLM's NPU-specific AutoModelForCausalLM to load and quantize the model, then saves both the model and tokenizer to a specified directory for downstream C++ inference.
## Usage
Use this as a preprocessing step before deploying models via the C++ NPU CLI (llm-cli). The converted model files are optimized for NPU inference and cannot be used with standard HuggingFace APIs.
## Code Reference
### Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/convert.py
- Lines: 1-92
### Signature

```python
# Script-based execution with argparse
# Key API:
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit=args.low_bit,
    trust_remote_code=True,
    optimize_model=True,
)
model.save_low_bit(save_path)
```
### Import

```python
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer
```
## I/O Contract
### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `--repo-id-or-model-path` | str | Yes | HuggingFace model ID or local path |
| `--save-path` | str | Yes | Output directory for converted model |
| `--low-bit` | str | No | Quantization type (default: `sym_int4`) |
| `--max-context-len` | int | No | Maximum context length |
| `--max-prompt-len` | int | No | Maximum prompt length |
### Outputs
| Name | Type | Description |
|---|---|---|
| Converted model | Files | Low-bit NPU model files in save_path |
| Tokenizer | Files | Copied tokenizer files in save_path |
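The input contract above can be mirrored with a small `argparse` sketch. This is an illustration of the flag surface, not the real script: the defaults for the two length flags are assumptions, and the actual `convert.py` may wire them differently.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the I/O contract table; the defaults for the two length
    # flags below are illustrative assumptions, not values from the
    # real convert.py.
    parser = argparse.ArgumentParser(
        description="Convert a HuggingFace causal LM to low-bit NPU format"
    )
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="HuggingFace model ID or local path")
    parser.add_argument("--save-path", type=str, required=True,
                        help="Output directory for the converted model")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type")
    parser.add_argument("--max-context-len", type=int, default=1024,
                        help="Maximum context length (default is an assumption)")
    parser.add_argument("--max-prompt-len", type=int, default=512,
                        help="Maximum prompt length (default is an assumption)")
    return parser

# Only the two required flags need to be passed; the rest fall back
# to their defaults.
args = build_parser().parse_args([
    "--repo-id-or-model-path", "meta-llama/Llama-2-7b-chat-hf",
    "--save-path", "./llama2-npu",
])
print(args.low_bit)  # → sym_int4
```

Note that `argparse` converts the dashed flag names to underscored attributes, e.g. `--repo-id-or-model-path` becomes `args.repo_id_or_model_path`.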
## Usage Examples
### Convert Model for NPU C++ CLI

```bash
python convert.py \
    --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
    --save-path "./llama2-npu" \
    --low-bit "sym_int4"
```
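After conversion, a quick sanity check can confirm the tokenizer files were copied alongside the low-bit model. The low-bit model file names are IPEX-LLM internals, so this sketch only looks for standard HuggingFace tokenizer artifacts; the exact set of files (assumed here to include `tokenizer_config.json`) varies by tokenizer type.

```python
from pathlib import Path

def check_converted_dir(save_path: str) -> list[str]:
    # Return the tokenizer artifacts present in the output directory.
    # tokenizer_config.json is written by AutoTokenizer.save_pretrained;
    # the other names listed here are common but not guaranteed.
    expected = ["tokenizer_config.json", "special_tokens_map.json"]
    out = Path(save_path)
    return [name for name in expected if (out / name).exists()]
```

For example, after running `convert.py` with `--save-path "./llama2-npu"`, `check_converted_dir("./llama2-npu")` should include `"tokenizer_config.json"`; an empty list suggests the tokenizer was not saved.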