Implementation: Intel IPEX-LLM NPU Llama2 Inference
| Knowledge Sources | |
|---|---|
| Domains | Inference, NPU, LLM |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for running Llama2 inference on Intel NPU with IPEX-LLM low-bit quantization and optional streaming output.
Description
This script demonstrates the reference NPU inference pattern for Llama2 models. It loads the model using IPEX-LLM's NPU-specific AutoModelForCausalLM with configurable low-bit quantization, formats prompts with the Llama2 chat template via get_prompt(), and runs 5 timed inference iterations to measure latency. It supports both a save/load workflow (convert and save the model once, then reload it quickly on later runs) and streaming output via HuggingFace's TextStreamer.
Usage
Use this as the reference example for running LLM inference on Intel NPU hardware. The pattern demonstrated here (load, quantize, save, generate) applies to all NPU-compatible models in the IPEX-LLM ecosystem.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/NPU/HF-Transformers-AutoModels/LLM/llama2.py
- Lines: 1-127
Signature
def get_prompt(message: str, chat_history: list, system_prompt: str) -> str:
    """Format multi-turn conversation with Llama2 template syntax."""

# Key API:
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit=args.low_bit,
    optimize_model=True,
    max_context_len=args.max_context_len,
    max_prompt_len=args.max_prompt_len,
)
model.save_low_bit(save_path)

# Or load a previously converted model:
model = AutoModelForCausalLM.load_low_bit(save_path, ...)
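A minimal sketch of how get_prompt can assemble the prompt. This follows the standard Llama2 chat convention (`[INST]`/`<<SYS>>` tags); the exact helper lives in llama2.py and may differ in detail:

```python
def get_prompt(message: str, chat_history: list, system_prompt: str) -> str:
    """Format a multi-turn conversation with the Llama2 chat template."""
    # The system prompt is wrapped in <<SYS>> tags inside the first [INST] block.
    texts = [f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"]
    # Each past (user, assistant) turn is closed with </s> and a fresh <s>[INST].
    for user_input, response in chat_history:
        texts.append(f"{user_input.strip()} [/INST] {response.strip()} </s><s>[INST] ")
    # The current message ends with [/INST] so the model generates the reply next.
    texts.append(f"{message.strip()} [/INST]")
    return "".join(texts)
```

With an empty history this yields a single `[INST] ... [/INST]` block containing the system prompt and the user message.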
Import
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer, TextStreamer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo-id-or-model-path | str | Yes | HuggingFace model ID or local path |
| prompt | str | No | Input prompt (default: "What is AI?") |
| n-predict | int | No | Max tokens to generate (default: 32) |
| low-bit | str | No | Quantization type (default: sym_int4) |
| max-context-len | int | No | Maximum context length |
| max-prompt-len | int | No | Maximum prompt length |
| save-path | str | No | Path to save/load converted model |
Outputs
| Name | Type | Description |
|---|---|---|
| Generated text | Console | Model completion with streaming output |
| Timing metrics | Console | Per-iteration latency measurements |
| Saved model | Files | Low-bit model files (optional) |
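The per-iteration latency measurement can be sketched as a simple timing loop; `generate` below is a hypothetical stand-in for the real `model.generate` call, used only to illustrate the pattern:

```python
import time

def benchmark(generate, prompt, n_iters=5):
    """Call `generate` n_iters times, returning per-iteration latency in seconds.

    `generate` is a placeholder for the actual model.generate invocation.
    """
    latencies = []
    for _ in range(n_iters):
        start = time.perf_counter()
        generate(prompt)  # one full decode pass
        latencies.append(time.perf_counter() - start)
    return latencies

# Example with a dummy generate function:
lats = benchmark(lambda p: p.upper(), "What is AI?")
```

In the real script the first iteration is typically slower (graph compilation / warm-up on the NPU), so later iterations give a better steady-state number.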
Usage Examples
NPU Inference
python llama2.py \
--repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
--prompt "What is artificial intelligence?" \
--n-predict 64 \
--low-bit "sym_int4"
Save and Reload
# First run: convert and save
python llama2.py \
--repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
--save-path "./llama2-npu-saved"
# Subsequent runs: fast load
python llama2.py \
--repo-id-or-model-path "./llama2-npu-saved"
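The CLI surface documented in the I/O Contract maps onto a small argparse setup. A sketch with the flag names and defaults taken from the tables above; defaults for max-context-len and max-prompt-len are assumed, not taken from the source:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Llama2 inference on Intel NPU via IPEX-LLM")
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="HuggingFace model ID or local path")
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="Input prompt")
    parser.add_argument("--n-predict", type=int, default=32,
                        help="Max tokens to generate")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type")
    parser.add_argument("--max-context-len", type=int, default=1024)  # default assumed
    parser.add_argument("--max-prompt-len", type=int, default=512)    # default assumed
    parser.add_argument("--save-path", type=str, default=None,
                        help="Path to save/load the converted low-bit model")
    return parser

# Parse the minimal required arguments (dashes become underscores in attributes):
args = build_parser().parse_args(
    ["--repo-id-or-model-path", "meta-llama/Llama-2-7b-chat-hf"])
```

Passing a local save-path as `--repo-id-or-model-path` is what lets the second run skip conversion and load the low-bit files directly.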