Implementation: Intel IPEX-LLM NPU Llama2 Inference
| Knowledge Sources | |
|---|---|
| Domains | Inference, NPU, LLM |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for running Llama2 inference on Intel NPU with IPEX-LLM low-bit quantization and optional streaming output.
Description
This script demonstrates the reference NPU inference pattern for Llama2 models. It loads the model using IPEX-LLM's NPU-specific AutoModelForCausalLM with configurable low-bit quantization, formats prompts with the Llama2 chat template via get_prompt(), and runs 5 timed inference iterations to measure latency. It supports both a save/load workflow (convert and save the model once, then reload it quickly on later runs) and streaming output via HuggingFace's TextStreamer.
Usage
Use this as the reference example for running LLM inference on Intel NPU hardware. The pattern demonstrated here (load, quantize, save, generate) applies to all NPU-compatible models in the IPEX-LLM ecosystem.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/NPU/HF-Transformers-AutoModels/LLM/llama2.py
- Lines: 1-127
Signature
def get_prompt(message: str, chat_history: list, system_prompt: str) -> str:
    """Format multi-turn conversation with Llama2 template syntax."""

# Key API:
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit=args.low_bit,
    optimize_model=True,
    max_context_len=args.max_context_len,
    max_prompt_len=args.max_prompt_len,
)
model.save_low_bit(save_path)

# Or load a previously converted model:
model = AutoModelForCausalLM.load_low_bit(save_path, ...)
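A minimal sketch of how get_prompt can assemble the prompt. This follows the standard Llama2 chat convention (`[INST]`/`<<SYS>>` tags); the exact helper lives in llama2.py and may differ in detail:

```python
def get_prompt(message: str, chat_history: list, system_prompt: str) -> str:
    """Format a multi-turn conversation with the Llama2 chat template."""
    # The system prompt is wrapped in <<SYS>> tags inside the first [INST] block.
    texts = [f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"]
    # Each past (user, assistant) turn is closed with </s> and a fresh <s>[INST].
    for user_input, response in chat_history:
        texts.append(f"{user_input.strip()} [/INST] {response.strip()} </s><s>[INST] ")
    # The current message ends with [/INST] so the model generates the reply next.
    texts.append(f"{message.strip()} [/INST]")
    return "".join(texts)
```

With an empty history this yields a single `[INST] ... [/INST]` block containing the system prompt and the user message.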
Import
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer, TextStreamer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo-id-or-model-path | str | Yes | HuggingFace model ID or local path |
| prompt | str | No | Input prompt (default: "What is AI?") |
| n-predict | int | No | Max tokens to generate (default: 32) |
| low-bit | str | No | Quantization type (default: sym_int4) |
| max-context-len | int | No | Maximum context length |
| max-prompt-len | int | No | Maximum prompt length |
| save-path | str | No | Path to save/load converted model |
Outputs
| Name | Type | Description |
|---|---|---|
| Generated text | Console | Model completion with streaming output |
| Timing metrics | Console | Per-iteration latency measurements |
| Saved model | Files | Low-bit model files (optional) |
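The per-iteration latency measurement can be sketched as a simple timing loop; `generate` below is a hypothetical stand-in for the real `model.generate` call, used only to illustrate the pattern:

```python
import time

def benchmark(generate, prompt, n_iters=5):
    """Call `generate` n_iters times, returning per-iteration latency in seconds.

    `generate` is a placeholder for the actual model.generate invocation.
    """
    latencies = []
    for _ in range(n_iters):
        start = time.perf_counter()
        generate(prompt)  # one full decode pass
        latencies.append(time.perf_counter() - start)
    return latencies

# Example with a dummy generate function:
lats = benchmark(lambda p: p.upper(), "What is AI?")
```

In the real script the first iteration is typically slower (graph compilation / warm-up on the NPU), so later iterations give a better steady-state number.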
Usage Examples
NPU Inference
python llama2.py \
--repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
--prompt "What is artificial intelligence?" \
--n-predict 64 \
--low-bit "sym_int4"
Save and Reload
# First run: convert and save
python llama2.py \
--repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
--save-path "./llama2-npu-saved"
# Subsequent runs: fast load
python llama2.py \
--repo-id-or-model-path "./llama2-npu-saved"
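The CLI surface documented in the I/O Contract maps onto a small argparse setup. A sketch with the flag names and defaults taken from the tables above; defaults for max-context-len and max-prompt-len are assumed, not taken from the source:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Llama2 inference on Intel NPU via IPEX-LLM")
    parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                        help="HuggingFace model ID or local path")
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="Input prompt")
    parser.add_argument("--n-predict", type=int, default=32,
                        help="Max tokens to generate")
    parser.add_argument("--low-bit", type=str, default="sym_int4",
                        help="Quantization type")
    parser.add_argument("--max-context-len", type=int, default=1024)  # default assumed
    parser.add_argument("--max-prompt-len", type=int, default=512)    # default assumed
    parser.add_argument("--save-path", type=str, default=None,
                        help="Path to save/load the converted low-bit model")
    return parser

# Parse the minimal required arguments (dashes become underscores in attributes):
args = build_parser().parse_args(
    ["--repo-id-or-model-path", "meta-llama/Llama-2-7b-chat-hf"])
```

Passing a local save-path as `--repo-id-or-model-path` is what lets the second run skip conversion and load the low-bit files directly.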