
Implementation:Intel IPEX-LLM NPU Llama2 Inference

From Leeroopedia


Knowledge Sources
Domains Inference, NPU, LLM
Last Updated 2026-02-09 04:00 GMT

Overview

Concrete tool for running Llama2 inference on Intel NPU with IPEX-LLM low-bit quantization and optional streaming output.

Description

This script demonstrates the reference NPU inference pattern for Llama2 models. It loads a model with IPEX-LLM's NPU-specific AutoModelForCausalLM using configurable low-bit quantization, formats prompts with the Llama2 chat template via get_prompt(), and runs five timed inference iterations, reporting per-iteration latency. It also supports a save/load workflow (convert and save once, then reload quickly on later runs) and streaming output via Hugging Face's TextStreamer.

Usage

Use this as the reference example for running LLM inference on Intel NPU hardware. The pattern demonstrated here (load, quantize, save, generate) applies to all NPU-compatible models in the IPEX-LLM ecosystem.
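The prompt-formatting step can be sketched as follows. This is a minimal, hedged reconstruction of what a get_prompt() helper for the standard Llama2 chat template does ([INST] ... [/INST] turns with an optional <<SYS>> system block); the exact implementation in the script may differ.

```python
def get_prompt(message: str, chat_history: list, system_prompt: str) -> str:
    """Format a multi-turn conversation with the Llama2 [INST] template."""
    # System prompt is wrapped in <<SYS>> tags inside the first [INST] block.
    texts = [f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"]
    # Each prior (user, assistant) turn is closed with </s> and a new [INST] opened.
    for user_input, response in chat_history:
        texts.append(f"{user_input.strip()} [/INST] {response.strip()} </s><s>[INST] ")
    # The current user message ends the prompt, leaving [/INST] open for generation.
    texts.append(f"{message.strip()} [/INST]")
    return "".join(texts)
```

For example, get_prompt("What is AI?", [], "You are a helpful assistant.") yields a single-turn prompt beginning with the system block and ending in "[/INST]", ready for the tokenizer.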

Code Reference

Source Location

Signature

def get_prompt(message: str, chat_history: list, system_prompt: str) -> str:
    """Format multi-turn conversation with Llama2 template syntax."""

# Key API:
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit=args.low_bit,          # e.g. "sym_int4"
    optimize_model=True,
    max_context_len=args.max_context_len,
    max_prompt_len=args.max_prompt_len,
)
# Save the converted low-bit model for fast loading on later runs:
model.save_low_bit(save_path)
# Or reload a previously converted model:
model = AutoModelForCausalLM.load_low_bit(save_path, ...)

Import

from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer, TextStreamer
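The five-iteration latency loop mentioned in the Description follows a simple pattern: time each generate call and print per-iteration latency. The sketch below shows that timing logic with a generic generate_fn callable standing in for the actual model.generate(...) call, since the real call requires NPU hardware; timed_generate is a hypothetical helper name, not part of the IPEX-LLM API.

```python
import time

def timed_generate(generate_fn, n_iters: int = 5):
    """Run generate_fn repeatedly, printing per-iteration latency.

    Mirrors the script's pattern of running several inference
    iterations and reporting how long each one took.
    """
    latencies = []
    output = None
    for i in range(n_iters):
        start = time.perf_counter()
        output = generate_fn()          # on NPU: model.generate(input_ids, ...)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        print(f"Iteration {i}: {elapsed:.3f} s")
    return output, latencies
```

In the real script, generate_fn would wrap model.generate with the tokenized prompt and max_new_tokens=args.n_predict; when streaming is enabled, a TextStreamer instance is passed via the streamer argument so tokens print as they are produced.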

I/O Contract

Inputs

Name Type Required Description
repo-id-or-model-path str Yes HuggingFace model ID or local path
prompt str No Input prompt (default: "What is AI?")
n-predict int No Max tokens to generate (default: 32)
low-bit str No Quantization type (default: sym_int4)
max-context-len int No Maximum context length
save-path str No Path to save/load converted model
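The flags in the table above map onto an argparse setup roughly like the sketch below. Flag names and the documented defaults follow the table; the max-context-len default of 1024 is an assumed placeholder, since the table lists none.

```python
import argparse

# Sketch of the CLI surface described in the Inputs table.
parser = argparse.ArgumentParser(
    description="Llama2 inference on Intel NPU with IPEX-LLM"
)
parser.add_argument("--repo-id-or-model-path", type=str, required=True,
                    help="HuggingFace model ID or local path")
parser.add_argument("--prompt", type=str, default="What is AI?",
                    help="Input prompt")
parser.add_argument("--n-predict", type=int, default=32,
                    help="Max tokens to generate")
parser.add_argument("--low-bit", type=str, default="sym_int4",
                    help="Quantization type")
parser.add_argument("--max-context-len", type=int, default=1024,
                    help="Maximum context length (assumed default)")
parser.add_argument("--save-path", type=str, default=None,
                    help="Path to save/load the converted model")
```

Note that argparse converts dashes to underscores, so --repo-id-or-model-path is read as args.repo_id_or_model_path in the script body.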

Outputs

Name Type Description
Generated text Console Model completion with streaming output
Timing metrics Console Per-iteration latency measurements
Saved model Files Low-bit model files (optional)

Usage Examples

NPU Inference

python llama2.py \
    --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
    --prompt "What is artificial intelligence?" \
    --n-predict 64 \
    --low-bit "sym_int4"

Save and Reload

# First run: convert and save
python llama2.py \
    --repo-id-or-model-path "meta-llama/Llama-2-7b-chat-hf" \
    --save-path "./llama2-npu-saved"

# Subsequent runs: fast load
python llama2.py \
    --repo-id-or-model-path "./llama2-npu-saved"
