Implementation: Intel IPEX-LLM NPU LLM CLI (C++)
| Knowledge Sources | |
|---|---|
| Domains | Cpp_Inference, NPU, Chat_Interface |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for interactive LLM inference on Intel NPU via a C++ CLI application with multi-model chat template support.
Description
This C++ application loads an NPU-optimized LLM and provides an interactive conversational interface. It supports multiple model-specific chat templates (Llama2, Llama3, Qwen2, MiniCPM, DeepSeek-R1) via the add_chat_history function and runs the prefill/decode inference loop via run_generate. Tokenization, the KV cache, and multi-round conversation context are managed through the NPU-specific C API (npu_llm.h).
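To illustrate the per-model template dispatch that add_chat_history performs, here is a minimal sketch. The function name wrap_prompt and the exact template strings are illustrative assumptions based on the public chat formats of these model families, not the literal strings in llm-cli.cpp:

```cpp
#include <string>

// Hypothetical mirror of the template dispatch inside add_chat_history:
// each supported model family wraps the user prompt in its own chat markup.
// The markup strings below follow the publicly documented formats for these
// families and are assumptions, not copied from llm-cli.cpp.
std::string wrap_prompt(const std::string& model_type, const std::string& prompt) {
    if (model_type == "llama2")
        // Llama2-style instruction tags
        return "[INST] " + prompt + " [/INST]";
    if (model_type == "llama3")
        // Llama3-style header/end-of-turn markers
        return "<|start_header_id|>user<|end_header_id|>\n\n" + prompt + "<|eot_id|>";
    if (model_type == "qwen2")
        // ChatML-style markers used by Qwen2
        return "<|im_start|>user\n" + prompt + "<|im_end|>\n";
    // Unknown family: pass the prompt through unchanged
    return prompt;
}
```

In the real application this formatting also threads in the accumulated chat_history and the first_turn flag (e.g. to prepend a system prompt only once), which this sketch omits.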
Usage
Use this as the inference runtime for NPU-converted models (output of convert.py). It provides a terminal-based chat interface for interactive inference with low-latency NPU acceleration.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/llm-cli.cpp
- Lines: 1-267
Signature
std::string add_chat_history(
npu_model_params model_params,
std::string new_prompt,
std::string chat_history,
bool first_turn
);
std::string run_generate(
void* model,
int32_t* input_ids,
int32_t input_length,
npu_model_params model_params,
tokenizer_params tok_params,
npu_generation_params gen_params
);
int main(int argc, char** argv);
Import
#include "npu_common.h"
#include "npu_llm.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m | string | Yes | Path to converted NPU model directory |
| -p | string | No | Initial prompt text |
| -n | int | No | Max tokens to generate (default: 256) |
| -cnv | flag | No | Enable multi-round conversation mode |
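A minimal sketch of parsing the flags in the table above. The struct name and parsing loop are assumptions for illustration; the actual llm-cli.cpp may parse its arguments differently:

```cpp
#include <cstdlib>
#include <string>

// Holds the CLI options from the Inputs table (names are hypothetical).
struct CliArgs {
    std::string model_dir;      // -m: converted NPU model directory (required)
    std::string prompt;         // -p: initial prompt text
    int max_tokens = 256;       // -n: max tokens to generate (default 256)
    bool conversation = false;  // -cnv: multi-round conversation mode
};

// Walks argv and fills CliArgs; unknown flags are ignored in this sketch.
CliArgs parse_args(int argc, char** argv) {
    CliArgs a;
    for (int i = 1; i < argc; ++i) {
        std::string flag = argv[i];
        if (flag == "-m" && i + 1 < argc)       a.model_dir = argv[++i];
        else if (flag == "-p" && i + 1 < argc)  a.prompt = argv[++i];
        else if (flag == "-n" && i + 1 < argc)  a.max_tokens = std::atoi(argv[++i]);
        else if (flag == "-cnv")                a.conversation = true;
    }
    return a;
}
```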
Outputs
| Name | Type | Description |
|---|---|---|
| Generated text | stdout | Streaming token-by-token output |
| Performance stats | stdout | Tokens per second, prefill and decode latency |
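The throughput figure in the stats line can be derived from the decode timing alone. A sketch of the arithmetic, assuming tokens per second is computed as generated tokens divided by total decode time (the exact formula in llm-cli.cpp may differ, e.g. by excluding the first token):

```cpp
// Decode throughput: generated tokens / decode wall time.
// Guards against a zero duration to avoid division by zero.
double tokens_per_second(int n_tokens, double decode_seconds) {
    return decode_seconds > 0.0 ? n_tokens / decode_seconds : 0.0;
}

// Average per-token decode latency in milliseconds (the inverse view).
double decode_latency_ms(int n_tokens, double decode_seconds) {
    return n_tokens > 0 ? decode_seconds * 1000.0 / n_tokens : 0.0;
}
```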
Usage Examples
Interactive Chat
./llm-cli -m ./llama2-npu -n 256 -cnv
Single Prompt
./llm-cli -m ./llama2-npu -p "Explain quantum computing" -n 128