Implementation: Intel IPEX-LLM NPU LLM CLI (C++)
| Knowledge Sources | |
|---|---|
| Domains | Cpp_Inference, NPU, Chat_Interface |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for interactive LLM inference on Intel NPU via a C++ CLI application with multi-model chat template support.
Description
This C++ application loads an NPU-optimized LLM and provides an interactive conversational interface. It supports multiple model-specific chat templates (Llama2, Llama3, Qwen2, MiniCPM, DeepSeek-R1) via the add_chat_history function and runs the prefill/decode inference loop via run_generate. Tokenization, the KV cache, and multi-round conversation context are managed through the NPU-specific C API (npu_llm.h).
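To illustrate the per-model template dispatch that add_chat_history performs, here is a minimal sketch. The function name wrap_prompt and the exact template strings are illustrative assumptions based on the public chat formats of these model families, not the literal strings in llm-cli.cpp:

```cpp
#include <string>

// Hypothetical mirror of the template dispatch inside add_chat_history:
// each supported model family wraps the user prompt in its own chat markup.
// The markup strings below follow the publicly documented formats for these
// families and are assumptions, not copied from llm-cli.cpp.
std::string wrap_prompt(const std::string& model_type, const std::string& prompt) {
    if (model_type == "llama2")
        // Llama2-style instruction tags
        return "[INST] " + prompt + " [/INST]";
    if (model_type == "llama3")
        // Llama3-style header/end-of-turn markers
        return "<|start_header_id|>user<|end_header_id|>\n\n" + prompt + "<|eot_id|>";
    if (model_type == "qwen2")
        // ChatML-style markers used by Qwen2
        return "<|im_start|>user\n" + prompt + "<|im_end|>\n";
    // Unknown family: pass the prompt through unchanged
    return prompt;
}
```

In the real application this formatting also threads in the accumulated chat_history and the first_turn flag (e.g. to prepend a system prompt only once), which this sketch omits.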
Usage
Use this as the inference runtime for NPU-converted models (output of convert.py). It provides a terminal-based chat interface for interactive inference with low-latency NPU acceleration.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/llm-cli.cpp
- Lines: 1-267
Signature
std::string add_chat_history(
npu_model_params model_params,
std::string new_prompt,
std::string chat_history,
bool first_turn
);
std::string run_generate(
void* model,
int32_t* input_ids,
int32_t input_length,
npu_model_params model_params,
tokenizer_params tok_params,
npu_generation_params gen_params
);
int main(int argc, char** argv);
Import
#include "npu_common.h"
#include "npu_llm.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m | string | Yes | Path to converted NPU model directory |
| -p | string | No | Initial prompt text |
| -n | int | No | Max tokens to generate (default: 256) |
| -cnv | flag | No | Enable multi-round conversation mode |
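A minimal sketch of parsing the flags in the table above. The struct name and parsing loop are assumptions for illustration; the actual llm-cli.cpp may parse its arguments differently:

```cpp
#include <cstdlib>
#include <string>

// Holds the CLI options from the Inputs table (names are hypothetical).
struct CliArgs {
    std::string model_dir;      // -m: converted NPU model directory (required)
    std::string prompt;         // -p: initial prompt text
    int max_tokens = 256;       // -n: max tokens to generate (default 256)
    bool conversation = false;  // -cnv: multi-round conversation mode
};

// Walks argv and fills CliArgs; unknown flags are ignored in this sketch.
CliArgs parse_args(int argc, char** argv) {
    CliArgs a;
    for (int i = 1; i < argc; ++i) {
        std::string flag = argv[i];
        if (flag == "-m" && i + 1 < argc)       a.model_dir = argv[++i];
        else if (flag == "-p" && i + 1 < argc)  a.prompt = argv[++i];
        else if (flag == "-n" && i + 1 < argc)  a.max_tokens = std::atoi(argv[++i]);
        else if (flag == "-cnv")                a.conversation = true;
    }
    return a;
}
```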
Outputs
| Name | Type | Description |
|---|---|---|
| Generated text | stdout | Streaming token-by-token output |
| Performance stats | stdout | Tokens per second, prefill and decode latency |
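The throughput figure in the stats line can be derived from the decode timing alone. A sketch of the arithmetic, assuming tokens per second is computed as generated tokens divided by total decode time (the exact formula in llm-cli.cpp may differ, e.g. by excluding the first token):

```cpp
// Decode throughput: generated tokens / decode wall time.
// Guards against a zero duration to avoid division by zero.
double tokens_per_second(int n_tokens, double decode_seconds) {
    return decode_seconds > 0.0 ? n_tokens / decode_seconds : 0.0;
}

// Average per-token decode latency in milliseconds (the inverse view).
double decode_latency_ms(int n_tokens, double decode_seconds) {
    return n_tokens > 0 ? decode_seconds * 1000.0 / n_tokens : 0.0;
}
```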
Usage Examples
Interactive Chat
./llm-cli -m ./llama2-npu -n 256 -cnv
Single Prompt
./llm-cli -m ./llama2-npu -p "Explain quantum computing" -n 128