

Implementation:Intel Ipex llm NPU LLM CLI Cpp

From Leeroopedia


Knowledge Sources
Domains Cpp_Inference, NPU, Chat_Interface
Last Updated 2026-02-09 04:00 GMT

Overview

A C++ command-line application for interactive LLM inference on Intel NPUs, with support for multiple model-specific chat templates.

Description

This C++ application loads an NPU-optimized LLM model and provides an interactive conversational interface. It supports multiple model-specific chat templates (Llama2, Llama3, Qwen2, MiniCPM, DeepSeek-R1) via the add_chat_history function and performs prefill/decode inference loops via run_generate. It manages tokenization, KV cache, and multi-round conversation context through the NPU-specific C API (npu_llm.h).
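To make the chat-template dispatch concrete, here is a minimal sketch of what one branch of such a function might look like for a Llama2-style template. The function name and exact template strings are illustrative assumptions, not the real `add_chat_history` from this tool, which additionally dispatches on `model_params` to cover Llama3, Qwen2, MiniCPM, and DeepSeek-R1 formats.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of a Llama2-style branch inside a chat-history
// builder: wrap the first prompt in an instruction block, and append
// subsequent prompts to the accumulated conversation history.
std::string add_chat_history_llama2(const std::string& new_prompt,
                                    const std::string& chat_history,
                                    bool first_turn) {
    if (first_turn) {
        // First turn: open the instruction block around the user prompt.
        return "<s>[INST] " + new_prompt + " [/INST]";
    }
    // Later turns: keep prior history (including model replies) and
    // append the new prompt as a fresh instruction block.
    return chat_history + " <s>[INST] " + new_prompt + " [/INST]";
}
```

The key design point is that the template logic is purely string assembly; the resulting prompt string is then tokenized and fed to the prefill/decode loop.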

Usage

Use this as the inference runtime for NPU-converted models (output of convert.py). It provides a terminal-based chat interface for interactive inference with low-latency NPU acceleration.

Code Reference

Source Location

Signature

std::string add_chat_history(
    npu_model_params model_params,
    std::string new_prompt,
    std::string chat_history,
    bool first_turn
);

std::string run_generate(
    void* model,
    int32_t* input_ids,
    int32_t input_length,
    npu_model_params model_params,
    tokenizer_params tok_params,
    npu_generation_params gen_params
);

int main(int argc, char** argv);
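The call sequence implied by these signatures can be sketched as follows. Every type and function body below is an illustrative stand-in for the real declarations in npu_llm.h (which is not reproduced here); the stub only models how a per-turn driver would wire tokenized input into `run_generate`, not actual NPU inference.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in parameter structs; the real definitions live in npu_llm.h.
struct npu_model_params { int max_output_len = 256; };
struct tokenizer_params {};
struct npu_generation_params { int max_new_tokens = 256; };

// Stub run_generate: the real one runs a prefill pass over input_ids,
// then decodes token by token, streaming text to stdout.
std::string run_generate(void* /*model*/, int32_t* /*input_ids*/,
                         int32_t input_length, npu_model_params /*mp*/,
                         tokenizer_params /*tp*/,
                         npu_generation_params gen) {
    return "generated " + std::to_string(gen.max_new_tokens) +
           " tokens from " + std::to_string(input_length) + " input ids";
}

// Sketch of the per-turn driver loop as main() might assemble it:
// build params, pass the tokenized prompt, and collect the reply.
std::string chat_turn(void* model, const std::vector<int32_t>& ids) {
    npu_model_params model_params;
    tokenizer_params tok_params;
    npu_generation_params gen_params;
    gen_params.max_new_tokens = 128;  // e.g. from the -n flag
    return run_generate(model, const_cast<int32_t*>(ids.data()),
                        static_cast<int32_t>(ids.size()),
                        model_params, tok_params, gen_params);
}
```

In the real application, the returned text (plus the user prompt) is folded back into the chat history via `add_chat_history` before the next turn.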

Import

#include "npu_common.h"
#include "npu_llm.h"

I/O Contract

Inputs

Name  Type    Required  Description
-m    string  Yes       Path to converted NPU model directory
-p    string  No        Initial prompt text
-n    int     No        Max tokens to generate (default: 256)
-cnv  flag    No        Enable multi-round conversation mode

Outputs

Name               Type    Description
Generated text     stdout  Streaming token-by-token output
Performance stats  stdout  Tokens per second, prefill and decode latency
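The tokens-per-second figure in the performance stats can be derived in the obvious way: tokens generated divided by wall-clock decode time. The helper below is an illustrative sketch (not part of npu_llm.h) of that computation.

```cpp
#include <cassert>

// Illustrative helper: decode throughput is the number of generated
// tokens divided by the decode phase's wall-clock time in seconds.
// Guard against a zero elapsed time to avoid division by zero.
double tokens_per_second(int num_tokens, double decode_seconds) {
    return decode_seconds > 0.0 ? num_tokens / decode_seconds : 0.0;
}
```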

Usage Examples

Interactive Chat

./llm-cli -m ./llama2-npu -n 256 -cnv

Single Prompt

./llm-cli -m ./llama2-npu -p "Explain quantum computing" -n 128
