Implementation:Alibaba MNN LLM Demo CLI
| Field | Value |
|---|---|
| implementation_name | LLM_Demo_CLI |
| implementation_type | API Doc |
| repository | Alibaba_MNN |
| workflow | LLM_Deployment_Pipeline |
| pipeline_stage | Inference Execution |
| source_file | transformers/llm/engine/demo/llm_demo.cpp (L253-302) |
| last_updated | 2026-02-10 14:00 GMT |
Summary
This implementation documents the llm_demo and llm_bench command-line tools for running MNN LLM inference. llm_demo provides interactive chat and batch prompt evaluation, while llm_bench provides structured performance benchmarking across multiple configurations. Both tools are built when -DMNN_BUILD_LLM=ON is set during CMake compilation.
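Both binaries come out of the standard MNN CMake build once LLM support is switched on. A minimal build sketch follows; the clone URL, generator defaults, and `-j` value are illustrative assumptions, not taken from this page:

```shell
# Illustrative out-of-source build; LLM support is enabled with the flag above.
git clone https://github.com/alibaba/MNN.git
cd MNN && mkdir -p build && cd build
cmake .. -DMNN_BUILD_LLM=ON
make -j4
# llm_demo and llm_bench appear in the build output directory
```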
API Signatures
llm_demo
```shell
# Interactive chat mode
./llm_demo <config.json>
# Batch prompt evaluation
./llm_demo <config.json> <prompt.txt> [max_token_number] [disable_thinking]
```
llm_bench
```shell
./llm_bench [options]
options:
  -h, --help
  -m, --model <filename>            (default: ./Qwen2.5-1.5B-Instruct)
  -a, --backends <cpu,opencl,metal> (default: cpu)
  -c, --precision <n>               (default: 2) | (0:Normal, 1:High, 2:Low)
  -t, --threads <n>                 (default: 4)
  -p, --n-prompt <n>                (default: 512)
  -n, --n-gen <n>                   (default: 128)
  -pg <pp,tg>                       (default: 512,128)
  -mmp, --mmap <0|1>                (default: 0)
  -rep, --n-repeat <n>              (default: 5)
  -kv, --kv-cache <true|false>      (default: false)
  -fp, --file-print <stdout|filename> (default: stdout)
```
Source Reference
llm_demo main() (Lines 253-302)
```cpp
int main(int argc, const char* argv[]) {
    if (argc < 2) {
        std::cout << "Usage: " << argv[0] << " config.json <prompt.txt>" << std::endl;
        return 0;
    }
    MNN::BackendConfig backendConfig;
    auto executor = MNN::Express::Executor::newExecutor(MNN_FORWARD_CPU, backendConfig, 1);
    MNN::Express::ExecutorScope s(executor);
    std::string config_path = argv[1];
    std::cout << "config path is " << config_path << std::endl;
    std::unique_ptr<Llm> llm(Llm::createLLM(config_path));
    llm->set_config("{\"tmp_path\":\"tmp\"}");
    {
        AUTOTIME;
        bool res = llm->load();
        if (!res) {
            MNN_ERROR("LLM init error\n");
            return 0;
        }
    }
    if (true) {
        AUTOTIME;
        tuning_prepare(llm.get());
    }
    if (argc < 3) {
        chat(llm.get());
        return 0;
    }
    int max_token_number = -1;
    if (argc >= 4) {
        std::istringstream os(argv[3]);
        os >> max_token_number;
    }
    if (argc >= 5) {
        MNN_PRINT("Set not thinking, only valid for Qwen3\n");
        llm->set_config(R"({
            "jinja": {
                "context": {
                    "enable_thinking":false
                }
            }
        })");
    }
    std::string prompt_file = argv[2];
    llm->set_config(R"({
        "async":false
    })");
    return eval(llm.get(), prompt_file, max_token_number);
}
```
chat() Function (Lines 230-252)
```cpp
void chat(Llm* llm) {
    ChatMessages messages;
    messages.emplace_back("system", "You are a helpful assistant.");
    auto context = llm->getContext();
    while (true) {
        std::cout << "\nUser: ";
        std::string user_str;
        std::getline(std::cin, user_str);
        if (user_str == "/exit") {
            return;
        }
        if (user_str == "/reset") {
            llm->reset();
            std::cout << "\nA: reset done." << std::endl;
            continue;
        }
        messages.emplace_back("user", user_str);
        std::cout << "\nA: " << std::flush;
        llm->response(messages);
        auto assistant_str = context->generate_str;
        messages.emplace_back("assistant", assistant_str);
    }
}
```
Key Parameters
llm_demo Parameters
| Argument | Position | Description |
|---|---|---|
| `config.json` | `argv[1]` | Required. Path to the model's `config.json` file |
| `prompt.txt` | `argv[2]` | Optional. Path to a file with one prompt per line for batch evaluation |
| `max_token_number` | `argv[3]` | Optional. Maximum number of tokens to generate per prompt (`-1` = unlimited) |
| `disable_thinking` | `argv[4]` | Optional. Any value here disables "thinking" mode (Qwen3 only) |
llm_bench Parameters
| Flag | Description | Default |
|---|---|---|
| `-m, --model` | Path to `config.json` (comma-separated for multiple models) | `./Qwen2.5-1.5B-Instruct` |
| `-a, --backends` | Backends to test (comma-separated: `cpu,opencl,metal`) | `cpu` |
| `-c, --precision` | Precision level (0: Normal, 1: High, 2: Low) | `2` |
| `-t, --threads` | Thread counts to test (comma-separated) | `4` |
| `-p, --n-prompt` | Prompt lengths to test (comma-separated) | `512` |
| `-n, --n-gen` | Generation lengths to test (comma-separated) | `128` |
| `-pg` | Combined prompt+generation test (comma-separated pairs) | `512,128` |
| `-mmp, --mmap` | Enable mmap for model loading (0 or 1) | `0` |
| `-rep, --n-repeat` | Number of repetitions per test (results averaged) | `5` |
| `-kv, --kv-cache` | Use KV cache during decode (true/false) | `false` |
| `-fp, --file-print` | Output destination (`stdout` or filename for markdown output) | `stdout` |
Inputs
- `config.json`: Model configuration file path (produced by `llmexport.py` and optionally hand-tuned)
- Model files: `llm.mnn`, `llm.mnn.weight`, `tokenizer.txt`, and optionally `embeddings_bf16.bin`, `llm_config.json`
- Text prompt: User-provided text (interactive) or prompt file (batch mode)
Outputs
llm_demo
- Interactive mode: Generated text streamed to stdout in real-time
- Batch mode: Generated text for each prompt, followed by performance summary:
```
#################################
prompt tokens num = 15
decode tokens num = 128
 vision time = 0.00 s
  pixels_mp = 0.00 MP
audio process time = 0.00 s
audio input time = 0.00 s
prefill time = 0.12 s
 decode time = 3.45 s
 sample time = 0.02 s
prefill speed = 125.00 tok/s
 decode speed = 37.10 tok/s
 vision speed = 0.000 MP/s
   audio RTF = 0.000
#################################
```
llm_bench
- Performance metrics in tabular format (to stdout or markdown file), including prefill speed, decode speed, and standard deviation across repetitions
Usage Examples
Interactive Chat
```shell
./llm_demo /path/to/model_dir/config.json
# User: Hello, what is MNN?
# A: MNN is a lightweight deep learning inference engine...
# User: /reset
# A: reset done.
# User: /exit
```
Batch Prompt Evaluation
```shell
# Process each line of prompt.txt as a separate prompt
./llm_demo /path/to/model_dir/config.json prompt.txt
# Limit generation to 256 tokens per prompt
./llm_demo /path/to/model_dir/config.json prompt.txt 256
# Disable thinking mode (Qwen3)
./llm_demo /path/to/model_dir/config.json prompt.txt 256 no_think
```
Multimodal Input (VL/Audio Models)
```
# In prompt.txt, embed images with <img> tags:
<img>https://example.com/photo.jpeg</img>Describe this image.
# Specify image dimensions:
<img><hw>280, 420</hw>https://example.com/photo.jpeg</img>Describe this image.
# Embed audio with <audio> tags:
<audio>https://example.com/speech.wav</audio>What is being said?
```
Performance Benchmarking
```shell
# Compare multiple models across multiple backends and configurations
./llm_bench \
    -m ./Qwen2.5-1.5B-Instruct/config.json,./Qwen2.5-0.5B-Instruct/config.json \
    -a cpu,opencl,metal \
    -c 1,2 \
    -t 8,12 \
    -p 16,32 \
    -n 10,20 \
    -pg 8,16 \
    -mmp 0 \
    -rep 4 \
    -kv true \
    -fp ./test_result
```
Execution Flow
The llm_demo execution follows this sequence:
- Executor initialization: Creates a CPU-backed `MNN::Express::Executor` with an `ExecutorScope` for resource management.
- Model creation: `Llm::createLLM(config_path)` instantiates the LLM from the configuration file.
- Model loading: `llm->load()` loads model weights, tokenizer, and KV-cache structures (timed by `AUTOTIME`).
- Tuning preparation: `tuning_prepare()` pre-optimizes operator configurations for sequence lengths [1, 5, 10, 20, 30, 50, 100].
- Mode selection:
  - If no prompt file is given (`argc < 3`): enters the interactive `chat()` loop.
  - If a prompt file is given: enters `eval()` for batch processing.
- Inference: Calls `llm->response()` for each prompt, which internally performs the prefill and decode phases.
- Metrics collection: After all prompts, prints a performance summary including prefill/decode speeds.
Notes
- The `tmp_path` is set to `"tmp"` by default in `llm_demo` for mmap cache files.
- For C-Eval benchmark datasets (CSV format with header `id,question,A,B,C,D,answer`), the tool automatically switches to the `ceval()` evaluation mode.
- The `llm_bench` tool supports comma-separated lists for most parameters, enabling a combinatorial sweep across all configurations in a single invocation.
- When `-fp` is set to a filename, results are appended in markdown table format, preserving previous results.
- Lines in `prompt.txt` starting with `#` are treated as comments and skipped during benchmark evaluation.
Related Pages
- Principle:Alibaba_MNN_LLM_Inference_Execution
- Environment:Alibaba_MNN_CPU_Build_Environment
- Heuristic:Alibaba_MNN_LLM_Runtime_Tuning
- Implementation:Alibaba_MNN_LLM_Config_JSON - Previous step: configuring runtime parameters