
Implementation:Alibaba MNN LLM Demo CLI

From Leeroopedia


| Field | Value |
|---|---|
| implementation_name | LLM_Demo_CLI |
| implementation_type | API Doc |
| repository | Alibaba_MNN |
| workflow | LLM_Deployment_Pipeline |
| pipeline_stage | Inference Execution |
| source_file | transformers/llm/engine/demo/llm_demo.cpp (L253-302) |
| last_updated | 2026-02-10 14:00 GMT |

Summary

This implementation documents the llm_demo and llm_bench command-line tools for running MNN LLM inference. llm_demo provides interactive chat and batch prompt evaluation, while llm_bench provides structured performance benchmarking across multiple configurations. Both tools are built when MNN is configured with -DMNN_BUILD_LLM=ON at CMake time.

API Signatures

llm_demo

# Interactive chat mode
./llm_demo <config.json>

# Batch prompt evaluation
./llm_demo <config.json> <prompt.txt> [max_token_number] [disable_thinking]

llm_bench

./llm_bench [options]

options:
  -h, --help
  -m, --model <filename>                    (default: ./Qwen2.5-1.5B-Instruct)
  -a, --backends <cpu,opencl,metal>         (default: cpu)
  -c, --precision <n>                       (default: 2; 0:Normal, 1:High, 2:Low)
  -t, --threads <n>                         (default: 4)
  -p, --n-prompt <n>                        (default: 512)
  -n, --n-gen <n>                           (default: 128)
  -pg <pp,tg>                               (default: 512,128)
  -mmp, --mmap <0|1>                        (default: 0)
  -rep, --n-repeat <n>                      (default: 5)
  -kv, --kv-cache <true|false>              (default: false)
  -fp, --file-print <stdout|filename>       (default: stdout)

Source Reference

llm_demo main() (Lines 253-302)

int main(int argc, const char* argv[]) {
    if (argc < 2) {
        std::cout << "Usage: " << argv[0] << " config.json <prompt.txt>" << std::endl;
        return 0;
    }
    MNN::BackendConfig backendConfig;
    auto executor = MNN::Express::Executor::newExecutor(MNN_FORWARD_CPU, backendConfig, 1);
    MNN::Express::ExecutorScope s(executor);

    std::string config_path = argv[1];
    std::cout << "config path is " << config_path << std::endl;
    std::unique_ptr<Llm> llm(Llm::createLLM(config_path));
    llm->set_config("{\"tmp_path\":\"tmp\"}");
    {
        AUTOTIME;
        bool res = llm->load();
        if (!res) {
            MNN_ERROR("LLM init error\n");
            return 0;
        }
    }
    if (true) {
        AUTOTIME;
        tuning_prepare(llm.get());
    }
    if (argc < 3) {
        chat(llm.get());
        return 0;
    }
    int max_token_number = -1;
    if (argc >= 4) {
        std::istringstream os(argv[3]);
        os >> max_token_number;
    }
    if (argc >= 5) {
        MNN_PRINT("Set not thinking, only valid for Qwen3\n");
        llm->set_config(R"({
            "jinja": {
                "context": {
                    "enable_thinking":false
                }
            }
        })");
    }
    std::string prompt_file = argv[2];
    llm->set_config(R"({
        "async":false
    })");
    return eval(llm.get(), prompt_file, max_token_number);
}

chat() Function (Lines 230-252)

void chat(Llm* llm) {
    ChatMessages messages;
    messages.emplace_back("system", "You are a helpful assistant.");
    auto context = llm->getContext();
    while (true) {
        std::cout << "\nUser: ";
        std::string user_str;
        std::getline(std::cin, user_str);
        if (user_str == "/exit") {
            return;
        }
        if (user_str == "/reset") {
            llm->reset();
            std::cout << "\nA: reset done." << std::endl;
            continue;
        }
        messages.emplace_back("user", user_str);
        std::cout << "\nA: " << std::flush;
        llm->response(messages);
        auto assistant_str = context->generate_str;
        messages.emplace_back("assistant", assistant_str);
    }
}

Key Parameters

llm_demo Parameters

| Argument | Position | Description |
|---|---|---|
| config.json | argv[1] | Required. Path to the model's config.json file |
| prompt.txt | argv[2] | Optional. Path to a file with one prompt per line for batch evaluation |
| max_token_number | argv[3] | Optional. Maximum number of tokens to generate per prompt (-1 = unlimited) |
| disable_thinking | argv[4] | Optional. Any value here disables "thinking" mode (Qwen3 only) |

llm_bench Parameters

| Flag | Description | Default |
|---|---|---|
| -m, --model | Path to config.json (comma-separated for multiple models) | ./Qwen2.5-1.5B-Instruct |
| -a, --backends | Backends to test (comma-separated: cpu,opencl,metal) | cpu |
| -c, --precision | Precision level (0:Normal, 1:High, 2:Low) | 2 |
| -t, --threads | Thread counts to test (comma-separated) | 4 |
| -p, --n-prompt | Prompt lengths to test (comma-separated) | 512 |
| -n, --n-gen | Generation lengths to test (comma-separated) | 128 |
| -pg | Combined prompt+generation test (comma-separated pairs) | 512,128 |
| -mmp, --mmap | Enable mmap for model loading (0 or 1) | 0 |
| -rep, --n-repeat | Number of repetitions per test (results averaged) | 5 |
| -kv, --kv-cache | Use KV-cache during decode (true/false) | false |
| -fp, --file-print | Output destination (stdout or filename for markdown output) | stdout |

Inputs

  • config.json: Model configuration file path (produced by llmexport.py and optionally hand-tuned)
  • Model files: llm.mnn, llm.mnn.weight, tokenizer.txt, and optionally embeddings_bf16.bin, llm_config.json
  • Text prompt: User-provided text (interactive) or prompt file (batch mode)
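
For orientation, an illustrative config.json is shown below. The field names are assumptions based on typical MNN LLM exports (backend_type, thread_num, precision, memory in particular are not confirmed by this page); always start from the file that llmexport.py actually writes rather than this sketch:

```json
{
    "llm_model": "llm.mnn",
    "llm_weight": "llm.mnn.weight",
    "backend_type": "cpu",
    "thread_num": 4,
    "precision": "low",
    "memory": "low"
}
```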

Outputs

llm_demo

  • Interactive mode: Generated text streamed to stdout in real time
  • Batch mode: Generated text for each prompt, followed by performance summary:
#################################
prompt tokens num = 15
decode tokens num = 128
 vision time = 0.00 s
 pixels_mp = 0.00 MP
  audio process time = 0.00 s
  audio input time = 0.00 s
prefill time = 0.12 s
 decode time = 3.45 s
 sample time = 0.02 s
prefill speed = 125.00 tok/s
 decode speed = 37.10 tok/s
 vision speed = 0.000 MP/s
 audio RTF = 0.000
##################################

llm_bench

  • Performance metrics in tabular format (to stdout or markdown file), including prefill speed, decode speed, and standard deviation across repetitions

Usage Examples

Interactive Chat

./llm_demo /path/to/model_dir/config.json
# User: Hello, what is MNN?
# A: MNN is a lightweight deep learning inference engine...
# User: /reset
# A: reset done.
# User: /exit

Batch Prompt Evaluation

# Process each line of prompt.txt as a separate prompt
./llm_demo /path/to/model_dir/config.json prompt.txt

# Limit generation to 256 tokens per prompt
./llm_demo /path/to/model_dir/config.json prompt.txt 256

# Disable thinking mode (Qwen3)
./llm_demo /path/to/model_dir/config.json prompt.txt 256 no_think

Multimodal Input (VL/Audio Models)

# In prompt.txt, embed images with <img> tags:
<img>https://example.com/photo.jpeg</img>Describe this image.

# Specify image dimensions:
<img><hw>280, 420</hw>https://example.com/photo.jpeg</img>Describe this image.

# Embed audio with <audio> tags:
<audio>https://example.com/speech.wav</audio>What is being said?

Performance Benchmarking

# Compare multiple models across multiple backends and configurations
./llm_bench \
    -m ./Qwen2.5-1.5B-Instruct/config.json,./Qwen2.5-0.5B-Instruct/config.json \
    -a cpu,opencl,metal \
    -c 1,2 \
    -t 8,12 \
    -p 16,32 \
    -n 10,20 \
    -pg 8,16 \
    -mmp 0 \
    -rep 4 \
    -kv true \
    -fp ./test_result

Execution Flow

The llm_demo execution follows this sequence:

  1. Executor initialization: Creates a CPU-backed MNN::Express::Executor with an ExecutorScope for resource management.
  2. Model creation: Llm::createLLM(config_path) instantiates the LLM from the configuration file.
  3. Model loading: llm->load() loads model weights, tokenizer, and KV-cache structures (timed by AUTOTIME).
  4. Tuning preparation: tuning_prepare() pre-optimizes operator configurations for sequence lengths [1, 5, 10, 20, 30, 50, 100].
  5. Mode selection:
    • If no prompt file is given (argc < 3): enters interactive chat() loop.
    • If a prompt file is given: enters eval() for batch processing.
  6. Inference: Calls llm->response() for each prompt, which internally performs prefill and decode phases.
  7. Metrics collection: After all prompts, prints performance summary including prefill/decode speeds.

Notes

  • llm_demo sets tmp_path to "tmp" unconditionally (via set_config) as the directory for mmap cache files.
  • For C-Eval benchmark datasets (CSV format with header id,question,A,B,C,D,answer), the tool automatically switches to the ceval() evaluation mode.
  • The llm_bench tool supports comma-separated lists for most parameters, enabling a combinatorial sweep across all configurations in a single invocation.
  • When -fp is set to a filename, results are appended in markdown table format, preserving previous results.
  • Lines in prompt.txt starting with # are treated as comments and skipped during benchmark evaluation.
