Principle:Alibaba MNN LLM Inference Execution
| Field | Value |
|---|---|
| principle_name | LLM_Inference_Execution |
| repository | Alibaba_MNN |
| workflow | LLM_Deployment_Pipeline |
| pipeline_stage | Inference Execution |
| principle_type | Conceptual |
| last_updated | 2026-02-10 14:00 GMT |
Overview
LLM Inference Execution is the final stage of the MNN LLM deployment pipeline, where the exported and configured model is loaded and used to generate text. This principle covers the theory of autoregressive text generation with KV-cache optimization, the two-phase (prefill + decode) execution model, and the different interaction modes supported by the MNN runtime.
Theoretical Background
Autoregressive Text Generation
Large language models generate text one token at a time in an autoregressive fashion. Given a sequence of input tokens, the model predicts a probability distribution over the vocabulary for the next token. The selected token is appended to the sequence, and the process repeats until a stopping condition is met (end-of-sequence token, maximum token count, or other criteria).
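The loop above can be sketched in a few lines. This is a toy illustration of greedy autoregressive decoding, not MNN's actual API: `model`, `toy_model`, and the greedy `max` selection are stand-ins for the real transformer forward pass and sampler.

```python
def generate(model, tokens, eos_id, max_new_tokens):
    """Autoregressive loop: predict, append, repeat until a stop condition."""
    for _ in range(max_new_tokens):
        logits = model(tokens)  # probability distribution over the vocabulary
        # greedy selection: take the highest-scoring token
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens = tokens + [next_token]
        if next_token == eos_id:  # end-of-sequence stopping condition
            break
    return tokens
```

In practice the selection step is a sampler (temperature, top-k, top-p, etc.) rather than a plain argmax, but the append-and-repeat structure is the same.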
This process divides naturally into two distinct computational phases:
Prefill Phase
The prefill (or "prompt processing") phase processes the entire input prompt in a single forward pass:
- All input tokens are processed simultaneously through the transformer layers.
- This phase is compute-bound: the work grows with prompt length (linearly in the feed-forward layers, quadratically in attention), but can be parallelized across tokens within each layer.
- The KV-cache is populated with the Key and Value projections for every layer and every input token position.
- The prefill phase produces the first output token and the initial KV-cache state.
Performance is measured in tokens per second (tok/s) during prefill, representing how quickly the model can process the input context.
Decode Phase
The decode (or "generation") phase generates tokens one at a time:
- Each step processes only the single most recently generated token.
- This phase is memory-bandwidth-bound: the entire model weights must be read from memory for each token, but only a single token's worth of computation is performed.
- The KV-cache is incrementally extended: the new token's Key and Value projections are appended to the existing cache, and attention is computed over the full cached sequence.
- The decode phase continues until a stopping condition is reached.
Performance is measured in tokens per second (tok/s) during decode, representing the generation throughput.
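The asymmetry between the two phases can be made concrete with a sketch. The `kv_proj` and `step_fn` callbacks below are hypothetical stand-ins for the model's K/V projection and next-token choice; the point is the structure: prefill builds one cache entry per prompt token in a single pass, while decode adds exactly one entry per generated token.

```python
def prefill(prompt, kv_proj):
    """Prefill: project K/V for every prompt token in a single batched pass."""
    return [kv_proj(tok) for tok in prompt]  # KV-cache, one entry per position

def decode(cache, first_tok, step_fn, kv_proj, max_steps):
    """Decode: extend the cache by exactly one entry per generated token."""
    out, tok = [], first_tok
    for _ in range(max_steps):
        cache.append(kv_proj(tok))  # only the newest token is projected
        tok = step_fn(cache)        # stand-in for the model's next-token choice
        out.append(tok)
    return out
```

This is why prefill throughput benefits from parallelism across tokens, while decode throughput is limited by how fast the weights and the growing cache can be streamed from memory.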
KV-Cache Optimization
The Key-Value cache is the central optimization that makes autoregressive generation practical:
- Without KV-cache: Each generated token would require re-running the forward pass over the entire sequence, recomputing the Key and Value projections for every earlier position from scratch; the projection work alone grows as O(n^2) for generating n tokens.
- With KV-cache: Previously computed Key and Value tensors are stored and reused. Each decode step computes only the new token's Q, K, V projections and attends to the cached K/V, reducing per-step computation to O(n) (where n is the current sequence length).
- KV-cache reuse (`reuse_kv`): In multi-turn conversations, the KV-cache from prior turns can be retained, avoiding re-prefilling the conversation history.
- KV-cache rollback: For multi-prompt scenarios, the runtime supports selectively rolling back the KV-cache to reuse a common prefix across different continuations.
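The key property of the cache is that an incremental decode step produces exactly the same attention output as recomputing from scratch. The toy below uses scalar keys/values and single-head dot-product attention (my own simplification, not MNN code) to show that appending one K/V pair and attending over the cache matches full recomputation:

```python
import math

def attend(query, keys, values):
    """Toy scalar single-head attention: softmax(q*k) weighted sum of values."""
    scores = [query * k for k in keys]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def decode_step(cache_k, cache_v, new_k, new_v, query):
    """One cached decode step: append only the new token's K/V, attend over all."""
    cache_k.append(new_k)
    cache_v.append(new_v)
    return attend(query, cache_k, cache_v)
```

The saving is entirely in what gets computed per step: the attention itself still reads the whole cache, but the K/V projections for past positions are never redone.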
Interaction Modes
The MNN LLM runtime supports multiple interaction modes:
- Interactive chat: A REPL-style interface where the user types prompts and receives streaming responses. Supports `/reset` to clear conversation state and `/exit` to quit. Uses `ChatMessages` with system/user/assistant roles.
- Batch evaluation: Processes a file of prompts (one per line), generating responses for each. Used for benchmarking and automated testing.
- C-Eval: A specialized mode for multiple-choice evaluation datasets in CSV format, used for academic benchmark evaluation.
- Benchmarking: The `llm_bench` tool performs structured performance measurement across configurable combinations of models, backends, thread counts, prompt lengths, and generation lengths.
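The batch-evaluation mode reduces to a simple loop over a prompt file. This is a hypothetical sketch (the `generate` callback stands in for the runtime's response call), not the actual MNN driver:

```python
def batch_eval(prompt_lines, generate):
    """Batch evaluation: one prompt per line, one response per prompt."""
    results = []
    for line in prompt_lines:
        prompt = line.strip()
        if not prompt:
            continue  # skip blank lines in the prompt file
        results.append((prompt, generate(prompt)))
    return results
```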
Performance Metrics
The MNN runtime collects detailed performance metrics during inference:
- Prefill time: Total time to process the input prompt (in seconds)
- Decode time: Total time for autoregressive token generation (in seconds)
- Sample time: Time spent in the sampling step (token selection from logits)
- Prefill speed: Prompt tokens processed per second
- Decode speed: Tokens generated per second
- Vision speed: Megapixels processed per second (for VL models)
- Audio RTF: Real-time factor for audio processing (for Audio models)
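The throughput figures follow directly from the raw counts and timings. A minimal sketch of the arithmetic (function name and dictionary keys are my own, not MNN identifiers):

```python
def inference_metrics(prompt_tokens, prefill_s, gen_tokens, decode_s):
    """Derive throughput figures from token counts and phase timings."""
    return {
        "prefill_speed_tok_s": prompt_tokens / prefill_s,  # prompt tokens per second
        "decode_speed_tok_s": gen_tokens / decode_s,       # generated tokens per second
        "total_s": prefill_s + decode_s,                   # end-to-end wall time
    }
```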
Tuning and Warm-Up
Before inference begins, the MNN runtime optionally performs:
- Tuning: The `tuning_prepare()` function pre-optimizes operator configurations for various sequence lengths (1, 5, 10, 20, 30, 50, 100 tokens), ensuring optimal kernel selection at runtime.
- Warm-up: An initial inference pass to populate caches and trigger JIT compilation of GPU kernels.
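Structurally, the tuning step amounts to running the per-length tuner once for each probed sequence length. A hedged sketch, assuming a `tune_one_length` callback that stands in for whatever kernel-selection work the backend performs:

```python
# sequence lengths probed during tuning, per the description above
TUNING_LENGTHS = (1, 5, 10, 20, 30, 50, 100)

def tuning_prepare_sketch(tune_one_length):
    """Run the per-length tuning callback for each probed sequence length."""
    return {n: tune_one_length(n) for n in TUNING_LENGTHS}
```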
Key Design Decisions
- Streaming output: Generation results are streamed to stdout token-by-token, providing responsive user feedback rather than waiting for the complete response.
- Executor scope isolation: Each inference session creates a dedicated `MNN::Express::ExecutorScope`, ensuring thread-safe resource management.
- Configurable thinking mode: For models like Qwen3 that support "thinking" tokens, the runtime can disable thinking via a Jinja context configuration, allowing control over whether reasoning tokens are emitted.
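Streaming output means forwarding each decoded piece to the sink as soon as it is available instead of buffering the whole response. A minimal sketch (the `write` sink is a stand-in for flushed stdout writes in the CLI):

```python
def stream_response(pieces, write):
    """Emit each decoded piece immediately; also return the full joined text."""
    emitted = []
    for piece in pieces:
        write(piece)  # e.g. sys.stdout.write + flush in a CLI
        emitted.append(piece)
    return "".join(emitted)
```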
Related Pages
- Implementation:Alibaba_MNN_LLM_Demo_CLI
- Heuristic:Alibaba_MNN_LLM_Runtime_Tuning
- Principle:Alibaba_MNN_LLM_Runtime_Configuration - Previous stage: configuring runtime parameters