Principle:Alibaba MNN LLM Inference Execution
| Field | Value |
|---|---|
| principle_name | LLM_Inference_Execution |
| repository | Alibaba_MNN |
| workflow | LLM_Deployment_Pipeline |
| pipeline_stage | Inference Execution |
| principle_type | Conceptual |
| last_updated | 2026-02-10 14:00 GMT |
Overview
LLM Inference Execution is the final stage of the MNN LLM deployment pipeline, where the exported and configured model is loaded and used to generate text. This principle covers the theory of autoregressive text generation with KV-cache optimization, the two-phase (prefill + decode) execution model, and the different interaction modes supported by the MNN runtime.
Theoretical Background
Autoregressive Text Generation
Large language models generate text one token at a time in an autoregressive fashion. Given a sequence of input tokens, the model predicts a probability distribution over the vocabulary for the next token. The selected token is appended to the sequence, and the process repeats until a stopping condition is met (end-of-sequence token, maximum token count, or other criteria).
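The loop above can be sketched in a few lines. This is a toy illustration of greedy autoregressive decoding, not MNN's actual API: `model`, `toy_model`, and the greedy `max` selection are stand-ins for the real transformer forward pass and sampler.

```python
def generate(model, tokens, eos_id, max_new_tokens):
    """Autoregressive loop: predict, append, repeat until a stop condition."""
    for _ in range(max_new_tokens):
        logits = model(tokens)  # probability distribution over the vocabulary
        # greedy selection: take the highest-scoring token
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens = tokens + [next_token]
        if next_token == eos_id:  # end-of-sequence stopping condition
            break
    return tokens
```

In practice the selection step is a sampler (temperature, top-k, top-p, etc.) rather than a plain argmax, but the append-and-repeat structure is the same.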
This process divides naturally into two distinct computational phases:
Prefill Phase
The prefill (or "prompt processing") phase processes the entire input prompt in a single forward pass:
- All input tokens are processed simultaneously through the transformer layers.
- This phase is compute-bound: the work grows with prompt length (linearly in the feed-forward layers, quadratically in attention), but can be parallelized across tokens within each layer.
- The KV-cache is populated with the Key and Value projections for every layer and every input token position.
- The prefill phase produces the first output token and the initial KV-cache state.
Performance is measured in tokens per second (tok/s) during prefill, representing how quickly the model can process the input context.
Decode Phase
The decode (or "generation") phase generates tokens one at a time:
- Each step processes only the single most recently generated token.
- This phase is memory-bandwidth-bound: the entire model weights must be read from memory for each token, but only a single token's worth of computation is performed.
- The KV-cache is incrementally extended: the new token's Key and Value projections are appended to the existing cache, and attention is computed over the full cached sequence.
- The decode phase continues until a stopping condition is reached.
Performance is measured in tokens per second (tok/s) during decode, representing the generation throughput.
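The asymmetry between the two phases can be made concrete with a sketch. The `kv_proj` and `step_fn` callbacks below are hypothetical stand-ins for the model's K/V projection and next-token choice; the point is the structure: prefill builds one cache entry per prompt token in a single pass, while decode adds exactly one entry per generated token.

```python
def prefill(prompt, kv_proj):
    """Prefill: project K/V for every prompt token in a single batched pass."""
    return [kv_proj(tok) for tok in prompt]  # KV-cache, one entry per position

def decode(cache, first_tok, step_fn, kv_proj, max_steps):
    """Decode: extend the cache by exactly one entry per generated token."""
    out, tok = [], first_tok
    for _ in range(max_steps):
        cache.append(kv_proj(tok))  # only the newest token is projected
        tok = step_fn(cache)        # stand-in for the model's next-token choice
        out.append(tok)
    return out
```

This is why prefill throughput benefits from parallelism across tokens, while decode throughput is limited by how fast the weights and the growing cache can be streamed from memory.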
KV-Cache Optimization
The Key-Value cache is the central optimization that makes autoregressive generation practical:
- Without KV-cache: Each generated token would require re-running the forward pass over the entire sequence, recomputing the Key and Value projections for every earlier position from scratch; the projection work alone grows as O(n^2) for generating n tokens.
- With KV-cache: Previously computed Key and Value tensors are stored and reused. Each decode step computes only the new token's Q, K, V projections and attends to the cached K/V, reducing per-step computation to O(n) (where n is the current sequence length).
- KV-cache reuse (`reuse_kv`): In multi-turn conversations, the KV-cache from prior turns can be retained, avoiding re-prefilling the conversation history.
- KV-cache rollback: For multi-prompt scenarios, the runtime supports selectively rolling back the KV-cache to reuse a common prefix across different continuations.
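The key property of the cache is that an incremental decode step produces exactly the same attention output as recomputing from scratch. The toy below uses scalar keys/values and single-head dot-product attention (my own simplification, not MNN code) to show that appending one K/V pair and attending over the cache matches full recomputation:

```python
import math

def attend(query, keys, values):
    """Toy scalar single-head attention: softmax(q*k) weighted sum of values."""
    scores = [query * k for k in keys]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def decode_step(cache_k, cache_v, new_k, new_v, query):
    """One cached decode step: append only the new token's K/V, attend over all."""
    cache_k.append(new_k)
    cache_v.append(new_v)
    return attend(query, cache_k, cache_v)
```

The saving is entirely in what gets computed per step: the attention itself still reads the whole cache, but the K/V projections for past positions are never redone.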
Interaction Modes
The MNN LLM runtime supports multiple interaction modes:
- Interactive chat: A REPL-style interface where the user types prompts and receives streaming responses. Supports `/reset` to clear conversation state and `/exit` to quit. Uses `ChatMessages` with system/user/assistant roles.
- Batch evaluation: Processes a file of prompts (one per line), generating responses for each. Used for benchmarking and automated testing.
- C-Eval: A specialized mode for multiple-choice evaluation datasets in CSV format, used for academic benchmark evaluation.
- Benchmarking: The `llm_bench` tool performs structured performance measurement across configurable combinations of models, backends, thread counts, prompt lengths, and generation lengths.
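The batch-evaluation mode reduces to a simple loop over a prompt file. This is a hypothetical sketch (the `generate` callback stands in for the runtime's response call), not the actual MNN driver:

```python
def batch_eval(prompt_lines, generate):
    """Batch evaluation: one prompt per line, one response per prompt."""
    results = []
    for line in prompt_lines:
        prompt = line.strip()
        if not prompt:
            continue  # skip blank lines in the prompt file
        results.append((prompt, generate(prompt)))
    return results
```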
Performance Metrics
The MNN runtime collects detailed performance metrics during inference:
- Prefill time: Total time to process the input prompt (in seconds)
- Decode time: Total time for autoregressive token generation (in seconds)
- Sample time: Time spent in the sampling step (token selection from logits)
- Prefill speed: Prompt tokens processed per second
- Decode speed: Tokens generated per second
- Vision speed: Megapixels processed per second (for VL models)
- Audio RTF: Real-time factor for audio processing (for Audio models)
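The throughput figures follow directly from the raw counts and timings. A minimal sketch of the arithmetic (function name and dictionary keys are my own, not MNN identifiers):

```python
def inference_metrics(prompt_tokens, prefill_s, gen_tokens, decode_s):
    """Derive throughput figures from token counts and phase timings."""
    return {
        "prefill_speed_tok_s": prompt_tokens / prefill_s,  # prompt tokens per second
        "decode_speed_tok_s": gen_tokens / decode_s,       # generated tokens per second
        "total_s": prefill_s + decode_s,                   # end-to-end wall time
    }
```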
Tuning and Warm-Up
Before inference begins, the MNN runtime optionally performs:
- Tuning: The `tuning_prepare()` function pre-optimizes operator configurations for various sequence lengths (1, 5, 10, 20, 30, 50, 100 tokens), ensuring optimal kernel selection at runtime.
- Warm-up: An initial inference pass to populate caches and trigger JIT compilation of GPU kernels.
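Structurally, the tuning step amounts to running the per-length tuner once for each probed sequence length. A hedged sketch, assuming a `tune_one_length` callback that stands in for whatever kernel-selection work the backend performs:

```python
# sequence lengths probed during tuning, per the description above
TUNING_LENGTHS = (1, 5, 10, 20, 30, 50, 100)

def tuning_prepare_sketch(tune_one_length):
    """Run the per-length tuning callback for each probed sequence length."""
    return {n: tune_one_length(n) for n in TUNING_LENGTHS}
```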
Key Design Decisions
- Streaming output: Generation results are streamed to stdout token-by-token, providing responsive user feedback rather than waiting for the complete response.
- Executor scope isolation: Each inference session creates a dedicated `MNN::Express::ExecutorScope`, ensuring thread-safe resource management.
- Configurable thinking mode: For models like Qwen3 that support "thinking" tokens, the runtime can disable thinking via a Jinja context configuration, allowing control over whether reasoning tokens are emitted.
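Streaming output means forwarding each decoded piece to the sink as soon as it is available instead of buffering the whole response. A minimal sketch (the `write` sink is a stand-in for flushed stdout writes in the CLI):

```python
def stream_response(pieces, write):
    """Emit each decoded piece immediately; also return the full joined text."""
    emitted = []
    for piece in pieces:
        write(piece)  # e.g. sys.stdout.write + flush in a CLI
        emitted.append(piece)
    return "".join(emitted)
```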
Related Pages
- Implementation:Alibaba_MNN_LLM_Demo_CLI
- Heuristic:Alibaba_MNN_LLM_Runtime_Tuning
- Principle:Alibaba_MNN_LLM_Runtime_Configuration - Previous stage: configuring runtime parameters