Principle:Alibaba MNN LLM Runtime Configuration
| Field | Value |
|---|---|
| principle_name | LLM_Runtime_Configuration |
| repository | Alibaba_MNN |
| workflow | LLM_Deployment_Pipeline |
| pipeline_stage | Runtime Configuration |
| principle_type | Conceptual |
| last_updated | 2026-02-10 14:00 GMT |
Overview
LLM Runtime Configuration governs how the MNN inference engine executes LLM inference at runtime. Through a declarative JSON configuration file (config.json), users control hardware backend selection, precision and memory strategies, generation limits, sampling behavior, and advanced features like speculative decoding and KV-cache management. This configuration layer separates deployment tuning from the model itself, allowing the same exported model to run optimally across different hardware targets.
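As an illustrative sketch, a minimal config.json combining the dimensions discussed on this page might look like the following. Only `precision`, `memory`, `thread_num`, and `reuse_kv` are named elsewhere on this page; the remaining keys (`backend_type`, `max_new_tokens`) are assumptions about the schema and should be checked against the repository.

```json
{
  "backend_type": "cpu",
  "thread_num": 4,
  "precision": "low",
  "memory": "low",
  "max_new_tokens": 512,
  "reuse_kv": true
}
```

Because the file is declarative, switching hardware targets is a matter of editing a few keys rather than re-exporting the model.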
Theoretical Background
Hardware Backend Selection
The MNN runtime supports multiple compute backends, each with different performance and compatibility characteristics:
- CPU: The universal fallback backend, supporting all operations. Performance depends on instruction set support (NEON, AVX, SSE, SME2). Thread count directly impacts throughput.
- OpenCL: GPU acceleration via OpenCL, primarily used on Android devices with compatible GPU drivers. Requires kernel tuning on the first run (results are cached for subsequent runs). The `thread_num` parameter has a different meaning for OpenCL: a value of 68 indicates OpenCL buffer storage mode with wide tuning.
- Metal: GPU acceleration via Apple Metal, used on iOS and macOS. Supports fused Flash Attention implementations for memory efficiency.
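A sketch of an OpenCL-targeted fragment using the special `thread_num` value described above (the `backend_type` key name is an assumption, as this page does not state the backend key):

```json
{
  "backend_type": "opencl",
  "thread_num": 68
}
```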
Precision and Memory Strategies
Two key configuration dimensions control the quality-performance tradeoff:
- Precision (`precision`): Controls the floating-point precision used during computation.
  - `"low"`: Uses fp16 where possible, maximizing throughput on hardware with fp16 support.
  - `"high"`: Uses fp32 for higher numerical accuracy at the cost of reduced throughput and increased memory usage.
- Memory (`memory`): Controls runtime memory optimization.
  - `"low"`: Enables runtime quantization (dynamic quantization of activations during inference), reducing peak memory usage.
  - `"high"`: Disables runtime quantization, using full-precision activations for maximum accuracy.
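For example, an accuracy-oriented desktop profile would pair both `"high"` modes, whereas the mobile-oriented default pairs both `"low"` modes:

```json
{
  "precision": "high",
  "memory": "high"
}
```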
Sampling Theory
Text generation from LLMs is an autoregressive process where each token is sampled from a probability distribution. The configuration supports multiple sampling strategies:
- Greedy (`"greedy"`): Always selects the highest-probability token. Deterministic, but may produce repetitive or degenerate text.
- Temperature (`"temperature"`): Scales the logits by `1/temperature` before softmax. Higher temperature increases randomness; lower temperature approaches greedy behavior.
- Top-K (`"topK"`): Restricts sampling to the K most probable tokens, then renormalizes.
- Top-P (Nucleus) (`"topP"`): Restricts sampling to the smallest set of tokens whose cumulative probability exceeds P, then renormalizes.
- Min-P (`"minP"`): Filters out tokens with probability less than `minP * max_probability`.
- TFS (Tail-Free Sampling) (`"tfs"`): Filters tokens based on the second derivative of the sorted probability distribution.
- Typical (`"typical"`): Selects tokens whose information content is close to the expected information content of the distribution.
- Penalty (`"penalty"`): Applies a repetition penalty to tokens that have appeared before, with configurable n-gram-based penalties.
- Mixed (`"mixed"`): Chains multiple samplers in sequence, applying each filter in order. This is the recommended approach for diverse yet coherent outputs.
KV-Cache Management
The Key-Value cache stores intermediate attention computations from prior tokens to avoid redundant computation during autoregressive generation:
- Reuse KV (`reuse_kv`): When enabled, multi-turn conversations reuse the KV-cache from previous turns, avoiding re-computation of the shared context prefix.
- KV-cache mmap (`kvcache_mmap`): When memory is insufficient, the KV-cache can be written to disk using memory-mapped files, trading I/O performance for reduced RAM usage.
- Chunked processing (`chunk`): Limits the maximum number of tokens processed per step during prefill, splitting long prompts into chunks to control peak memory usage.
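A memory-constrained multi-turn profile might combine all three knobs; the key names come from this page, but the `chunk` value below is purely illustrative:

```json
{
  "reuse_kv": true,
  "kvcache_mmap": true,
  "chunk": 512
}
```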
Attention Modes
The `attention_mode` parameter controls the Flash Attention implementation and QKV quantization behavior:
- CPU attention: Options 0-2 (no Flash Attention, with varying QKV quantization levels) and 8-10 (Flash Attention, with varying QKV quantization levels). Default is 8 (Flash Attention, no QKV quantization).
- GPU attention: Options 0 (naive attention), 8 (stepwise Flash Attention), and 16 (fused single-op Flash Attention with minimal memory). Default is 8.
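On a GPU target where peak memory is the constraint, one might override the default of 8 with the fused single-op variant (a sketch; verify the accepted values against the repository):

```json
{
  "attention_mode": 16
}
```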
Key Design Decisions
- Declarative configuration: All runtime parameters are expressed as JSON key-value pairs, enabling easy version control and reproducibility of deployment configurations.
- Separation from model artifacts: The `config.json` is generated during export but can be freely modified without re-exporting the model.
- Sensible defaults: The export pipeline generates a default configuration targeting CPU with low precision and low memory modes, suitable for mobile deployment out of the box.
- Dynamic reconfiguration: The C++ API supports `llm->set_config()` for runtime parameter updates without reloading the model.
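Since `set_config()` accepts the same declarative keys, a runtime update is just a small JSON payload. As a sketch, switching backends mid-session might pass something like (the `backend_type` key name is an assumption not stated on this page):

```json
{
  "backend_type": "opencl"
}
```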
Related Pages
- Implementation:Alibaba_MNN_LLM_Config_JSON
- Heuristic:Alibaba_MNN_LLM_Runtime_Tuning
- Principle:Alibaba_MNN_LLM_Engine_Compilation - Previous stage: compiling the engine
- Principle:Alibaba_MNN_LLM_Inference_Execution - Next stage: running inference