Heuristic: Alibaba MNN LLM Runtime Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Inference |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Performance tuning guide for MNN's LLM inference engine, covering attention modes, KV-cache, memory management, and chunked prefill.
Description
MNN's LLM engine has several runtime configuration knobs that significantly affect performance. Key areas include attention computation mode (with Flash Attention support), KV-cache management (including mmap to disk), chunked prefill for memory-constrained devices, and precision control.
Usage
Use when deploying LLMs on mobile or edge devices where memory and latency are constrained.
The Insight (Rule of Thumb)
- Attention Mode: Set `attention_mode` in config.json. Default 8 enables Flash Attention on CPU. Values 0-10 for CPU, 0/8/16 for GPU. Legacy key `quant_qkv` is deprecated.
- KV-Cache: Enable `kvcache_mmap` to page KV cache to disk for long contexts. Enable `use_mmap` for model weights.
- Chunked Prefill: Set `chunk` (e.g., 128) to limit tokens per forward pass, reducing peak memory. Use `chunk_limits` array for tiered sizes.
- Precision: Set `precision` to `low` for FP16 on supported hardware (50% memory, ~20% faster).
- Threading: `thread_num` 4-8 for CPU; 68 for OpenCL buffer mode.
- Trade-off: mmap reduces RAM but increases latency from disk I/O; chunked prefill reduces peak memory but increases prefill time.
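The knobs above live in the model's config.json. A hypothetical example combining them (key names from the bullets above; the values are illustrative, not verified defaults):

```json
{
  "attention_mode": 8,
  "kvcache_mmap": true,
  "use_mmap": true,
  "chunk": 128,
  "precision": "low",
  "thread_num": 4
}
```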
Reasoning
LLMs keep a KV cache that grows linearly with context length, and naive attention needs scratch memory quadratic in sequence length. Flash Attention reduces that attention memory from O(n^2) to O(n); mmap lets weights and the KV cache be paged to disk instead of held resident; chunked prefill bounds the per-pass working set so long prompts don't OOM.
Code Evidence
Attention mode with deprecated `quant_qkv` fallback from `llm.cpp:131-133`:

```cpp
// attention_mode config, with fallback to deprecated quant_qkv
int attention_mode = config->attention_mode();
if (attention_mode < 0) attention_mode = config->quant_qkv();
```
Chunk and `chunk_limits` handling from `llm.cpp:101-122`:

```cpp
int chunk = config->chunk();
auto chunk_limits = config->chunk_limits();
// chunk_limits is an array of tiered chunk sizes,
// e.g., [64, 128, 256] for progressive prefill
if (chunk > 0) {
    // Process input in chunks of 'chunk' tokens per forward pass.
    // Reduces peak memory at the cost of more forward passes.
}
```
KV-cache mmap config from `llmconfig.hpp:450-451`:

```cpp
bool kvcache_mmap() const; // page KV cache to disk
bool use_mmap() const;     // mmap model weights
```