
Heuristic:Alibaba MNN LLM Runtime Tuning

From Leeroopedia

Knowledge Sources
Domains Optimization, LLMs, Inference
Last Updated 2026-02-10 12:00 GMT

Overview

Performance tuning guide for MNN's LLM inference engine, covering attention modes, KV-cache, memory management, and chunked prefill.

Description

MNN's LLM engine has several runtime configuration knobs that significantly affect performance. Key areas include attention computation mode (with Flash Attention support), KV-cache management (including mmap to disk), chunked prefill for memory-constrained devices, and precision control.

Usage

Use when deploying LLMs on mobile or edge devices where memory and latency are constrained.

The Insight (Rule of Thumb)

  • Attention Mode: Set `attention_mode` in config.json. Default 8 enables Flash Attention on CPU. Values 0-10 for CPU, 0/8/16 for GPU. Legacy key `quant_qkv` is deprecated.
  • KV-Cache: Enable `kvcache_mmap` to page KV cache to disk for long contexts. Enable `use_mmap` for model weights.
  • Chunked Prefill: Set `chunk` (e.g., 128) to limit tokens per forward pass, reducing peak memory. Use `chunk_limits` array for tiered sizes.
  • Precision: Set `precision` to `low` for FP16 on supported hardware (50% memory, ~20% faster).
  • Threading: Set `thread_num` to 4-8 for CPU. For the OpenCL backend, `thread_num` encodes GPU mode flags rather than a thread count; 68 (buffer memory mode plus wide tuning) is the commonly recommended value.
  • Trade-off: mmap reduces RAM but increases latency from disk I/O; chunked prefill reduces peak memory but increases prefill time.
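These knobs map onto entries in `config.json`. A minimal illustrative config combining the rules of thumb above (the key names come from this page; the specific values are examples, not universal recommendations):

```json
{
  "attention_mode": 8,
  "kvcache_mmap": true,
  "use_mmap": true,
  "chunk": 128,
  "precision": "low",
  "thread_num": 4
}
```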

Reasoning

LLMs have large KV caches that grow with context length. Flash Attention reduces memory from O(n^2) to O(n). mmap allows weight/KV offloading to disk. Chunked prefill prevents OOM on long prompts.
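The linear growth of the KV cache is easy to see with back-of-envelope arithmetic: each token stores one K and one V vector per layer. A minimal sketch (the model dimensions below are illustrative of a small 1.5B-class LLM, not taken from any specific MNN model):

```python
# Back-of-envelope KV-cache size: 2 tensors (K and V) per layer,
# each holding kv_heads * head_dim values per token.
# Dimensions are illustrative assumptions, not MNN defaults.
def kv_cache_bytes(ctx_len, layers=28, kv_heads=2, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * ctx_len * dtype_bytes

# FP16 cache grows linearly with context length:
print(kv_cache_bytes(1024) / 2**20)   # → 28.0 (MiB at 1k tokens)
print(kv_cache_bytes(32768) / 2**20)  # → 896.0 (MiB at 32k tokens)
```

At long contexts the cache dwarfs a quantized model's weights, which is why paging it to disk via `kvcache_mmap` pays off on memory-constrained devices.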

Code Evidence

Attention mode with deprecated quant_qkv fallback from `llm.cpp:131-133`:

// attention_mode config, with fallback to deprecated quant_qkv
int attention_mode = config->attention_mode();
if (attention_mode < 0) attention_mode = config->quant_qkv();

Chunk and chunk_limits handling from `llm.cpp:101-122`:

int chunk = config->chunk();
auto chunk_limits = config->chunk_limits();
// chunk_limits is an array of tiered chunk sizes
// e.g., [64, 128, 256] for progressive prefill
if (chunk > 0) {
    // Process input in chunks of 'chunk' tokens per forward pass
    // Reduces peak memory at the cost of more forward passes
}
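The chunking above can be sketched as a plain loop (a simplified model of the idea, not MNN's actual implementation; `forward` is a hypothetical stand-in for one model forward pass):

```python
def chunked_prefill(tokens, chunk, forward):
    """Feed the prompt to the model `chunk` tokens at a time.

    Peak activation memory scales with `chunk` rather than len(tokens);
    the KV cache still accumulates an entry for every token.
    """
    logits = None
    for start in range(0, len(tokens), chunk):
        logits = forward(tokens[start:start + chunk])
    return logits  # logits from the final chunk seed decoding

# Example: a 300-token prompt with chunk=128 runs passes of 128, 128, 44.
sizes = []
chunked_prefill(list(range(300)), 128, lambda t: sizes.append(len(t)))
# sizes == [128, 128, 44]
```

Each extra forward pass adds launch and scheduling overhead, which is the prefill-time cost traded for the lower memory peak.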

KV-cache mmap config from `llmconfig.hpp:450-451`:

bool kvcache_mmap() const;  // page KV cache to disk
bool use_mmap() const;      // mmap model weights
