
Principle:Alibaba MNN LLM Runtime Configuration

From Leeroopedia


Field Value
principle_name LLM_Runtime_Configuration
repository Alibaba_MNN
workflow LLM_Deployment_Pipeline
pipeline_stage Runtime Configuration
principle_type Conceptual
last_updated 2026-02-10 14:00 GMT

Overview

LLM Runtime Configuration governs how the MNN engine executes LLM inference at runtime. Through a declarative JSON configuration file (config.json), users control hardware backend selection, precision and memory strategies, generation limits, sampling behavior, and advanced features like speculative decoding and KV-cache management. This configuration layer separates deployment tuning from the model itself, allowing the same exported model to run optimally across different hardware targets.

Theoretical Background

Hardware Backend Selection

The MNN runtime supports multiple compute backends, each with different performance and compatibility characteristics:

  • CPU: The universal fallback backend, supporting all operations. Performance depends on instruction set support (NEON, AVX, SSE, SME2). Thread count directly impacts throughput.
  • OpenCL: GPU acceleration via OpenCL, primarily used on Android devices with compatible GPU drivers. Requires kernel tuning on first run (cached for subsequent runs). The thread_num parameter has a different meaning for OpenCL: a value of 68 indicates OpenCL buffer storage mode with wide tuning.
  • Metal: GPU acceleration via Apple Metal, used on iOS and macOS. Supports fused Flash Attention implementations for memory efficiency.
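Backend selection and threading sit side by side in config.json. In the sketch below, the backend_type key name and its string values are assumptions drawn from common MNN usage (only thread_num is named above), so verify them against the MNN LLM documentation:

```json
{
  "backend_type": "opencl",
  "thread_num": 68
}
```

Here thread_num is 68 per the OpenCL convention described above; for the CPU backend it would instead be an ordinary worker-thread count such as 4.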

Precision and Memory Strategies

Two key configuration dimensions control the quality-performance tradeoff:

  • Precision (precision): Controls the floating-point precision used during computation.
    • "low": Uses fp16 where possible, maximizing throughput on hardware with fp16 support.
    • "high": Uses fp32 for higher numerical accuracy at the cost of reduced throughput and increased memory usage.
  • Memory (memory): Controls runtime memory optimization.
    • "low": Enables runtime quantization (dynamic quantization of activations during inference), reducing peak memory usage.
    • "high": Disables runtime quantization, using full-precision activations for maximum accuracy.
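The two dimensions combine independently. A mobile-oriented deployment chasing throughput and low peak RAM would set both to "low" (both keys and all values here are taken directly from the options above):

```json
{
  "precision": "low",
  "memory": "low"
}
```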

Sampling Theory

Text generation from LLMs is an autoregressive process where each token is sampled from a probability distribution. The configuration supports multiple sampling strategies:

  • Greedy ("greedy"): Always selects the highest-probability token. Deterministic but may produce repetitive or degenerate text.
  • Temperature ("temperature"): Scales the logits by 1/temperature before softmax. Higher temperature increases randomness; lower temperature approaches greedy behavior.
  • Top-K ("topK"): Restricts sampling to the K most probable tokens, then renormalizes.
  • Top-P (Nucleus) ("topP"): Restricts sampling to the smallest set of tokens whose cumulative probability exceeds P, then renormalizes.
  • Min-P ("minP"): Filters out tokens with probability less than minP * max_probability.
  • TFS (Tail-Free Sampling) ("tfs"): Filters tokens based on the second derivative of the sorted probability distribution.
  • Typical ("typical"): Selects tokens with information content close to the expected information content.
  • Penalty ("penalty"): Applies a repetition penalty to tokens that have appeared before, with configurable n-gram-based penalties.
  • Mixed ("mixed"): Chains multiple samplers in sequence, applying each filter in order. This is the recommended approach for diverse yet coherent outputs.
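To make the chaining concrete, here is a minimal, illustrative Python sketch of a two-stage sampler that applies temperature scaling and then top-p filtering, in the spirit of the "mixed" mode. It is a toy model of the math above, not MNN's implementation:

```python
import math
import random

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature before the softmax."""
    return [x / temperature for x in logits]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass
    reaches p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered

def sample(probs, rng):
    """Draw one token index from the distribution."""
    r = rng.random()
    cumulative = 0.0
    for i, q in enumerate(probs):
        cumulative += q
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point shortfall

# Chain the two stages, mirroring a "mixed" sampler pipeline.
logits = [2.0, 1.0, 0.5, -1.0]
probs = top_p_filter(softmax(apply_temperature(logits, 0.8)), 0.9)
token = sample(probs, random.Random(0))
```

With temperature 0.8 and p = 0.9, the lowest-probability token in this example falls outside the nucleus and receives zero probability before the draw.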

KV-Cache Management

The Key-Value cache stores intermediate attention computations from prior tokens to avoid redundant computation during autoregressive generation:

  • Reuse KV (reuse_kv): When enabled, multi-turn conversations reuse the KV-cache from previous turns, avoiding re-computation of the shared context prefix.
  • KV-cache mmap (kvcache_mmap): When memory is insufficient, the KV-cache can be written to disk using memory-mapped files, accepting slower disk I/O in exchange for reduced RAM usage.
  • Chunked processing (chunk): Limits the maximum number of tokens processed per step during prefill, splitting long prompts into chunks to control peak memory usage.
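A long-context, memory-constrained deployment might combine all three controls. The key names reuse_kv, kvcache_mmap, and chunk come from the list above; the value types and the chunk size of 512 tokens are illustrative assumptions:

```json
{
  "reuse_kv": true,
  "kvcache_mmap": true,
  "chunk": 512
}
```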

Attention Modes

The attention_mode parameter controls the Flash Attention implementation and QKV quantization behavior:

  • CPU attention: Options 0-2 (no Flash Attention, with varying QKV quantization levels) and 8-10 (Flash Attention, with varying QKV quantization levels). Default is 8 (Flash Attention, no QKV quantization).
  • GPU attention: Options 0 (naive attention), 8 (stepwise Flash Attention), and 16 (fused single-op Flash Attention with minimal memory). Default is 8.
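Since 8 is the default on both backends, an explicit setting is only needed to deviate from it — for example, to request the fused single-op Flash Attention path on GPU (per the option list above):

```json
{
  "attention_mode": 16
}
```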

Key Design Decisions

  • Declarative configuration: All runtime parameters are expressed as JSON key-value pairs, enabling easy version control and reproducibility of deployment configurations.
  • Separation from model artifacts: The config.json is generated during export but can be freely modified without re-exporting the model.
  • Sensible defaults: The export pipeline generates a default configuration targeting CPU with low precision and low memory modes, suitable for mobile deployment out of the box.
  • Dynamic reconfiguration: The C++ API supports llm->set_config() for runtime parameter updates without reloading the model.
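Putting these decisions together, a complete mobile-oriented config.json in the spirit of the described defaults might look like the sketch below. Keys not named earlier on this page (backend_type, max_new_tokens, sampler_type) are assumed names, and every value is illustrative rather than the pipeline's literal output:

```json
{
  "backend_type": "cpu",
  "thread_num": 4,
  "precision": "low",
  "memory": "low",
  "max_new_tokens": 512,
  "sampler_type": "mixed",
  "reuse_kv": true
}
```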
