
Heuristic:Alibaba MNN Memory Mode Selection

From Leeroopedia



Knowledge Sources
Domains Optimization, Memory_Management, Quantization
Last Updated 2026-02-10 14:00 GMT

Overview

Guide for selecting the right BackendConfig::MemoryMode (Normal, High, Low) to optimize memory usage and inference speed in MNN.

Description

MNN's BackendConfig provides three memory modes that control the memory-speed tradeoff during inference. Memory_Normal is the default balanced mode and works well for most use cases. Memory_High prioritizes speed over memory by pre-allocating larger buffers and caching intermediate results. Memory_Low enables dynamic weight dequantization for quantized models: weights stay compressed in memory and are computed with int8 arithmetic rather than being decompressed to float32, which reduces runtime memory usage.

The memory mode is set via the BackendConfig struct and passed to the Interpreter or RuntimeManager at session creation time. It cannot be changed after session creation.
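As a concrete illustration, a minimal sketch of setting the memory mode at session creation with the Interpreter API might look like the following. The model filename is a placeholder and error handling is omitted; this assumes MNN was compiled with -DMNN_LOW_MEMORY=ON so that Memory_Low takes effect.

```cpp
#include <MNN/Interpreter.hpp>
#include <memory>

int main() {
    // Load a weight-quantized model; the path is hypothetical.
    std::shared_ptr<MNN::Interpreter> net(
        MNN::Interpreter::createFromFile("model_quant.mnn"));

    // Configure the backend BEFORE creating the session; the memory
    // mode is fixed for the lifetime of the session.
    MNN::BackendConfig backendConfig;
    backendConfig.memory = MNN::BackendConfig::Memory_Low;

    MNN::ScheduleConfig scheduleConfig;
    scheduleConfig.type = MNN_FORWARD_CPU;
    scheduleConfig.backendConfig = &backendConfig;

    MNN::Session* session = net->createSession(scheduleConfig);
    net->runSession(session);
    return 0;
}
```

The same BackendConfig struct can be passed when creating a RuntimeManager for the Express/Module API; in both cases the mode cannot be changed after creation.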

Usage

Use this heuristic when:

  • You are running weight-quantized models and want to enable dynamic quantization acceleration with Memory_Low.
  • You have abundant RAM and want to maximize inference speed with Memory_High.
  • You are experiencing out-of-memory issues and need to reduce the runtime memory footprint.
  • You need to decide which memory mode to configure in your deployment configuration or LLM config JSON.
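For LLM deployments, the memory mode can also be selected through the model's config JSON rather than in C++ code. A hedged sketch is shown below; the field names follow MNN-LLM's config conventions, but verify them against the documentation for your MNN version:

```json
{
    "backend_type": "cpu",
    "thread_num": 4,
    "precision": "low",
    "memory": "low"
}
```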

The Insight (Rule of Thumb)

  • Action: Set backendConfig.memory = BackendConfig::Memory_Low for weight-quantized models.
  • Prerequisite: Must compile MNN with -DMNN_LOW_MEMORY=ON for Memory_Low to take effect. Without this build flag, Memory_Low behaves identically to Memory_Normal.
  • Value: Memory_Low enables runtime int8 computation instead of decompressing weights to float32, providing both memory savings and speed improvements on quantized models.
  • Trade-off: Memory_Low incurs a slight compute overhead from on-the-fly dequantization but saves significant runtime memory. Memory_High uses more memory but can improve speed for repeated inference by caching buffers.
  • Default: Memory_Normal (value 0) is the default and appropriate for non-quantized models or when memory constraints are not critical.

Reasoning

Without Memory_Low, weight-quantized models decompress quantized weights to float32 at runtime. This means the model is smaller on disk but consumes the same amount of RAM during inference as the original unquantized model: only storage is saved, not runtime memory.

With Memory_Low enabled (and the MNN_LOW_MEMORY build flag set), the quantized weights are used directly in int8 GEMM kernels. The weights remain in their compressed form in memory, and dequantization happens on-the-fly within the compute kernel. This provides both memory savings (weights stay compressed in RAM) and speed improvements (int8 arithmetic is faster than float32 on most hardware with SIMD support).
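To make the mechanism concrete, the following self-contained sketch (not MNN's actual kernel, which is SIMD-optimized and quantizes activations as well) shows the idea: weights stay stored as int8 with a per-row float scale, and dequantization is folded into the matrix-multiply loop instead of materializing a float32 copy of the weight matrix.

```cpp
#include <cstdint>
#include <vector>

// Sketch of dynamic dequantization in a GEMV: y = W * x, where W is
// stored as int8 with one float scale per output row (per-channel
// quantization). The weights remain compressed in memory; the scale is
// applied inside the compute loop, once per row, which is equivalent to
// dequantizing every weight up front but never allocates a float32 W.
std::vector<float> gemv_int8(const std::vector<int8_t>& W,    // rows*cols, row-major
                             const std::vector<float>& scale, // one scale per row
                             const std::vector<float>& x,     // length cols
                             int rows, int cols) {
    std::vector<float> y(rows, 0.0f);
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c) {
            // The int8 weight is read directly from the compressed buffer.
            sum += static_cast<float>(W[r * cols + c]) * x[c];
        }
        y[r] = sum * scale[r];  // dequantize the accumulated row result
    }
    return y;
}
```

In a production kernel the activations are also quantized so the inner loop runs entirely in int8 with an int32 accumulator, which is where the speed gain over float32 comes from on SIMD-capable hardware.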

Memory_High takes the opposite approach: it pre-allocates workspace buffers and may cache intermediate computation results to avoid recomputation. This is useful for latency-sensitive serving scenarios where the same model is called repeatedly and memory is not the bottleneck.

Code evidence from MNNForwardType.h:84:

enum MemoryMode {
    Memory_Normal = 0,
    Memory_High,
    Memory_Low
};

Code evidence from CMakeLists.txt build flag:

option(MNN_LOW_MEMORY "Build MNN support low memory for weight quant model." OFF)
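Since the option defaults to OFF, it must be enabled explicitly at configure time for Memory_Low to have any effect. A typical CMake invocation would look like the following (build directory and other options omitted):

```
cmake .. -DMNN_LOW_MEMORY=ON
```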

Code evidence from docs/tools/compress.md dynamic quantization section:

When using weight-quantized models with Memory_Low mode, MNN performs dynamic
dequantization within the GEMM kernel. The quantized weights are kept in int8
format in memory and dequantized on-the-fly during matrix multiplication,
saving both memory bandwidth and storage.
