Heuristic: Alibaba MNN Memory Mode Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management, Quantization |
| Last Updated | 2026-02-10 14:00 GMT |
Overview
Guide for selecting the right BackendConfig::MemoryMode value (Memory_Normal, Memory_High, Memory_Low) to balance memory usage and inference speed in MNN.
Description
MNN's BackendConfig provides three memory modes that control the memory-speed tradeoff during inference. Memory_Normal is the default balanced mode and works well for most use cases. Memory_High prioritizes speed over memory by pre-allocating larger buffers and caching intermediate results. Memory_Low enables dynamic weight dequantization for weight-quantized models: weights stay compressed and are used in int8 computation rather than being decompressed to float32, reducing runtime memory usage.
The memory mode is set via the BackendConfig struct and passed to the Interpreter or RuntimeManager at session creation time. It cannot be changed after session creation.
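A minimal sketch of setting the mode at session creation, assuming the standard MNN C++ API (`Interpreter`, `ScheduleConfig`, `BackendConfig`) and a hypothetical quantized model file `model_quant.mnn`:

```cpp
#include <MNN/Interpreter.hpp>
#include <memory>

int main() {
    // Hypothetical weight-quantized model path.
    std::shared_ptr<MNN::Interpreter> net(
        MNN::Interpreter::createFromFile("model_quant.mnn"));

    MNN::BackendConfig backendConfig;
    // Requires an MNN build with -DMNN_LOW_MEMORY=ON to have any effect.
    backendConfig.memory = MNN::BackendConfig::Memory_Low;

    MNN::ScheduleConfig config;
    config.type          = MNN_FORWARD_CPU;
    config.backendConfig = &backendConfig;

    // The memory mode is fixed for the lifetime of this session.
    MNN::Session* session = net->createSession(config);
    net->runSession(session);
    return 0;
}
```

Because the mode cannot be changed later, supporting multiple modes in one process means creating separate sessions.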
Usage
Use this heuristic when:
- You are running weight-quantized models and want to enable dynamic quantization acceleration with `Memory_Low`.
- You have abundant RAM and want to maximize inference speed with `Memory_High`.
- You are experiencing out-of-memory issues and need to reduce the runtime memory footprint.
- You need to decide which memory mode to configure in your deployment configuration or LLM config JSON.
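For the LLM runtime, the same choice is exposed through the config JSON rather than C++ code. A sketch, assuming the `memory` key used by MNN's LLM config (key names and accepted values may vary by version):

```json
{
    "backend_type": "cpu",
    "thread_num": 4,
    "precision": "low",
    "memory": "low"
}
```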
The Insight (Rule of Thumb)
- Action: Set `backendConfig.memory = BackendConfig::Memory_Low` for weight-quantized models.
- Prerequisite: MNN must be compiled with `-DMNN_LOW_MEMORY=ON` for `Memory_Low` to take effect. Without this build flag, `Memory_Low` behaves identically to `Memory_Normal`.
- Value: `Memory_Low` enables runtime int8 computation instead of decompressing weights to float32, providing both memory savings and speed improvements on quantized models.
- Trade-off: `Memory_Low` has slight compute overhead from dynamic dequantization but saves significant VRAM. `Memory_High` uses more memory but may improve speed for repeated inference by caching buffers.
- Default: `Memory_Normal` (value 0) is the default and is appropriate for non-quantized models or when memory constraints are not critical.
Reasoning
Without Memory_Low, weight-quantized models decompress quantized weights to float32 at runtime. This means the model is smaller on disk but consumes the same amount of RAM during inference as the original unquantized model -- only storage is saved, not runtime memory.
With Memory_Low enabled (and the MNN_LOW_MEMORY build flag set), the quantized weights are used directly in int8 GEMM kernels. The weights remain in their compressed form in memory, and dequantization happens on-the-fly within the compute kernel. This provides both memory savings (weights stay compressed in RAM) and speed improvements (int8 arithmetic is faster than float32 on most hardware with SIMD support).
Memory_High takes the opposite approach: it pre-allocates workspace buffers and may cache intermediate computation results to avoid recomputation. This is useful for latency-sensitive serving scenarios where the same model is called repeatedly and memory is not the bottleneck.
Code evidence from MNNForwardType.h:84:
enum MemoryMode {
    Memory_Normal = 0,
    Memory_High,
    Memory_Low
};
Code evidence from CMakeLists.txt build flag:
option(MNN_LOW_MEMORY "Build MNN support low memory for weight quant model." OFF)
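Since the option is OFF by default, it must be enabled explicitly at configure time. A sketch of a build invocation from an MNN checkout (directory layout and job count are illustrative):

```shell
# Configure with the low-memory path enabled, then build.
mkdir -p build && cd build
cmake .. -DMNN_LOW_MEMORY=ON
make -j4
```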
Code evidence from docs/tools/compress.md dynamic quantization section:
When using weight-quantized models with Memory_Low mode, MNN performs dynamic
dequantization within the GEMM kernel. The quantized weights are kept in int8
format in memory and dequantized on-the-fly during matrix multiplication,
saving both memory bandwidth and storage.