Implementation: Alibaba MNN Low Memory Runtime
| Field | Value |
|---|---|
| Implementation Name | MNN_Low_Memory_Runtime |
| Type | API Doc |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | Build-time and runtime configuration for dynamic weight dequantization during inference |
| Source File(s) | CMakeLists.txt:L79, include/MNN/MNNForwardType.h:L83-101, express/Executor.cpp:L28-49 |
| Last Updated | 2026-02-10 14:00 GMT |
API Signatures
Build-Time Configuration
cmake .. -DMNN_LOW_MEMORY=ON
make -j8
Runtime Configuration (C++)
MNN::BackendConfig backendConfig;
backendConfig.memory = MNN::BackendConfig::Memory_Low;
backendConfig.precision = MNN::BackendConfig::Precision_Normal;
MNN::ScheduleConfig config;
config.numThread = 4;
config.backendConfig = &backendConfig;
Runtime Configuration (Express API)
MNN::BackendConfig config;
config.memory = MNN::BackendConfig::Memory_Low;
MNN::Express::Executor::getGlobalExecutor()->setGlobalExecutorConfig(
MNN_FORWARD_CPU, config, numThread);
Source Definitions
CMake Option
From CMakeLists.txt (line 79):
option(MNN_LOW_MEMORY "Build MNN support low memory for weight quant model." OFF)
option(MNN_CPU_WEIGHT_DEQUANT_GEMM "Build MNN CPU weight dequant related gemm kernels." OFF)
BackendConfig Struct
From include/MNN/MNNForwardType.h (lines 83-101):
namespace MNN {
struct BackendConfig {
    enum MemoryMode { Memory_Normal = 0, Memory_High, Memory_Low };
    MemoryMode memory = Memory_Normal;
    enum PowerMode { Power_Normal = 0, Power_High, Power_Low };
    PowerMode power = Power_Normal;
    enum PrecisionMode { Precision_Normal = 0, Precision_High, Precision_Low, Precision_Low_BF16 };
    PrecisionMode precision = Precision_Normal;
    /** user defined context */
    union {
        void* sharedContext = nullptr;
        size_t flags; // Valid for CPU Backend
    };
};
} // namespace MNN
setGlobalExecutorConfig
From express/Executor.cpp (lines 28-49):
void Executor::setGlobalExecutorConfig(MNNForwardType type, const BackendConfig& config, int numberThread) {
    std::lock_guard<std::mutex> _l(mMutex);
    if (type == MNN_FORWARD_AUTO) {
        ScheduleConfig sConfig;
        sConfig.type = type;
        type = Schedule::getAppropriateType(sConfig);
    }
    auto rt = _getOrCreateRuntime(type, &config, numberThread);
    if (rt == nullptr) {
        type = MNN_FORWARD_CPU;
        numberThread = 1;
        rt = _getOrCreateRuntime(type, &config, numberThread);
    }
    MNN_ASSERT(nullptr != rt);
    mAttr->firstType = type;
    mAttr->numThread = numberThread;
    mAttr->config = config;
    mAttr->config.sharedContext = nullptr;
}
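Once the global executor is configured, modules loaded through the Express API pick up the low-memory runtime. A sketch of the typical follow-up (untested; the input/output tensor names are placeholders, not from the source):

```cpp
#include <MNN/expr/Executor.hpp>
#include <MNN/expr/Module.hpp>
#include <memory>

MNN::BackendConfig config;
config.memory = MNN::BackendConfig::Memory_Low;
MNN::Express::Executor::getGlobalExecutor()->setGlobalExecutorConfig(
    MNN_FORWARD_CPU, config, /*numberThread=*/4);

// Modules created after this point run on the low-memory CPU runtime.
// "input"/"output" are placeholders; use the model's actual tensor names.
std::shared_ptr<MNN::Express::Module> module(
    MNN::Express::Module::load({"input"}, {"output"}, "model_quant.mnn"));
```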
Parameters
Build-Time Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| MNN_LOW_MEMORY | CMake BOOL | OFF | Enables compilation of low-memory inference paths for weight-quantized models. When ON, the runtime can use int8 compute during GEMM operations on weight-quantized models. |
| MNN_CPU_WEIGHT_DEQUANT_GEMM | CMake BOOL | OFF | Specifically enables the CPU weight-dequantization GEMM kernels, providing finer control over which dequantization kernels are compiled. |
Runtime Parameters
| Parameter | Type | Values | Description |
|---|---|---|---|
| BackendConfig.memory | MemoryMode enum | Memory_Normal (0), Memory_High (1), Memory_Low (2) | Controls the memory optimization level. Memory_Low activates dynamic weight dequantization, using int8 compute during GEMM for weight-quantized models. Memory_Normal uses standard float inference. Memory_High may cache additional data for speed. |
| BackendConfig.precision | PrecisionMode enum | Precision_Normal (0), Precision_High (1), Precision_Low (2), Precision_Low_BF16 (3) | Controls compute precision. Precision_Low enables FP16 computation on capable hardware. Operates independently of the memory setting. |
| BackendConfig.power | PowerMode enum | Power_Normal (0), Power_High (1), Power_Low (2) | Controls the CPU power/performance trade-off (core affinity). |
| numThread | int | 1 | Number of threads for the inference backend. Passed to setGlobalExecutorConfig or set via ScheduleConfig.numThread. |
Inputs
- Weight-quantized MNN model -- A model previously compressed with MNNConvert --weightQuantBits. The weights must be stored in quantized format (int4 or int8).
- MNN_LOW_MEMORY compiled library -- The MNN shared library must be compiled with -DMNN_LOW_MEMORY=ON for the dynamic dequantization path to be available.
Outputs
- Inference session with dynamic weight dequantization -- The runtime creates an inference session where convolution and matrix multiplication operations consume quantized weights directly, performing int8 compute during GEMM instead of dequantizing to float32 first. This results in lower memory usage and potentially faster inference.
Usage Example
Full Workflow
# Step 1: Quantize weights
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_quant.mnn --weightQuantBits 8
# Step 2: Build MNN with low-memory support
cd MNN/build
cmake .. -DMNN_LOW_MEMORY=ON -DMNN_BUILD_SHARED_LIBS=ON
make -j8
// Step 3: Configure runtime for dynamic dequantization
#include <MNN/Interpreter.h>
#include <MNN/MNNForwardType.h>
auto interpreter = MNN::Interpreter::createFromFile("model_quant.mnn");
MNN::ScheduleConfig scheduleConfig;
scheduleConfig.numThread = 4;
MNN::BackendConfig backendConfig;
backendConfig.memory = MNN::BackendConfig::Memory_Low; // enable dynamic dequant
backendConfig.precision = MNN::BackendConfig::Precision_Normal;
scheduleConfig.backendConfig = &backendConfig;
auto session = interpreter->createSession(scheduleConfig);
interpreter->runSession(session);
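The example stops after runSession; a hedged continuation showing input/output tensor handling (untested sketch; tensor names and shapes are model-specific, and nullptr selects the default tensor):

```cpp
// Continuation sketch: feed input, run, and read output.
auto inputTensor = interpreter->getSessionInput(session, nullptr);
std::unique_ptr<MNN::Tensor> hostInput(
    MNN::Tensor::createHostTensorFromDevice(inputTensor, false));
// ... fill hostInput->host<float>() with preprocessed data ...
inputTensor->copyFromHostTensor(hostInput.get());
interpreter->runSession(session);
auto outputTensor = interpreter->getSessionOutput(session, nullptr);
std::unique_ptr<MNN::Tensor> hostOutput(
    MNN::Tensor::createHostTensorFromDevice(outputTensor, true));
outputTensor->copyToHostTensor(hostOutput.get());
// hostOutput->host<float>() now holds the result.
```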