
Implementation:Alibaba MNN MNN Low Memory Runtime

From Leeroopedia


Field Value
Implementation Name MNN_Low_Memory_Runtime
Type API Doc
Topic Model_Compression
Workflow Model_Compression
Description Build-time and runtime configuration for dynamic weight dequantization during inference
Source File(s) CMakeLists.txt:L79, include/MNN/MNNForwardType.h:L83-101, express/Executor.cpp:L28-49
Last Updated 2026-02-10 14:00 GMT

API Signatures

Build-Time Configuration

cmake .. -DMNN_LOW_MEMORY=ON
make -j8

Runtime Configuration (C++)

MNN::BackendConfig backendConfig;
backendConfig.memory = MNN::BackendConfig::Memory_Low;
backendConfig.precision = MNN::BackendConfig::Precision_Normal;

MNN::ScheduleConfig config;
config.numThread = 4;
config.backendConfig = &backendConfig;

Runtime Configuration (Express API)

int numThread = 4;
MNN::BackendConfig config;
config.memory = MNN::BackendConfig::Memory_Low;
MNN::Express::Executor::getGlobalExecutor()->setGlobalExecutorConfig(
    MNN_FORWARD_CPU, config, numThread);

Source Definitions

CMake Option

From CMakeLists.txt (line 79; the related MNN_CPU_WEIGHT_DEQUANT_GEMM option is shown for context):

option(MNN_LOW_MEMORY "Build MNN support low memory for weight quant model." OFF)
option(MNN_CPU_WEIGHT_DEQUANT_GEMM "Build MNN CPU weight dequant related gemm kernels." OFF)

BackendConfig Struct

From include/MNN/MNNForwardType.h (lines 83-101):

namespace MNN {
struct BackendConfig {
    enum MemoryMode { Memory_Normal = 0, Memory_High, Memory_Low };

    MemoryMode memory = Memory_Normal;

    enum PowerMode { Power_Normal = 0, Power_High, Power_Low };

    PowerMode power = Power_Normal;

    enum PrecisionMode { Precision_Normal = 0, Precision_High, Precision_Low, Precision_Low_BF16 };

    PrecisionMode precision = Precision_Normal;

    /** user defined context */
    union {
        void* sharedContext = nullptr;
        size_t flags; // Valid for CPU Backend
    };
};
} // namespace MNN

setGlobalExecutorConfig

From express/Executor.cpp (lines 28-49):

void Executor::setGlobalExecutorConfig(MNNForwardType type, const BackendConfig& config, int numberThread) {
    std::lock_guard<std::mutex> _l(mMutex);

    if(type == MNN_FORWARD_AUTO) {
        ScheduleConfig sConfig;
        sConfig.type = type;
        type = Schedule::getAppropriateType(sConfig);
    }
    auto rt = _getOrCreateRuntime(type, &config, numberThread);
    if (rt == nullptr) {
        type = MNN_FORWARD_CPU;
        numberThread = 1;
        rt = _getOrCreateRuntime(type, &config, numberThread);
    }
    MNN_ASSERT(nullptr != rt);
    mAttr->firstType = type;
    mAttr->numThread = numberThread;
    mAttr->config = config;
    mAttr->config.sharedContext = nullptr;
}

Parameters

Build-Time Parameters

  • MNN_LOW_MEMORY (CMake BOOL, default OFF) -- Enables compilation of low-memory inference paths for weight-quantized models. When ON, the runtime can use int8 compute during GEMM operations on weight-quantized models.
  • MNN_CPU_WEIGHT_DEQUANT_GEMM (CMake BOOL, default OFF) -- Enables the CPU weight-dequantization GEMM kernels specifically, providing finer control over which dequantization kernels are compiled.
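Both options can be enabled together at configure time; a typical low-memory build (assuming an in-tree build/ directory) might look like:

```shell
# Configure a shared-library build with the dynamic weight-dequantization path
# and the dedicated CPU dequant GEMM kernels compiled in.
cmake .. -DMNN_LOW_MEMORY=ON -DMNN_CPU_WEIGHT_DEQUANT_GEMM=ON -DMNN_BUILD_SHARED_LIBS=ON
make -j8
```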

Runtime Parameters

  • BackendConfig.memory (MemoryMode enum: Memory_Normal = 0, Memory_High = 1, Memory_Low = 2) -- Controls the memory optimization level. Memory_Low activates dynamic weight dequantization, using int8 compute during GEMM for weight-quantized models; Memory_Normal uses standard float inference; Memory_High may cache additional data for speed.
  • BackendConfig.precision (PrecisionMode enum: Precision_Normal = 0, Precision_High = 1, Precision_Low = 2, Precision_Low_BF16 = 3) -- Controls compute precision. Precision_Low enables FP16 computation on capable hardware and operates independently of the memory setting.
  • BackendConfig.power (PowerMode enum: Power_Normal = 0, Power_High = 1, Power_Low = 2) -- Controls the CPU power/performance trade-off (core affinity).
  • numThread (int, default 1) -- Number of threads for the inference backend; passed to setGlobalExecutorConfig or set via ScheduleConfig.numThread.

Inputs

  • Weight-quantized MNN model -- A model previously compressed with MNNConvert --weightQuantBits. The weights must be stored in quantized format (int4 or int8).
  • MNN_LOW_MEMORY compiled library -- The MNN shared library must be compiled with -DMNN_LOW_MEMORY=ON for the dynamic dequantization path to be available.
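For example, a 4-bit weight-quantized model could be produced with MNNConvert like so (assuming an ONNX source model; whether 4 is accepted depends on the MNN version, and 8 gives int8 weights as in the workflow below):

```shell
# Produce an int4 weight-quantized model; pass --weightQuantBits 8 for int8 weights
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_quant.mnn --weightQuantBits 4
```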

Outputs

  • Inference session with dynamic weight dequantization -- The runtime creates an inference session where convolution and matrix multiplication operations consume quantized weights directly, performing int8 compute during GEMM instead of dequantizing to float32 first. This results in lower memory usage and potentially faster inference.

Usage Example

Full Workflow

# Step 1: Quantize weights
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_quant.mnn --weightQuantBits 8

# Step 2: Build MNN with low-memory support
cd MNN/build
cmake .. -DMNN_LOW_MEMORY=ON -DMNN_BUILD_SHARED_LIBS=ON
make -j8

// Step 3: Configure runtime for dynamic dequantization
#include <MNN/Interpreter.h>
#include <MNN/MNNForwardType.h>

auto interpreter = MNN::Interpreter::createFromFile("model_quant.mnn");

MNN::ScheduleConfig scheduleConfig;
scheduleConfig.numThread = 4;

MNN::BackendConfig backendConfig;
backendConfig.memory = MNN::BackendConfig::Memory_Low;     // enable dynamic dequant
backendConfig.precision = MNN::BackendConfig::Precision_Normal;
scheduleConfig.backendConfig = &backendConfig;

auto session = interpreter->createSession(scheduleConfig);
interpreter->runSession(session);
