
Implementation:Alibaba MNN MNN Low Memory Runtime

From Leeroopedia


Field Value
Implementation Name MNN_Low_Memory_Runtime
Type API Doc
Topic Model_Compression
Workflow Model_Compression
Description Build-time and runtime configuration for dynamic weight dequantization during inference
Source File(s) CMakeLists.txt:L79, include/MNN/MNNForwardType.h:L83-101, express/Executor.cpp:L28-49
Last Updated 2026-02-10 14:00 GMT

API Signatures

Build-Time Configuration

cmake .. -DMNN_LOW_MEMORY=ON
make -j8

Runtime Configuration (C++)

MNN::BackendConfig backendConfig;
backendConfig.memory = MNN::BackendConfig::Memory_Low;
backendConfig.precision = MNN::BackendConfig::Precision_Normal;

MNN::ScheduleConfig config;
config.numThread = 4;
config.backendConfig = &backendConfig;

Runtime Configuration (Express API)

int numThread = 4;
MNN::BackendConfig config;
config.memory = MNN::BackendConfig::Memory_Low;
MNN::Express::Executor::getGlobalExecutor()->setGlobalExecutorConfig(
    MNN_FORWARD_CPU, config, numThread);

Source Definitions

CMake Option

From CMakeLists.txt (line 79; the related MNN_CPU_WEIGHT_DEQUANT_GEMM option is shown for context):

option(MNN_LOW_MEMORY "Build MNN support low memory for weight quant model." OFF)
option(MNN_CPU_WEIGHT_DEQUANT_GEMM "Build MNN CPU weight dequant related gemm kernels." OFF)

BackendConfig Struct

From include/MNN/MNNForwardType.h (lines 83-101):

namespace MNN {
struct BackendConfig {
    enum MemoryMode { Memory_Normal = 0, Memory_High, Memory_Low };

    MemoryMode memory = Memory_Normal;

    enum PowerMode { Power_Normal = 0, Power_High, Power_Low };

    PowerMode power = Power_Normal;

    enum PrecisionMode { Precision_Normal = 0, Precision_High, Precision_Low, Precision_Low_BF16 };

    PrecisionMode precision = Precision_Normal;

    /** user defined context */
    union {
        void* sharedContext = nullptr;
        size_t flags; // Valid for CPU Backend
    };
};
} // namespace MNN

setGlobalExecutorConfig

From express/Executor.cpp (lines 28-49):

void Executor::setGlobalExecutorConfig(MNNForwardType type, const BackendConfig& config, int numberThread) {
    std::lock_guard<std::mutex> _l(mMutex);

    if(type == MNN_FORWARD_AUTO) {
        ScheduleConfig sConfig;
        sConfig.type = type;
        type = Schedule::getAppropriateType(sConfig);
    }
    auto rt = _getOrCreateRuntime(type, &config, numberThread);
    if (rt == nullptr) {
        type = MNN_FORWARD_CPU;
        numberThread = 1;
        rt = _getOrCreateRuntime(type, &config, numberThread);
    }
    MNN_ASSERT(nullptr != rt);
    mAttr->firstType = type;
    mAttr->numThread = numberThread;
    mAttr->config = config;
    mAttr->config.sharedContext = nullptr;
}

Parameters

Build-Time Parameters

  • MNN_LOW_MEMORY (CMake BOOL, default OFF) -- Enables compilation of low-memory inference paths for weight-quantized models. When ON, the runtime can use int8 compute during GEMM operations on weight-quantized models.
  • MNN_CPU_WEIGHT_DEQUANT_GEMM (CMake BOOL, default OFF) -- Enables the CPU weight-dequantization GEMM kernels specifically, providing finer control over which dequantization kernels are compiled.
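Both options can be enabled together at configure time; a typical low-memory build (assuming an in-tree build/ directory) might look like:

```shell
# Configure a shared-library build with the dynamic weight-dequantization path
# and the dedicated CPU dequant GEMM kernels compiled in.
cmake .. -DMNN_LOW_MEMORY=ON -DMNN_CPU_WEIGHT_DEQUANT_GEMM=ON -DMNN_BUILD_SHARED_LIBS=ON
make -j8
```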

Runtime Parameters

  • BackendConfig.memory (MemoryMode enum: Memory_Normal = 0, Memory_High = 1, Memory_Low = 2) -- Controls the memory optimization level. Memory_Low activates dynamic weight dequantization, using int8 compute during GEMM for weight-quantized models; Memory_Normal uses standard float inference; Memory_High may cache additional data for speed.
  • BackendConfig.precision (PrecisionMode enum: Precision_Normal = 0, Precision_High = 1, Precision_Low = 2, Precision_Low_BF16 = 3) -- Controls compute precision. Precision_Low enables FP16 computation on capable hardware and operates independently of the memory setting.
  • BackendConfig.power (PowerMode enum: Power_Normal = 0, Power_High = 1, Power_Low = 2) -- Controls the CPU power/performance trade-off (core affinity).
  • numThread (int, default 1) -- Number of threads for the inference backend; passed to setGlobalExecutorConfig or set via ScheduleConfig.numThread.

Inputs

  • Weight-quantized MNN model -- A model previously compressed with MNNConvert --weightQuantBits. The weights must be stored in quantized format (int4 or int8).
  • MNN_LOW_MEMORY compiled library -- The MNN shared library must be compiled with -DMNN_LOW_MEMORY=ON for the dynamic dequantization path to be available.
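For example, a 4-bit weight-quantized model could be produced with MNNConvert like so (assuming an ONNX source model; whether 4 is accepted depends on the MNN version, and 8 gives int8 weights as in the workflow below):

```shell
# Produce an int4 weight-quantized model; pass --weightQuantBits 8 for int8 weights
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_quant.mnn --weightQuantBits 4
```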

Outputs

  • Inference session with dynamic weight dequantization -- The runtime creates an inference session where convolution and matrix multiplication operations consume quantized weights directly, performing int8 compute during GEMM instead of dequantizing to float32 first. This results in lower memory usage and potentially faster inference.

Usage Example

Full Workflow

# Step 1: Quantize weights
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_quant.mnn --weightQuantBits 8

# Step 2: Build MNN with low-memory support
cd MNN/build
cmake .. -DMNN_LOW_MEMORY=ON -DMNN_BUILD_SHARED_LIBS=ON
make -j8

// Step 3: Configure runtime for dynamic dequantization
#include <MNN/Interpreter.h>
#include <MNN/MNNForwardType.h>

auto interpreter = MNN::Interpreter::createFromFile("model_quant.mnn");

MNN::ScheduleConfig scheduleConfig;
scheduleConfig.numThread = 4;

MNN::BackendConfig backendConfig;
backendConfig.memory = MNN::BackendConfig::Memory_Low;     // enable dynamic dequant
backendConfig.precision = MNN::BackendConfig::Precision_Normal;
scheduleConfig.backendConfig = &backendConfig;

auto session = interpreter->createSession(scheduleConfig);
interpreter->runSession(session);
