Implementation: Alibaba MNN PyMNN Runtime Manager
| Field | Value |
|---|---|
| implementation_name | PyMNN_Runtime_Manager |
| schema_version | 0.1.0 |
| workflow | Python_Model_Inference |
| implementation_type | API_Doc |
| domain | Deep_Learning_Inference |
| scope | Creating runtime managers, configuring backends, and loading models for inference |
| source_file | express/Executor.cpp:L211-389 |
| related_patterns | Hardware_Abstraction, Backend_Selection, Runtime_Resource_Pooling |
| last_updated | 2026-02-10 14:00 GMT |
Summary
This implementation covers the Python APIs for configuring MNN's inference runtime and loading models. The nn.create_runtime_manager function creates a RuntimeManager that holds backend-specific resources, and nn.load_module_from_file loads a model file into an executable _Module bound to that runtime. The underlying C++ implementation is in express/Executor.cpp (Executor::RuntimeManager::createRuntimeManager at lines 211-336).
API Signatures
nn.create_runtime_manager
nn.create_runtime_manager(config) -> RuntimeManager
Creates a RuntimeManager from configuration. The config parameter is a tuple of config dicts (one dict per backend), even when only one configuration is supplied; the recognized keys are listed under Parameters below.
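As a minimal sketch, the tuple-of-dicts argument can be built like this (key names and the CPU backend code 0 are those documented under Parameters below; the nn.create_runtime_manager call is shown commented out because it requires an MNN installation):

```python
# Sketch: build the tuple of config dicts that
# nn.create_runtime_manager expects.
cpu_config = {
    'backend': 0,           # 0 = CPU (see the backend code table below)
    'precision': 'normal',  # FP32
    'numThread': 4,
}
configs = (cpu_config,)     # a tuple of dicts, even for a single config

# import MNN.nn as nn
# rt = nn.create_runtime_manager(configs)
```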
nn.load_module_from_file
nn.load_module_from_file(
file_name,
input_names,
output_names,
dynamic=False,
shape_mutable=False,
rearrange=False,
backend=expr.Backend.CPU,
memory_mode=expr.MemoryMode.Normal,
power_mode=expr.PowerMode.Normal,
precision_mode=expr.PrecisionMode.Normal,
thread_num=4,
runtime_manager=None
) -> _Module
Loads a model from a .mnn file and returns a _Module ready for inference.
RuntimeManager.set_cache
rt.set_cache(cache_path) -> None
Sets the file path for GPU kernel caching.
RuntimeManager.update_cache
rt.update_cache() -> None
Writes updated GPU kernel cache to disk after inference.
RuntimeManager.set_mode
rt.set_mode(mode) -> None
Sets the session execution mode (e.g., 9 for auto_backend).
RuntimeManager.set_hint
rt.set_hint(mode, value) -> None
Sets execution hints such as tuning parameters.
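set_mode and set_hint are typically applied together right after creating the runtime. The helper below is a hypothetical sketch, not part of the PyMNN API; the mode code 9 (auto_backend) and hint type 0 (tune_num) are the values used in this document's examples:

```python
def tune_gpu_runtime(rt, tuning_iters=20):
    """Hypothetical helper: enable auto backend selection and kernel tuning.

    `rt` is assumed to be a RuntimeManager returned by
    nn.create_runtime_manager.
    """
    rt.set_mode(9)                # 9 = auto_backend session mode
    rt.set_hint(0, tuning_iters)  # hint type 0 = tune_num (iterations)
```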
Parameters
Config Dict Keys (for nn.create_runtime_manager)
| Key | Type | Default | Description |
|---|---|---|---|
| backend | int | 0 | Backend type code: 0=CPU, 1=Metal, 3=OpenCL, 6=CUDA, 7=OpenGL, 9=CoreML |
| precision | str | 'normal' | Precision mode: 'low' (FP16/INT8) or 'normal' (FP32) |
| numThread | int | 4 | Number of threads for CPU backend; on GPU auto-mode, defaults to 16 for tuning |
nn.load_module_from_file Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_name | str | (required) | Path to the .mnn model file |
| input_names | list[str] | (required) | Names of input variables in the model (e.g., ['data']) |
| output_names | list[str] | (required) | Names of output variables in the model (e.g., ['prob']) |
| dynamic | bool | False | Enable dynamic graph mode |
| shape_mutable | bool | False | Allow shape changes in internal control flow |
| rearrange | bool | False | Rearrange input variables |
| backend | expr.Backend | expr.Backend.CPU | Backend (ignored when runtime_manager is provided) |
| memory_mode | expr.MemoryMode | expr.MemoryMode.Normal | Memory allocation mode |
| power_mode | expr.PowerMode | expr.PowerMode.Normal | Power consumption mode |
| precision_mode | expr.PrecisionMode | expr.PrecisionMode.Normal | Computation precision mode |
| thread_num | int | 4 | Number of threads (ignored when runtime_manager is provided) |
| runtime_manager | RuntimeManager | None | Pre-configured RuntimeManager; overrides backend/thread settings |
Inputs
- MNN model file (.mnn) -- The serialized neural network model
- Runtime config dict -- A dictionary specifying backend, precision, and threading preferences
Outputs
- RuntimeManager instance -- Holds runtime resources (thread pool, GPU kernels, memory pool)
- _Module instance -- A loaded model ready for inference via forward()
Code Example
import MNN.nn as nn
# cv, numpy and expr are commonly imported alongside nn for image
# pre/post-processing; only nn is used in this example.
import MNN.cv as cv
import MNN.numpy as np
import MNN.expr as expr
# Configure runtime for OpenCL GPU with low precision
config = {
'backend': 3, # OpenCL
'precision': 'low', # FP16
'numThread': 4
}
# Create RuntimeManager from config
rt = nn.create_runtime_manager((config,))
# Enable GPU kernel caching for faster subsequent loads
rt.set_cache('.cachefile')
# Set auto_backend mode
rt.set_mode(9)
# Set tune_num hint (0 = tuning hint type, 20 = number of tuning iterations)
rt.set_hint(0, 20)
# Load model with the configured runtime
net = nn.load_module_from_file(
'mobilenet_v1.mnn',
['data'], # input tensor names
['prob'], # output tensor names
runtime_manager=rt
)
# net is now ready: output = net.forward(input_var)
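To close the caching loop opened by set_cache above, a run normally ends with update_cache. A minimal sketch, assuming net and rt are the objects created above and input_var is an already-prepared input variable (the helper function name is hypothetical):

```python
def infer_and_persist_cache(net, rt, input_var):
    """Run one forward pass, then flush newly tuned GPU kernels back to
    the cache file previously registered via rt.set_cache(...)."""
    output = net.forward(input_var)
    rt.update_cache()  # writes the updated GPU kernel cache to disk
    return output
```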
Simplified Loading Without RuntimeManager
import MNN.nn as nn
# Load with default CPU backend, 4 threads, normal precision
net = nn.load_module_from_file(
'mobilenet_v1.mnn',
['data'],
['prob']
)
C++ Implementation Details
The core C++ implementation resides in express/Executor.cpp:
- Lines 211-216: Executor::RuntimeManager::createRuntimeManager(vector<ScheduleConfig>&) -- Entry point that delegates to the single-config overload
- Lines 311-336: Executor::RuntimeManager::createRuntimeManager(const ScheduleConfig&) -- Creates the RuntimeManager by allocating backend resources based on the config; handles AUTO backend selection (lines 318-323), where GPU backends default to numThread=16 for tuning
- Lines 299-306: RuntimeManager constructor -- Sets default session modes (Session_Release, Session_Input_User, Session_Output_User)
- Lines 349-389: RuntimeManager::setCache -- Loads and validates cached GPU kernel data from disk
// express/Executor.cpp:L311-336 (simplified)
Executor::RuntimeManager* Executor::RuntimeManager::createRuntimeManager(
        const ScheduleConfig& config) {
    auto res = new RuntimeManager;
    auto glo = ExecutorScope::Current();
    std::lock_guard<std::mutex> _l(glo->mMutex);
    auto type     = Schedule::getAppropriateType(config);
    int numThread = config.numThread;
    if (config.type == MNN_FORWARD_AUTO) {
        if (type == MNN_FORWARD_OPENCL || type == MNN_FORWARD_METAL) {
            numThread = 16; // AUTO default: extra threads for GPU tuning
        }
    }
    auto rt = glo->_getOrCreateRuntime(type, config.backendConfig,
                                       numThread, false);
    res->mInside->mRuntime.first.insert(std::make_pair(type, rt));
    // (The full implementation also stores a backup CPU runtime in
    // mRuntime.second for ops the selected backend cannot execute.)
    return res;
}
Edge Cases and Limitations
- Thread safety: RuntimeManager is not thread-safe; do not share across Python threads. Create separate RuntimeManagers for concurrent inference.
- Backend availability: If the requested backend (e.g., CUDA) is not compiled in, MNN falls back to CPU. Use MNN_FORWARD_AUTO to let MNN choose the best available backend.
- GPU cache invalidation: If the model or backend changes, the cached kernels may be invalid. MNN detects this and resets the cache automatically (Executor.cpp line 384).
- Memory ownership: The RuntimeManager owns all allocated runtime resources. Destroying the RuntimeManager releases GPU memory, thread pools, etc.
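The thread-safety point above can be sketched as follows: each worker thread builds its own RuntimeManager and module through a caller-supplied load_fn (a hypothetical factory that would wrap nn.create_runtime_manager plus nn.load_module_from_file), so no runtime state is ever shared across threads:

```python
import threading

def run_in_threads(load_fn, inputs):
    """Sketch: concurrent inference with one RuntimeManager per thread.

    load_fn() is a hypothetical factory that creates a fresh
    RuntimeManager and module each time it is called, since
    RuntimeManager is not thread-safe.
    """
    results = [None] * len(inputs)

    def worker(idx, x):
        net = load_fn()              # private runtime: never shared
        results[idx] = net.forward(x)

    threads = [threading.Thread(target=worker, args=(i, x))
               for i, x in enumerate(inputs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```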