Implementation: Alibaba MNN PyMNN Runtime Manager
| Field | Value |
|---|---|
| implementation_name | PyMNN_Runtime_Manager |
| schema_version | 0.1.0 |
| workflow | Python_Model_Inference |
| implementation_type | API_Doc |
| domain | Deep_Learning_Inference |
| scope | Creating runtime managers, configuring backends, and loading models for inference |
| source_file | express/Executor.cpp:L211-389 |
| related_patterns | Hardware_Abstraction, Backend_Selection, Runtime_Resource_Pooling |
| last_updated | 2026-02-10 14:00 GMT |
Summary
This implementation covers the Python APIs for configuring MNN's inference runtime and loading models. The nn.create_runtime_manager function creates a RuntimeManager that holds backend-specific resources, and nn.load_module_from_file loads a model file into an executable _Module bound to that runtime. The underlying C++ implementation is in express/Executor.cpp (Executor::RuntimeManager::createRuntimeManager at lines 211-336).
API Signatures
nn.create_runtime_manager
nn.create_runtime_manager(config) -> RuntimeManager
Creates a RuntimeManager from configuration. The config parameter is a tuple of config dicts (one dict per backend), even when only one configuration is supplied; the recognized keys are listed under Parameters below.
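As a minimal sketch, the tuple-of-dicts argument can be built like this (key names and the CPU backend code 0 are those documented under Parameters below; the nn.create_runtime_manager call is shown commented out because it requires an MNN installation):

```python
# Sketch: build the tuple of config dicts that
# nn.create_runtime_manager expects.
cpu_config = {
    'backend': 0,           # 0 = CPU (see the backend code table below)
    'precision': 'normal',  # FP32
    'numThread': 4,
}
configs = (cpu_config,)     # a tuple of dicts, even for a single config

# import MNN.nn as nn
# rt = nn.create_runtime_manager(configs)
```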
nn.load_module_from_file
nn.load_module_from_file(
file_name,
input_names,
output_names,
dynamic=False,
shape_mutable=False,
rearrange=False,
backend=expr.Backend.CPU,
memory_mode=expr.MemoryMode.Normal,
power_mode=expr.PowerMode.Normal,
precision_mode=expr.PrecisionMode.Normal,
thread_num=4,
runtime_manager=None
) -> _Module
Loads a model from a .mnn file and returns a _Module ready for inference.
RuntimeManager.set_cache
rt.set_cache(cache_path) -> None
Sets the file path for GPU kernel caching.
RuntimeManager.update_cache
rt.update_cache() -> None
Writes updated GPU kernel cache to disk after inference.
RuntimeManager.set_mode
rt.set_mode(mode) -> None
Sets the session execution mode (e.g., 9 for auto_backend).
RuntimeManager.set_hint
rt.set_hint(mode, value) -> None
Sets execution hints such as tuning parameters.
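set_mode and set_hint are typically applied together right after creating the runtime. The helper below is a hypothetical sketch, not part of the PyMNN API; the mode code 9 (auto_backend) and hint type 0 (tune_num) are the values used in this document's examples:

```python
def tune_gpu_runtime(rt, tuning_iters=20):
    """Hypothetical helper: enable auto backend selection and kernel tuning.

    `rt` is assumed to be a RuntimeManager returned by
    nn.create_runtime_manager.
    """
    rt.set_mode(9)                # 9 = auto_backend session mode
    rt.set_hint(0, tuning_iters)  # hint type 0 = tune_num (iterations)
```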
Parameters
Config Dict Keys (for nn.create_runtime_manager)
| Key | Type | Default | Description |
|---|---|---|---|
| backend | int | 0 | Backend type code: 0=CPU, 1=Metal, 3=OpenCL, 6=CUDA, 7=OpenGL, 9=CoreML |
| precision | str | 'normal' | Precision mode: 'low' (FP16/INT8) or 'normal' (FP32) |
| numThread | int | 4 | Number of threads for CPU backend; on GPU auto-mode, defaults to 16 for tuning |
nn.load_module_from_file Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_name | str | (required) | Path to the .mnn model file |
| input_names | list[str] | (required) | Names of input variables in the model (e.g., ['data']) |
| output_names | list[str] | (required) | Names of output variables in the model (e.g., ['prob']) |
| dynamic | bool | False | Enable dynamic graph mode |
| shape_mutable | bool | False | Allow shape changes in internal control flow |
| rearrange | bool | False | Rearrange input variables |
| backend | expr.Backend | expr.Backend.CPU | Backend (ignored when runtime_manager is provided) |
| memory_mode | expr.MemoryMode | expr.MemoryMode.Normal | Memory allocation mode |
| power_mode | expr.PowerMode | expr.PowerMode.Normal | Power consumption mode |
| precision_mode | expr.PrecisionMode | expr.PrecisionMode.Normal | Computation precision mode |
| thread_num | int | 4 | Number of threads (ignored when runtime_manager is provided) |
| runtime_manager | RuntimeManager | None | Pre-configured RuntimeManager; overrides backend/thread settings |
Inputs
- MNN model file (.mnn) -- The serialized neural network model
- Runtime config dict -- A dictionary specifying backend, precision, and threading preferences
Outputs
- RuntimeManager instance -- Holds runtime resources (thread pool, GPU kernels, memory pool)
- _Module instance -- A loaded model ready for inference via forward()
Code Example
import MNN.nn as nn
# cv, numpy and expr are commonly imported alongside nn for image
# pre/post-processing; only nn is used in this example.
import MNN.cv as cv
import MNN.numpy as np
import MNN.expr as expr
# Configure runtime for OpenCL GPU with low precision
config = {
'backend': 3, # OpenCL
'precision': 'low', # FP16
'numThread': 4
}
# Create RuntimeManager from config
rt = nn.create_runtime_manager((config,))
# Enable GPU kernel caching for faster subsequent loads
rt.set_cache('.cachefile')
# Set auto_backend mode
rt.set_mode(9)
# Set tune_num hint (0 = tuning hint type, 20 = number of tuning iterations)
rt.set_hint(0, 20)
# Load model with the configured runtime
net = nn.load_module_from_file(
'mobilenet_v1.mnn',
['data'], # input tensor names
['prob'], # output tensor names
runtime_manager=rt
)
# net is now ready: output = net.forward(input_var)
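To close the caching loop opened by set_cache above, a run normally ends with update_cache. A minimal sketch, assuming net and rt are the objects created above and input_var is an already-prepared input variable (the helper function name is hypothetical):

```python
def infer_and_persist_cache(net, rt, input_var):
    """Run one forward pass, then flush newly tuned GPU kernels back to
    the cache file previously registered via rt.set_cache(...)."""
    output = net.forward(input_var)
    rt.update_cache()  # writes the updated GPU kernel cache to disk
    return output
```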
Simplified Loading Without RuntimeManager
import MNN.nn as nn
# Load with default CPU backend, 4 threads, normal precision
net = nn.load_module_from_file(
'mobilenet_v1.mnn',
['data'],
['prob']
)
C++ Implementation Details
The core C++ implementation resides in express/Executor.cpp:
- Lines 211-216: Executor::RuntimeManager::createRuntimeManager(vector<ScheduleConfig>&) -- Entry point that delegates to the single-config overload
- Lines 311-336: Executor::RuntimeManager::createRuntimeManager(const ScheduleConfig&) -- Creates the RuntimeManager by allocating backend resources based on the config; handles AUTO backend selection (lines 318-323), where GPU backends default to numThread=16 for tuning
- Lines 299-306: RuntimeManager constructor -- Sets default session modes (Session_Release, Session_Input_User, Session_Output_User)
- Lines 349-389: RuntimeManager::setCache -- Loads and validates cached GPU kernel data from disk
// express/Executor.cpp:L311-336 (simplified)
Executor::RuntimeManager* Executor::RuntimeManager::createRuntimeManager(
        const ScheduleConfig& config) {
    auto res = new RuntimeManager;
    auto glo = ExecutorScope::Current();
    std::lock_guard<std::mutex> _l(glo->mMutex);
    auto type     = Schedule::getAppropriateType(config);
    int numThread = config.numThread;
    if (config.type == MNN_FORWARD_AUTO) {
        if (type == MNN_FORWARD_OPENCL || type == MNN_FORWARD_METAL) {
            numThread = 16; // AUTO default: extra threads for GPU tuning
        }
    }
    auto rt = glo->_getOrCreateRuntime(type, config.backendConfig,
                                       numThread, false);
    res->mInside->mRuntime.first.insert(std::make_pair(type, rt));
    // (The full implementation also stores a backup CPU runtime in
    // mRuntime.second for ops the selected backend cannot execute.)
    return res;
}
Edge Cases and Limitations
- Thread safety: RuntimeManager is not thread-safe; do not share across Python threads. Create separate RuntimeManagers for concurrent inference.
- Backend availability: If the requested backend (e.g., CUDA) is not compiled in, MNN falls back to CPU. Use MNN_FORWARD_AUTO to let MNN choose the best available backend.
- GPU cache invalidation: If the model or backend changes, the cached kernels may be invalid. MNN detects this and resets the cache automatically (Executor.cpp line 384).
- Memory ownership: The RuntimeManager owns all allocated runtime resources. Destroying the RuntimeManager releases GPU memory, thread pools, etc.
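The thread-safety point above can be sketched as follows: each worker thread builds its own RuntimeManager and module through a caller-supplied load_fn (a hypothetical factory that would wrap nn.create_runtime_manager plus nn.load_module_from_file), so no runtime state is ever shared across threads:

```python
import threading

def run_in_threads(load_fn, inputs):
    """Sketch: concurrent inference with one RuntimeManager per thread.

    load_fn() is a hypothetical factory that creates a fresh
    RuntimeManager and module each time it is called, since
    RuntimeManager is not thread-safe.
    """
    results = [None] * len(inputs)

    def worker(idx, x):
        net = load_fn()              # private runtime: never shared
        results[idx] = net.forward(x)

    threads = [threading.Thread(target=worker, args=(i, x))
               for i, x in enumerate(inputs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```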