Principle: Alibaba MNN Runtime Configuration
| Field | Value |
|---|---|
| principle_name | Runtime_Configuration |
| schema_version | 0.1.0 |
| workflow | Python_Model_Inference |
| principle_type | Configuration |
| domain | Deep_Learning_Inference |
| scope | Configuring hardware backends, precision modes, and loading models for inference |
| related_patterns | Hardware_Abstraction, Backend_Selection, Runtime_Resource_Management |
| last_updated | 2026-02-10 14:00 GMT |
Overview
Runtime Configuration addresses the problem of selecting and configuring the compute hardware (CPU, GPU, NPU) on which a neural network model will execute, and then loading the model into that configured runtime environment. This step decouples the model definition from its execution context, enabling the same model to run on different hardware backends without code changes.
Core Concept
MNN provides a hardware-agnostic runtime abstraction layer. Rather than writing backend-specific code, users describe their desired runtime configuration (backend type, precision level, number of threads) in a configuration dictionary, and MNN's runtime manager allocates the appropriate resources. The two key objects in this abstraction are:
- RuntimeManager: Holds runtime resources such as thread pools (CPU), kernel caches (GPU), and memory pools. It encapsulates the execution context.
- _Module: Represents a loaded neural network model bound to a specific runtime. The module contains the computation graph and can execute forward passes.
Theory and Motivation
Why Runtime Configuration is Needed
Modern inference targets span a wide range of hardware: multi-core CPUs, mobile GPUs (via OpenCL, Metal, or Vulkan), dedicated NPUs (via CoreML or NNAPI), and server GPUs (via CUDA). Each backend has different performance characteristics, precision capabilities, and resource requirements. Runtime configuration allows users to:
- Select the optimal backend for the target device (e.g., OpenCL on Android GPU, Metal on iOS GPU, CUDA on server GPU)
- Control precision to trade accuracy for speed (e.g., FP16 "low" precision on GPU)
- Manage threading for CPU execution (number of threads for parallel computation)
- Share resources across multiple models running on the same device via a shared RuntimeManager
- Cache GPU kernels for faster subsequent loads by setting a cache file path
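Each of these knobs surfaces as a key in the Python configuration dictionary. A sketch of a CPU-oriented configuration (key names as used by MNN's Python API):

```python
# CPU inference configuration; each key corresponds to a knob listed above.
cpu_config = {
    "backend": 0,           # 0 = CPU
    "numThread": 4,         # threads for parallel CPU computation
    "precision": "normal",  # FP32, maximum accuracy
}
```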
Backend Type Codes
MNN uses integer codes to identify backends:
- 0 -- CPU (MNN_FORWARD_CPU; default, always available)
- 1 -- Metal (MNN_FORWARD_METAL; Apple GPU)
- 2 -- CUDA (MNN_FORWARD_CUDA; NVIDIA GPU)
- 3 -- OpenCL (MNN_FORWARD_OPENCL; cross-platform GPU)
- 5 -- CoreML/NNAPI (MNN_FORWARD_NN; Apple Neural Engine, Android NPU)
- 6 -- OpenGL (MNN_FORWARD_OPENGL)
- 7 -- Vulkan (MNN_FORWARD_VULKAN)
Precision Modes
- normal -- Full precision (FP32), maximum accuracy
- low -- Reduced precision (FP16 or INT8 where supported), faster execution at potential accuracy cost
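A hedged sketch pairing a GPU backend with reduced precision, using the OpenCL backend code (3) from the list above:

```python
# OpenCL GPU with FP16: faster execution at potential accuracy cost.
gpu_config = {
    "backend": 3,        # 3 = OpenCL (cross-platform GPU)
    "precision": "low",  # FP16 where the backend supports it
}
```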
Auto Backend Selection
When the backend is set to MNN_FORWARD_AUTO, MNN automatically selects the best available backend on the device. For GPU backends chosen via auto mode, MNN defaults to the MNN_GPU_TUNING_FAST tuning mode with numThread=16.
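In the Python config this corresponds to backend code 4 (the MNN_FORWARD_AUTO value in MNNForwardType.h; verify the constant against your MNN version):

```python
MNN_FORWARD_AUTO = 4  # enum value from MNNForwardType.h (check your build)

# Let MNN pick the best available backend; GPU backends chosen this way
# default to the MNN_GPU_TUNING_FAST tuning mode with numThread=16.
auto_config = {"backend": MNN_FORWARD_AUTO, "numThread": 16}
```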
How It Fits in the Workflow
Runtime configuration sits between installation/preprocessing and the actual inference execution:
- Upstream: MNN Python package installed, model file (.mnn) available
- This step: Create RuntimeManager with desired config, load model into _Module
- Downstream: _Module.forward() executes the inference on the configured backend
Key Considerations
- RuntimeManager is not thread-safe: A RuntimeManager must not be shared across Python threads. Create separate RuntimeManagers for concurrent inference in different threads.
- GPU kernel caching: Call rt.set_cache(".cachefile") before loading models to enable GPU kernel caching; call rt.update_cache() after inference to persist cached kernels.
- Memory efficiency: The RuntimeManager holds all allocated runtime resources. Destroying the RuntimeManager releases these resources.
- Model-runtime binding: When loading a model with nn.load_module_from_file, the runtime_manager parameter binds the model to a specific runtime. Without it, MNN uses the default CPU backend.
- ScheduleConfig: The Python config dict maps to the C++ ScheduleConfig struct. The createRuntimeManager method in Executor.cpp (lines 311-336) processes this config to allocate the appropriate backend resources.