Principle: Alibaba MNN Neural Network Inference
| Field | Value |
|---|---|
| principle_name | Neural_Network_Inference |
| schema_version | 0.1.0 |
| workflow | Python_Model_Inference |
| principle_type | Execution |
| domain | Deep_Learning_Inference |
| scope | Executing a forward pass through a loaded neural network model |
| related_patterns | Computational_Graph_Evaluation, Lazy_Evaluation, Operator_Dispatch |
| last_updated | 2026-02-10 14:00 GMT |
Overview
Neural Network Inference is the step where a preprocessed input tensor flows through the layers of a loaded neural network model, producing an output tensor containing the model's predictions. In MNN, this is achieved by calling the forward method on a loaded _Module, which triggers the evaluation of the model's computational graph on the configured hardware backend.
Core Concept
A neural network inference pass (also called a "forward pass") takes an input tensor and propagates it through a sequence of operations (convolutions, activations, pooling, fully connected layers, etc.) defined by the model's computational graph. Each operation transforms the data according to learned weights, producing intermediate feature maps that ultimately yield the model's output.
In MNN's expression-based API, the forward pass operates on Var objects. A Var represents both data and the expression (computation) that produced it. When forward() is called, MNN evaluates the computational graph lazily or eagerly depending on the executor configuration, dispatching each operation to the appropriate backend (CPU, GPU, etc.).
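The data-plus-expression idea can be made concrete with a minimal pure-Python sketch. This is not MNN's real VARP class, only an illustration of a value that records the computation that produces it and evaluates on first read:

```python
# Minimal sketch (not MNN's actual API): a Var holds both the expression
# that produces it and a cached value filled in lazily on first read().

class Var:
    def __init__(self, fn, inputs=()):
        self._fn = fn          # computation producing this Var's data
        self._inputs = inputs  # upstream Vars this expression consumes
        self._value = None     # cached result, filled on first read()

    def read(self):
        if self._value is None:
            args = [v.read() for v in self._inputs]
            self._value = self._fn(*args)
        return self._value

def const(x):
    return Var(lambda: x)

def add(a, b):
    return Var(lambda x, y: x + y, (a, b))

x = const(2)
y = add(x, const(3))   # nothing computed yet; only the graph is built
print(y.read())        # graph evaluated on demand -> 5
```

Calling `read()` a second time returns the cached value without re-evaluating the graph, which mirrors why MNN can defer work until an output is actually requested.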
Theory and Motivation
Computational Graph Evaluation
MNN's inference engine represents models as directed acyclic graphs (DAGs) of operations. Each node in the graph is an Expr (expression) object that references:
- An Op (operator) describing the computation (e.g., convolution, ReLU)
- Input Vars that the operation consumes
- Output Vars that the operation produces
When a Var's value is requested (either explicitly via read() or implicitly via forward()), MNN traverses the graph backward from the output to determine which inputs are needed, validates shapes via Expr::requireInfo() (express/Expr.cpp, lines 308-364), and then executes the operations in topological order.
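The backward walk plus topological execution can be sketched with a plain depth-first search. The graph below is a toy example, not MNN's internal representation:

```python
# Illustrative sketch of DAG evaluation order: walk backward from the
# requested output to collect the nodes it depends on, then list them in
# topological order so every op runs after all of its inputs.

def topo_order(output, inputs_of):
    """Post-order DFS from `output`; each node appears after its inputs."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in inputs_of.get(node, ()):
            visit(dep)
        order.append(node)
    visit(output)
    return order

# Tiny graph: out = relu(conv); conv consumes the input and a weight.
inputs_of = {"out": ["conv"], "conv": ["input", "weight"]}
print(topo_order("out", inputs_of))
# -> ['input', 'weight', 'conv', 'out']
```

Nodes not reachable from the requested output are never visited, which is why asking for one Var does not force the whole model to run.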
Shape Inference
Before executing the actual computation, MNN performs shape inference to determine the output dimensions of each operation. The Expr::requireInfo() method (Expr.cpp line 308) recursively resolves input shapes and computes output shapes. If any input shape is unknown or invalid, the method returns false and marks the expression as invalid.
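As a concrete illustration, the standard convolution output-size formula shows what shape inference computes before any data moves. This is an assumed textbook formula for a plain 2D convolution, not a transcription of Expr::requireInfo():

```python
# Sketch of shape inference for a 2D convolution (assumed standard
# formula, not MNN's Expr::requireInfo implementation): derive the output
# shape from the input shape and the op's parameters, or report invalid.

def conv2d_out_shape(n, c_in, h, w, c_out, kernel, stride=1, pad=0):
    h_out = (h + 2 * pad - kernel) // stride + 1
    w_out = (w + 2 * pad - kernel) // stride + 1
    if h_out <= 0 or w_out <= 0:
        return None  # unknown/invalid shape: mark the expression invalid
    return (n, c_out, h_out, w_out)

# A 7x7 stride-2 conv on a 1x3x224x224 input with 64 filters:
print(conv2d_out_shape(1, 3, 224, 224, 64, kernel=7, stride=2, pad=3))
# -> (1, 64, 112, 112)
```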
Variable Creation and Lazy Evaluation
When Variable::create (Expr.cpp, lines 373-439) is called, MNN may either:
- Eagerly evaluate: If lazyEval is disabled, immediately compute the result and store it as a constant
- Lazily evaluate: Defer computation until the result is actually needed
- Decompose via geometry computation: In CONTENT mode, use the GeometryComputer to decompose complex operations into simpler ones for the backend
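The eager/lazy branch of the list above can be sketched as follows. The `Executor` flag and dict-based variables here are stand-ins, not MNN internals:

```python
# Sketch (not MNN internals) of the eager/lazy choice in Variable::create:
# with lazy evaluation off, the result is computed immediately and stored
# as a constant; with it on, computation is deferred until read().

class Executor:
    lazy_eval = True  # stand-in for the executor's lazy-eval configuration

def create(fn):
    if Executor.lazy_eval:
        return {"fn": fn, "value": None}   # lazy: defer the computation
    return {"fn": None, "value": fn()}     # eager: evaluate now, store const

def read(var):
    if var["value"] is None:
        var["value"] = var["fn"]()
    return var["value"]

Executor.lazy_eval = False
eager = create(lambda: 2 + 3)
assert eager["value"] == 5        # already computed at creation time

Executor.lazy_eval = True
lazy = create(lambda: 2 + 3)
assert lazy["value"] is None      # deferred until read()
assert read(lazy) == 5
```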
NC4HW4 Data Format
MNN's inference kernels are optimized for the NC4HW4 memory layout, which groups every 4 channels together to enable SIMD vectorization. Input tensors should be converted to NC4HW4 format before calling forward() for optimal performance. Some operations internally convert formats, but providing NC4HW4 input avoids unnecessary conversions.
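The layout itself is easiest to see in code. The sketch below assumes the common NC4HW4 semantics (channels zero-padded to a multiple of 4 and grouped in packs of 4, i.e. logical shape [N, ceil(C/4), H, W, 4]); it is written against flat Python lists rather than MNN tensors:

```python
# Sketch of the NC4HW4 layout (assumed semantics: channels padded to a
# multiple of 4 and packed in groups of 4 so each SIMD lane can load 4
# adjacent channel values at once).

def nchw_to_nc4hw4(data, n, c, h, w):
    """data: flat NCHW list; returns a flat NC4HW4 list, zero-padding C."""
    c4 = (c + 3) // 4
    out = [0.0] * (n * c4 * h * w * 4)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    v = data[((ni * c + ci) * h + hi) * w + wi]
                    pack, lane = divmod(ci, 4)
                    out[(((ni * c4 + pack) * h + hi) * w + wi) * 4 + lane] = v
    return out

# 1x3x1x2 tensor: 3 channels, padded with a zero fourth channel per pack.
flat = [1, 2,   3, 4,   5, 6]          # NCHW order
print(nchw_to_nc4hw4(flat, 1, 3, 1, 2))
# -> [1, 3, 5, 0.0, 2, 4, 6, 0.0]
```

Note how the three channel values for each spatial position end up adjacent in memory (plus one zero pad), which is exactly what a 4-wide SIMD load wants.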
How It Fits in the Workflow
The forward pass is the core computation step in the inference pipeline:
- Upstream: Preprocessed input Var in NC4HW4 format, loaded _Module
- This step: Execute model.forward(input_var) to run inference
- Downstream: Raw output Var passed to postprocessing for interpretation
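The three pipeline stages above can be sketched end to end. Everything here (`preprocess`, `postprocess`, and `MockModule`) is a stand-in to make the data flow concrete, not MNN's API:

```python
# End-to-end sketch of where the forward pass sits in the pipeline, using
# mock stages (preprocess, MockModule, postprocess are stand-ins, not
# MNN objects).

def preprocess(pixels):
    # upstream: turn raw pixels into a normalized input "Var"
    return [p / 255.0 for p in pixels]

class MockModule:
    def forward(self, x):
        # this step: the model's forward pass (here just a toy transform)
        return [2 * v for v in x]

def postprocess(logits):
    # downstream: interpret the raw output, e.g. pick the argmax class
    return max(range(len(logits)), key=logits.__getitem__)

model = MockModule()
input_var = preprocess([0, 128, 255])
output_var = model.forward(input_var)
print(postprocess(output_var))  # index of the largest output -> 2
```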
Key Considerations
- Input format: The input Var should be in NC4HW4 format for most models. Using NCHW or NHWC inputs may work but can trigger internal format conversions that reduce performance.
- Input shape: The input Var's shape must match what the model expects (including the batch dimension). Shape mismatches cause errors during shape inference.
- Single vs. multiple inputs: The forward() method accepts either a single Var or a list of Vars for multi-input models. For multi-output models, use onForward() which returns a list of output Vars.
- Output format: The output Var is typically in NC4HW4 format and must be converted to NHWC or NCHW for postprocessing.
- Stateless execution: Each forward() call is independent; there is no hidden state between calls (unless the model itself contains stateful operations like RNNs).
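The forward()/onForward() distinction from the considerations above can be sketched with a mock module (a stand-in class, not MNN's _Module):

```python
# Sketch of the forward/onForward split with a mock module (not MNN's
# _Module): onForward takes and returns lists of Vars for multi-input/
# multi-output models; forward covers the single-output case.

class MockModule:
    def onForward(self, inputs):
        # multi-output: one list element per model output head
        total = sum(sum(x) for x in inputs)
        return [[total], [total * 2]]

    def forward(self, single_input):
        # single-output convenience path over the multi-output call
        return self.onForward([single_input])[0]

m = MockModule()
print(m.forward([1, 2, 3]))          # single output -> [6]
print(m.onForward([[1, 2], [3]]))    # two output heads -> [[6], [12]]
```

Because each call builds its result only from its arguments, running the mock twice with the same input gives the same output, mirroring the stateless-execution point above.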