Heuristic: Alibaba MNN NC4HW4 Data Layout
| Knowledge Sources | |
|---|---|
| Domains | Data_Layout, Optimization, Tensor_Management |
| Last Updated | 2026-02-10 14:00 GMT |
Overview
Understanding MNN's NC4HW4 internal tensor layout where channels are padded to multiples of 4 for SIMD optimization.
Description
MNN internally uses NC4HW4 layout for CV-related operators where the channel dimension is rounded up to multiples of 4 (UP_DIV(C,4)*4). This means Tensor::elementSize() may return a larger value than N*C*H*W. Users must use copyFromHostTensor/copyToHostTensor to convert between user-space NCHW/NHWC and internal NC4HW4. Directly reading or writing to internal tensor memory will produce incorrect results because the padding and interleaving scheme is opaque.
The NC4HW4 format groups every 4 channels together, interleaving them so that adjacent memory locations contain the same spatial position across 4 channels. When the channel count is not a multiple of 4, the remaining slots are zero-padded.
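To make the interleaving concrete, here is a minimal sketch of the packing step, not MNN's actual implementation (the function name `packNC4HW4` is illustrative): it converts an NCHW float buffer into NC4HW4, zero-padding the tail channels when C is not a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative NC4HW4 packing: channels are split into groups of 4, and
// within each group the 4 channel values for one spatial position sit in
// adjacent memory. Remaining lanes in the last group stay zero-padded.
std::vector<float> packNC4HW4(const std::vector<float>& nchw,
                              int n, int c, int h, int w) {
    const int c4 = (c + 3) / 4;  // UP_DIV(C, 4): number of channel groups
    std::vector<float> out(static_cast<size_t>(n) * c4 * h * w * 4, 0.0f);
    for (int b = 0; b < n; ++b)
        for (int ch = 0; ch < c; ++ch)
            for (int y = 0; y < h; ++y)
                for (int x = 0; x < w; ++x) {
                    const size_t src =
                        ((static_cast<size_t>(b) * c + ch) * h + y) * w + x;
                    // Group index, then spatial position, then lane in group.
                    const size_t dst =
                        (((static_cast<size_t>(b) * c4 + ch / 4) * h + y) * w + x) * 4
                        + ch % 4;
                    out[dst] = nchw[src];
                }
    return out;
}
```

For a 1x3x1x2 RGB tensor, the output holds `[R0, G0, B0, 0, R1, G1, B1, 0]`: the three channel values for each pixel are adjacent, with the fourth lane zero-padded.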
Usage
Use this knowledge when:
- You encounter unexpected tensor sizes where `elementSize()` does not equal `N*C*H*W`.
- You get corrupted or garbled output from direct tensor memory access on internal tensors.
- You need to feed input data to MNN models or extract output data and are deciding how to handle the tensor format.
- You are debugging data mismatch issues between MNN inference output and reference implementations.
The Insight (Rule of Thumb)
- Action: Never directly read/write internal tensor memory. Use `copyFromHostTensor()`/`copyToHostTensor()` for data transfer.
- Value: Create host tensors with explicit layout (`Tensor::TENSORFLOW` for NHWC, `Tensor::CAFFE` for NCHW) before copying data in or out.
- Trade-off: Copy overhead vs correctness. NC4HW4 enables vectorized SIMD operations but requires layout conversion at input/output boundaries.
- Diagnostic: If `Tensor::elementSize()` returns a value larger than expected, this confirms the tensor is using NC4HW4 layout with channel padding.
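The diagnostic above reduces to simple arithmetic: compare `elementSize()` against the padded element count implied by the `UP_DIV(C, 4) * 4` rule. A sketch of that check (the helper name `paddedElementCount` is illustrative, not an MNN API):

```cpp
#include <cassert>
#include <cstdint>

// Expected element count for an NC4HW4 tensor: the channel dimension is
// rounded up to the next multiple of 4 before multiplying out the shape.
int64_t paddedElementCount(int n, int c, int h, int w) {
    const int64_t cAligned = ((c + 3) / 4) * 4;  // UP_DIV(C, 4) * 4
    return static_cast<int64_t>(n) * cAligned * h * w;
}
```

For a 1x3x224x224 RGB input, an internal tensor would report 1\*4\*224\*224 = 200704 elements rather than the 150528 a plain `N*C*H*W` computation predicts.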
Reasoning
NC4HW4 aligns the channel dimension to the SIMD width (4 for float32 NEON/SSE), enabling hardware-specific memory access optimizations. By grouping 4 channels together, a single SIMD load instruction can fetch data for 4 channels at the same spatial position, maximizing compute throughput for convolution and other channel-wise operations.
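This access pattern can be illustrated with a per-channel bias add: in NC4HW4, each spatial position touches 4 contiguous floats, exactly one 128-bit vector. The scalar sketch below (function name illustrative; a real kernel would use NEON/SSE intrinsics) shows the inner 4-wide step a compiler or intrinsic can turn into a single SIMD load/add/store.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-channel bias add over an NC4HW4 buffer. `c4Groups` is UP_DIV(C, 4),
// `area` is H*W, and `bias4` holds one bias per padded channel lane.
void biasAddNC4HW4(std::vector<float>& data, const std::vector<float>& bias4,
                   int c4Groups, int area) {
    for (int g = 0; g < c4Groups; ++g)
        for (int p = 0; p < area; ++p)
            for (int lane = 0; lane < 4; ++lane)  // 4 contiguous floats: one SIMD op
                data[(static_cast<size_t>(g) * area + p) * 4 + lane]
                    += bias4[static_cast<size_t>(g) * 4 + lane];
}
```

In NCHW the same operation would read the 4 channel values from locations `H*W` elements apart, defeating a single vector load.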
The internal layout is opaque to users and may differ across backends (CPU, GPU, Metal). The CPU backend uses NC4HW4 for most CV operators; the OpenCL backend may use image2d formats; the Vulkan backend has its own layout conventions. The copyFromHostTensor/copyToHostTensor API abstracts these differences, performing whatever layout transformation is needed for the current backend.
Code evidence from docs/faq.md lines 138-139:
MNN internally uses the NC4HW4 layout for CV-related operators; Tensor::elementSize() may be larger than N*C*H*W. Use copyFromHostTensor/copyToHostTensor to perform the data conversion.
Code evidence from express/Expr.cpp format conversion logic:
```cpp
// Convert between user-space format (NCHW/NHWC) and internal NC4HW4.
// The conversion handles channel padding: UP_DIV(C, 4) * 4.
if (source->getType() == halide_type_of<float>()) {
    MNNPackC4(dst, src, area, channel);
}
```
The UP_DIV(C, 4) macro computes (C + 3) / 4, so a 3-channel RGB tensor becomes 4 channels internally (1 padded channel), and a 64-channel feature map remains 64 channels (no padding needed).
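Those two examples can be checked directly against the macro as stated (the `paddedChannels` helper is illustrative, not part of MNN):

```cpp
#include <cassert>

// Ceiling division, as described: UP_DIV(C, 4) computes (C + 3) / 4.
#define UP_DIV(x, y) (((x) + (y) - 1) / (y))

// Padded channel count for the NC4HW4 layout (illustrative helper).
inline int paddedChannels(int c) { return UP_DIV(c, 4) * 4; }
```

A 3-channel tensor pads to 4 (one zero channel), 64 stays at 64, and an awkward count like 5 pads up to 8.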