Heuristic: Alibaba MNN NC4HW4 Data Layout
| Knowledge Sources | |
|---|---|
| Domains | Data_Layout, Optimization, Tensor_Management |
| Last Updated | 2026-02-10 14:00 GMT |
Overview
Understanding MNN's NC4HW4 internal tensor layout where channels are padded to multiples of 4 for SIMD optimization.
Description
MNN internally uses NC4HW4 layout for CV-related operators where the channel dimension is rounded up to multiples of 4 (UP_DIV(C,4)*4). This means Tensor::elementSize() may return a larger value than N*C*H*W. Users must use copyFromHostTensor/copyToHostTensor to convert between user-space NCHW/NHWC and internal NC4HW4. Directly reading or writing to internal tensor memory will produce incorrect results because the padding and interleaving scheme is opaque.
The NC4HW4 format groups every 4 channels together, interleaving them so that adjacent memory locations contain the same spatial position across 4 channels. When the channel count is not a multiple of 4, the remaining slots are zero-padded.
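To make the interleaving concrete, here is a minimal sketch of the packing step, not MNN's actual implementation (the function name `packNC4HW4` is illustrative): it converts an NCHW float buffer into NC4HW4, zero-padding the tail channels when C is not a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative NC4HW4 packing: channels are split into groups of 4, and
// within each group the 4 channel values for one spatial position sit in
// adjacent memory. Remaining lanes in the last group stay zero-padded.
std::vector<float> packNC4HW4(const std::vector<float>& nchw,
                              int n, int c, int h, int w) {
    const int c4 = (c + 3) / 4;  // UP_DIV(C, 4): number of channel groups
    std::vector<float> out(static_cast<size_t>(n) * c4 * h * w * 4, 0.0f);
    for (int b = 0; b < n; ++b)
        for (int ch = 0; ch < c; ++ch)
            for (int y = 0; y < h; ++y)
                for (int x = 0; x < w; ++x) {
                    const size_t src =
                        ((static_cast<size_t>(b) * c + ch) * h + y) * w + x;
                    // Group index, then spatial position, then lane in group.
                    const size_t dst =
                        (((static_cast<size_t>(b) * c4 + ch / 4) * h + y) * w + x) * 4
                        + ch % 4;
                    out[dst] = nchw[src];
                }
    return out;
}
```

For a 1x3x1x2 RGB tensor, the output holds `[R0, G0, B0, 0, R1, G1, B1, 0]`: the three channel values for each pixel are adjacent, with the fourth lane zero-padded.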
Usage
Use this knowledge when:
- You encounter unexpected tensor sizes where `elementSize()` does not equal `N*C*H*W`.
- You get corrupted or garbled output from direct tensor memory access on internal tensors.
- You need to feed input data to MNN models or extract output data and are deciding how to handle the tensor format.
- You are debugging data mismatch issues between MNN inference output and reference implementations.
The Insight (Rule of Thumb)
- Action: Never directly read/write internal tensor memory. Use `copyFromHostTensor()`/`copyToHostTensor()` for data transfer.
- Value: Create host tensors with explicit layout (`Tensor::TENSORFLOW` for NHWC, `Tensor::CAFFE` for NCHW) before copying data in or out.
- Trade-off: Copy overhead vs correctness. NC4HW4 enables vectorized SIMD operations but requires layout conversion at input/output boundaries.
- Diagnostic: If `Tensor::elementSize()` returns a value larger than expected, this confirms the tensor is using NC4HW4 layout with channel padding.
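The diagnostic above reduces to simple arithmetic: compare `elementSize()` against the padded element count implied by the `UP_DIV(C, 4) * 4` rule. A sketch of that check (the helper name `paddedElementCount` is illustrative, not an MNN API):

```cpp
#include <cassert>
#include <cstdint>

// Expected element count for an NC4HW4 tensor: the channel dimension is
// rounded up to the next multiple of 4 before multiplying out the shape.
int64_t paddedElementCount(int n, int c, int h, int w) {
    const int64_t cAligned = ((c + 3) / 4) * 4;  // UP_DIV(C, 4) * 4
    return static_cast<int64_t>(n) * cAligned * h * w;
}
```

For a 1x3x224x224 RGB input, an internal tensor would report 1\*4\*224\*224 = 200704 elements rather than the 150528 a plain `N*C*H*W` computation predicts.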
Reasoning
NC4HW4 aligns the channel dimension to the SIMD width (4 for float32 NEON/SSE), enabling hardware-specific memory access optimizations. By grouping 4 channels together, a single SIMD load instruction can fetch data for 4 channels at the same spatial position, maximizing compute throughput for convolution and other channel-wise operations.
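This access pattern can be illustrated with a per-channel bias add: in NC4HW4, each spatial position touches 4 contiguous floats, exactly one 128-bit vector. The scalar sketch below (function name illustrative; a real kernel would use NEON/SSE intrinsics) shows the inner 4-wide step a compiler or intrinsic can turn into a single SIMD load/add/store.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-channel bias add over an NC4HW4 buffer. `c4Groups` is UP_DIV(C, 4),
// `area` is H*W, and `bias4` holds one bias per padded channel lane.
void biasAddNC4HW4(std::vector<float>& data, const std::vector<float>& bias4,
                   int c4Groups, int area) {
    for (int g = 0; g < c4Groups; ++g)
        for (int p = 0; p < area; ++p)
            for (int lane = 0; lane < 4; ++lane)  // 4 contiguous floats: one SIMD op
                data[(static_cast<size_t>(g) * area + p) * 4 + lane]
                    += bias4[static_cast<size_t>(g) * 4 + lane];
}
```

In NCHW the same operation would read the 4 channel values from locations `H*W` elements apart, defeating a single vector load.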
The internal layout is opaque to users and may differ across backends (CPU, GPU, Metal). The CPU backend uses NC4HW4 for most CV operators; the OpenCL backend may use image2d formats; the Vulkan backend has its own layout conventions. The copyFromHostTensor/copyToHostTensor API abstracts these differences, performing whatever layout transformation is needed for the current backend.
Code evidence from docs/faq.md lines 138-139:
MNN internally uses the NC4HW4 layout for CV-related operators; Tensor::elementSize() may be larger than N*C*H*W. Use copyFromHostTensor/copyToHostTensor to perform the data conversion.
Code evidence from express/Expr.cpp format conversion logic:
```cpp
// Convert between user-space format (NCHW/NHWC) and internal NC4HW4.
// The conversion handles channel padding: UP_DIV(C, 4) * 4.
if (source->getType() == halide_type_of<float>()) {
    MNNPackC4(dst, src, area, channel);
}
```
The UP_DIV(C, 4) macro computes (C + 3) / 4, so a 3-channel RGB tensor becomes 4 channels internally (1 padded channel), and a 64-channel feature map remains 64 channels (no padding needed).
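Those two examples can be checked directly against the macro as stated (the `paddedChannels` helper is illustrative, not part of MNN):

```cpp
#include <cassert>

// Ceiling division, as described: UP_DIV(C, 4) computes (C + 3) / 4.
#define UP_DIV(x, y) (((x) + (y) - 1) / (y))

// Padded channel count for the NC4HW4 layout (illustrative helper).
inline int paddedChannels(int c) { return UP_DIV(c, 4) * 4; }
```

A 3-channel tensor pads to 4 (one zero channel), 64 stays at 64, and an awkward count like 5 pads up to 8.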