
Heuristic:Alibaba MNN NC4HW4 Data Layout

From Leeroopedia



Knowledge Sources
Domains Data_Layout, Optimization, Tensor_Management
Last Updated 2026-02-10 14:00 GMT

Overview

MNN's internal NC4HW4 tensor layout pads the channel dimension to a multiple of 4 so that SIMD instructions can process four channels at once.

Description

MNN internally uses the NC4HW4 layout for CV-related operators: the channel dimension is rounded up to the next multiple of 4 (UP_DIV(C, 4) * 4). As a result, Tensor::elementSize() may return a larger value than N*C*H*W. Users must call copyFromHostTensor/copyToHostTensor to convert between user-space NCHW/NHWC tensors and the internal NC4HW4 representation. Directly reading or writing internal tensor memory produces incorrect results because the padding and interleaving scheme is opaque.

The NC4HW4 format groups every 4 channels together, interleaving them so that adjacent memory locations contain the same spatial position across 4 channels. When the channel count is not a multiple of 4, the remaining slots are zero-padded.
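The interleaving described above can be sketched as a standalone packing routine. This is an illustrative reimplementation of the scheme, not MNN's actual source (MNN's own version is MNNPackC4); the function name `packNC4HW4` is invented here for clarity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch (not MNN source) of NC4HW4 packing: channels are
// grouped in blocks of 4, and within a block the 4 channel values for
// one spatial position sit in adjacent memory slots. Slots for missing
// channels in the last block are zero-padded.
std::vector<float> packNC4HW4(const std::vector<float>& nchw,
                              int N, int C, int H, int W) {
    const int c4 = (C + 3) / 4;  // UP_DIV(C, 4): number of channel blocks
    const int area = H * W;
    std::vector<float> out(static_cast<size_t>(N) * c4 * area * 4, 0.0f);
    for (int n = 0; n < N; ++n)
        for (int c = 0; c < C; ++c)
            for (int p = 0; p < area; ++p) {
                // Source: contiguous per-channel planes (NCHW).
                size_t src = ((size_t)n * C + c) * area + p;
                // Destination: channel block, spatial position, lane (c % 4).
                size_t dst = (((size_t)n * c4 + c / 4) * area + p) * 4 + c % 4;
                out[dst] = nchw[src];
            }
    return out;
}
```

For a 1x3x2x2 RGB tensor, each spatial position occupies 4 adjacent floats in the output: the R, G, and B values followed by a zero-padded fourth lane.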

Usage

Use this knowledge when:

  • You encounter unexpected tensor sizes where elementSize() does not equal N*C*H*W.
  • You get corrupted or garbled output from direct tensor memory access on internal tensors.
  • You need to feed input data to MNN models or extract output data and are deciding how to handle the tensor format.
  • You are debugging data mismatch issues between MNN inference output and reference implementations.

The Insight (Rule of Thumb)

  • Action: Never directly read/write internal tensor memory. Use copyFromHostTensor() / copyToHostTensor() for data transfer.
  • Value: Create host tensors with explicit layout (Tensor::TENSORFLOW for NHWC, Tensor::CAFFE for NCHW) before copying data in or out.
  • Trade-off: Copy overhead vs correctness. NC4HW4 enables vectorized SIMD operations but requires layout conversion at input/output boundaries.
  • Diagnostic: If Tensor::elementSize() returns a value larger than expected, this confirms the tensor is using NC4HW4 layout with channel padding.
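The diagnostic above reduces to simple arithmetic. The following is a hedged sketch of the expected element count under NC4HW4 channel rounding; `nc4hw4ElementCount` is a hypothetical helper for illustration, not MNN's actual Tensor::elementSize() implementation.

```cpp
#include <cassert>

// UP_DIV(x, d) rounds up the integer division, as in MNN's macro.
int upDiv(int x, int d) { return (x + d - 1) / d; }

// Expected element count of an NC4HW4 tensor: channels rounded up to
// a multiple of 4, spatial dimensions unchanged. Hypothetical helper.
int nc4hw4ElementCount(int N, int C, int H, int W) {
    return N * upDiv(C, 4) * 4 * H * W;
}
```

For a 1x3x224x224 RGB input, the dense count is 3 * 224 * 224 = 150528, but the padded count is 4 * 224 * 224 = 200704; seeing the larger value from elementSize() confirms NC4HW4 padding. A 64-channel tensor shows no discrepancy, since 64 is already a multiple of 4.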

Reasoning

NC4HW4 aligns the channel dimension to the SIMD width (4 for float32 NEON/SSE), enabling hardware-specific memory access optimizations. By grouping 4 channels together, a single SIMD load instruction can fetch data for 4 channels at the same spatial position, maximizing compute throughput for convolution and other channel-wise operations.

The internal layout is opaque to users and may differ across backends (CPU, GPU, Metal). The CPU backend uses NC4HW4 for most CV operators; the OpenCL backend may use image2d formats; the Vulkan backend has its own layout conventions. The copyFromHostTensor/copyToHostTensor API abstracts these differences, performing whatever layout transformation is needed for the current backend.
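The copy-based I/O pattern looks roughly as follows. This is an uncompiled sketch assuming MNN's public Interpreter/Tensor API (getSessionInput, Tensor::create, copyFromHostTensor, copyToHostTensor); consult the MNN documentation for exact signatures.

```cpp
// Input side: build a host tensor with an explicit user-space layout
// (Tensor::CAFFE = NCHW), fill it, then let MNN convert on copy.
auto input = interpreter->getSessionInput(session, nullptr);
auto host  = MNN::Tensor::create<float>(input->shape(), nullptr,
                                        MNN::Tensor::CAFFE);
// ... fill host->host<float>() with NCHW data ...
input->copyFromHostTensor(host);  // NCHW -> backend-internal layout

interpreter->runSession(session);

// Output side: mirror the device tensor into a host tensor with the
// desired layout; the copy performs the inverse conversion.
auto output = interpreter->getSessionOutput(session, nullptr);
MNN::Tensor hostOut(output, MNN::Tensor::CAFFE);
output->copyToHostTensor(&hostOut);  // internal layout -> NCHW
// ... read results from hostOut.host<float>() ...
```

Note that the host tensor's DimensionType (CAFFE for NCHW, TENSORFLOW for NHWC) declares the user-space layout; the internal layout never needs to be named, which is what keeps the code portable across backends.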

Documentation evidence from docs/faq.md lines 138-139 (translated from Chinese):

MNN internally uses the NC4HW4 layout for CV-related operators; Tensor::elementSize() may be larger than
N*C*H*W. copyFromHostTensor/copyToHostTensor must be used for data conversion.

Code evidence from express/Expr.cpp format conversion logic:

// Convert between user-space format (NCHW/NHWC) and internal NC4HW4
// The conversion handles channel padding: UP_DIV(C, 4) * 4
if (source->getType() == halide_type_of<float>()) {
    MNNPackC4(dst, src, area, channel);
}

The UP_DIV(C, 4) macro computes (C + 3) / 4, so a 3-channel RGB tensor becomes 4 channels internally (1 padded channel), and a 64-channel feature map remains 64 channels (no padding needed).
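The inverse transform drops the padded lanes to recover a dense NCHW buffer. As with the packing sketch, this is an illustrative reimplementation (the name `unpackNC4HW4` is invented here), not MNN's source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch (not MNN source) of the inverse transform:
// recover a dense NCHW buffer from NC4HW4 by reading each channel's
// lane (c % 4) from its channel block (c / 4) and skipping padding.
std::vector<float> unpackNC4HW4(const std::vector<float>& packed,
                                int N, int C, int H, int W) {
    const int c4 = (C + 3) / 4;  // UP_DIV(C, 4)
    const int area = H * W;
    std::vector<float> out(static_cast<size_t>(N) * C * area);
    for (int n = 0; n < N; ++n)
        for (int c = 0; c < C; ++c)
            for (int p = 0; p < area; ++p) {
                size_t src = (((size_t)n * c4 + c / 4) * area + p) * 4 + c % 4;
                out[((size_t)n * C + c) * area + p] = packed[src];
            }
    return out;
}
```

Unpacking a packed 3-channel buffer yields exactly N*C*H*W elements again; the zero lanes that padded the last channel block are discarded.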
