
Heuristic:Alibaba MNN GPU Tuning Modes

From Leeroopedia




Knowledge Sources
Domains: Optimization, Inference, GPU
Last Updated: 2026-02-10 12:00 GMT

Overview

A guide to selecting OpenCL/Vulkan GPU tuning modes and memory configurations that optimize MNN inference on GPU.

Description

MNN provides five GPU tuning levels that control how aggressively the runtime searches for optimal kernel configurations. Additionally, OpenCL supports buffer vs image memory modes that can differ significantly in performance depending on the GPU vendor. Qualcomm GPUs also support kernel recording for batched dispatch.

Usage

Use when first deploying a model on a new GPU to find optimal settings, or when GPU inference is slower than expected.

The Insight (Rule of Thumb)

  • Tuning Level: Start with MNN_GPU_TUNING_WIDE (default, good balance). Try MNN_GPU_TUNING_HEAVY for production deployment (slow init but optimal kernels).
  • Memory Mode: Test both MNN_GPU_MEMORY_BUFFER and MNN_GPU_MEMORY_IMAGE on your target hardware. Performance varies by GPU vendor.
  • Kernel Recording: MNN_GPU_RECORD_BATCH shares one commandBuffer for all ops (Vulkan). MNN_GPU_RECORD_OP records per-op (OpenCL, Qualcomm only).
  • Cache: Save GPU tuning results via RuntimeManager cache to avoid re-tuning on subsequent runs (2x+ startup speedup).
  • Trade-off: Heavy tuning dramatically increases first-run time but produces optimal kernel selection. Caching eliminates this on subsequent runs.
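The bullets above can be sketched together with MNN's Express `RuntimeManager` API. This is a minimal, hedged example: the model path, tensor names, and cache-file name are placeholders, and exact signatures may differ across MNN versions (it is not runnable without the MNN SDK).

```cpp
#include <MNN/MNNForwardType.h>
#include <MNN/expr/Executor.hpp>
#include <MNN/expr/Module.hpp>

using namespace MNN;
using namespace MNN::Express;

int main() {
    // Pick the GPU backend and combine one tuning level with one memory mode.
    ScheduleConfig sConfig;
    sConfig.type = MNN_FORWARD_OPENCL;
    sConfig.mode = MNN_GPU_TUNING_HEAVY | MNN_GPU_MEMORY_IMAGE;

    std::shared_ptr<Executor::RuntimeManager> rtmgr(
        Executor::RuntimeManager::createRuntimeManager(sConfig));
    // Persist tuning results so later launches skip the heavy kernel search.
    rtmgr->setCache("mnn_gpu_tuning.cache");  // hypothetical file name

    // "input", "output", and "model.mnn" are placeholders for your model.
    std::shared_ptr<Module> net(
        Module::load({"input"}, {"output"}, "model.mnn", rtmgr));
    // ... run inference with net->onForward(...) ...

    // Write any newly tuned kernels back to the cache file.
    rtmgr->updateCache();
    return 0;
}
```

On the trade-off: with `MNN_GPU_TUNING_HEAVY` the first run pays the full search cost, but once `updateCache()` has written the results, subsequent runs load the tuned kernels directly and start much faster.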

Reasoning

Different GPU architectures prefer different kernel configurations. Auto-tuning finds the best config empirically. Image memory mode uses GPU texture units for faster access on some GPUs, while buffer mode is more predictable.

Code Evidence

Tuning and memory mode enum definitions from `MNNForwardType.h:62-78`:

typedef enum {
    // GPU tuning levels - control kernel search intensity (choose one)
    MNN_GPU_TUNING_NONE   = 1 << 0,  // No tuning, use defaults
    MNN_GPU_TUNING_HEAVY  = 1 << 1,  // Exhaustive search, best for production
    MNN_GPU_TUNING_WIDE   = 1 << 2,  // Wide search (default, good balance)
    MNN_GPU_TUNING_NORMAL = 1 << 3,  // Moderate search
    MNN_GPU_TUNING_FAST   = 1 << 4,  // Quick search, good for development

    // GPU memory modes (choose one)
    MNN_GPU_MEMORY_BUFFER = 1 << 6,  // Use buffer memory (OpenCL cl_mem buffer)
    MNN_GPU_MEMORY_IMAGE  = 1 << 7,  // Use image memory (OpenCL cl_mem image2d)

    // GPU kernel recording modes (choose one)
    MNN_GPU_RECORD_OP     = 1 << 8,  // Record per-op (OpenCL, Qualcomm only)
    MNN_GPU_RECORD_BATCH  = 1 << 9,  // Record batch (Vulkan, shared commandBuffer)
} MNNGpuMode;
