Workflow:Tencent Ncnn Vulkan GPU Accelerated Inference

Knowledge Sources	ncnn Vulkan Notes, Vulkan FAQ, Build Guide
Domains	GPU_Computing, Inference, Edge_Deployment
Last Updated	2026-02-09 19:00 GMT

Overview

End-to-end process for enabling and configuring Vulkan GPU compute to accelerate ncnn neural network inference on devices with Vulkan-capable GPUs.

Description

This workflow configures ncnn to use Vulkan compute shaders for GPU-accelerated inference. ncnn includes a built-in Vulkan driver loader (simplevk) that eliminates the need for a Vulkan SDK dependency. The workflow covers building ncnn with Vulkan support, detecting available GPU devices, configuring the inference pipeline for GPU execution, managing GPU memory allocation, and handling multi-GPU scenarios. Vulkan acceleration is supported across Intel, AMD, Nvidia, Qualcomm, Apple, and ARM GPUs on Windows, Linux, Android, macOS, and iOS.

Key outcomes:

GPU-accelerated inference with automatic CPU fallback for unsupported layers
Reduced inference latency on devices with capable GPUs
Proper GPU memory management using Vulkan allocators

Usage

Use this workflow when deploying ncnn on a device with a Vulkan-capable GPU (most GPUs from the last decade) where CPU-only inference is too slow. It is particularly beneficial on mobile devices with powerful GPUs (Qualcomm Adreno, ARM Mali, Apple GPU) and on desktop systems with discrete GPUs.

Execution Steps

Step 1: Build ncnn with Vulkan Support

Configure the CMake build with the NCNN_VULKAN=ON flag to compile ncnn with Vulkan compute support. ncnn's built-in simplevk loader discovers and loads the Vulkan driver at runtime, so no Vulkan SDK installation is required at build time.

Key considerations:

Enable with -DNCNN_VULKAN=ON in the CMake configuration
The Vulkan driver must be installed on the target device at runtime
Verify GPU support using vulkaninfo on the target platform
ncnn gracefully falls back to CPU if Vulkan is unavailable at runtime

Step 2: Detect and Select GPU Device

At runtime, query the available Vulkan-capable GPU devices using ncnn's GPU enumeration API. Select the appropriate device for inference, which is especially important on systems with multiple GPUs (e.g., integrated + discrete GPU in laptops).

Key considerations:

Use ncnn::get_gpu_count() to enumerate available devices
Use ncnn::get_gpu_info(device_index) to query device capabilities
Device selection is done via net.set_vulkan_device(index) or net.opt.vulkan_device_index
Device selection must happen before loading the model
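
The selection policy described above can be sketched independently of ncnn. The following is a minimal C++ sketch, not ncnn's own code: GpuSummary and pick_vulkan_device are hypothetical names, and the device type encoding (0 for discrete, 1 for integrated) is an assumption that should be checked against the GpuInfo documentation of the ncnn version in use.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical plain-data summary of one Vulkan device. In a real program
// these fields would be copied from ncnn::get_gpu_info(i); the struct itself
// is not part of ncnn.
struct GpuSummary {
    int index;         // position in the enumerated device list
    int type;          // assumed encoding: 0 discrete, 1 integrated, 2 virtual, 3 cpu
    std::string name;  // human-readable device name, useful for logging
};

// Return the index of the preferred device, or -1 when the list is empty
// (the caller should then stay on the CPU path).
// Policy: the lowest type value wins, so a discrete GPU beats an integrated
// one; ties keep the device that was enumerated first.
inline int pick_vulkan_device(const std::vector<GpuSummary>& gpus) {
    int best = -1;
    int best_type = 0;
    for (const GpuSummary& g : gpus) {
        assert(g.type >= 0);
        if (best == -1 || g.type < best_type) {
            best = g.index;
            best_type = g.type;
        }
    }
    return best;
}
```

In an application, the list would be filled by looping from 0 to ncnn::get_gpu_count() and reading ncnn::get_gpu_info(i) for each index. A result of -1 means no device was found and inference should stay on the CPU; any other result is the index to pass to net.set_vulkan_device(index) before loading the model.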

Step 3: Configure Vulkan Inference Options

Enable Vulkan compute in the Net options and configure GPU memory allocation strategy. Set the use_vulkan_compute flag and optionally configure custom Vulkan allocators for fine-grained memory control.

Key considerations:

Set net.opt.use_vulkan_compute = true before loading the model
For advanced usage, acquire blob and staging allocators from the ncnn::VulkanDevice and assign them to the Net options
Custom allocators enable memory pool reuse across multiple inferences
Reclaim allocators after inference is complete to prevent memory leaks
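
One way to make sure allocators are always reclaimed is to tie them to a scope. The sketch below is an illustration under stated assumptions, not part of ncnn: ScopedVkAllocators is a hypothetical helper, and it only assumes that the device type exposes acquire_blob_allocator(), reclaim_blob_allocator(ptr), acquire_staging_allocator(), and reclaim_staging_allocator(ptr) as const member functions.

```cpp
#include <cassert>
#include <utility>

// Hypothetical RAII guard: borrows a blob allocator and a staging allocator
// from a device on construction and hands both back on destruction.
// Device is a template parameter, so the guard depends only on the four
// acquire/reclaim member functions named in the text above.
template <typename Device>
class ScopedVkAllocators {
public:
    using BlobPtr =
        decltype(std::declval<const Device&>().acquire_blob_allocator());
    using StagingPtr =
        decltype(std::declval<const Device&>().acquire_staging_allocator());

    explicit ScopedVkAllocators(const Device& dev)
        : dev_(dev),
          blob_(dev.acquire_blob_allocator()),
          staging_(dev.acquire_staging_allocator()) {
        assert(blob_ && staging_);  // both allocators must be valid
    }

    // Reclaim on scope exit, including early returns and exceptions.
    ~ScopedVkAllocators() {
        dev_.reclaim_blob_allocator(blob_);
        dev_.reclaim_staging_allocator(staging_);
    }

    // Non-copyable: each allocator must be reclaimed exactly once.
    ScopedVkAllocators(const ScopedVkAllocators&) = delete;
    ScopedVkAllocators& operator=(const ScopedVkAllocators&) = delete;

    BlobPtr blob() const { return blob_; }
    StagingPtr staging() const { return staging_; }

private:
    const Device& dev_;
    BlobPtr blob_;
    StagingPtr staging_;
};
```

With ncnn, Device would presumably be ncnn::VulkanDevice, and the pointers returned by blob() and staging() would be assigned to the allocator fields in net.opt before creating an Extractor. The guard must outlive every Extractor that uses those allocators, since its destructor returns them to the device.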

Step 4: Load Model and Execute GPU Inference

Load the ncnn model normally using load_param and load_model. Create an Extractor and run inference. When Vulkan compute is enabled, ncnn automatically uploads input data to GPU, executes compute shaders for supported layers, and downloads results back to CPU memory.

Key considerations:

The inference API is identical to CPU inference; no code changes for the forward pass
ncnn automatically handles CPU-to-GPU and GPU-to-CPU data transfers
Layers without Vulkan shader implementations fall back to CPU automatically
First inference may be slower due to Vulkan pipeline compilation and caching
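
Because the first inference may be slower than later ones, latency is usually measured after a few untimed warm-up runs. The helper below is a minimal sketch of that idea; average_latency_ms is a hypothetical name, and the callable passed to it stands in for one complete forward pass through an Extractor.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock milliseconds per call of run_once, measured over
// timed_runs calls that follow warmup_runs untimed calls.
inline double average_latency_ms(const std::function<void()>& run_once,
                                 int warmup_runs, int timed_runs) {
    assert(warmup_runs >= 0 && timed_runs > 0);

    // Warm-up: absorbs one-time costs so they do not skew the average.
    for (int i = 0; i < warmup_runs; ++i) run_once();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_runs; ++i) run_once();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timed_runs;
}
```

Comparing the result with warmup_runs set to 0 against a run with a few warm-up iterations gives a rough estimate of how much the one-time cost contributes on a given device.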

Step 5: Optimize GPU Memory for Production

For production deployment, use Vulkan pipeline caching to avoid repeated shader compilation, and configure zero-copy memory on unified memory devices (common on mobile SoCs). Use VkMat::mapped() to access GPU tensor data without explicit download on devices with shared CPU-GPU memory.

Key considerations:

Pipeline caching via PipelineCache avoids recompiling Vulkan shaders across runs
On unified memory devices (most mobile SoCs), zero-copy access eliminates transfer overhead
Use VkMat::mapped() to get a CPU-accessible view of GPU tensor data on unified memory (VkMat::mapped_ptr() returns the raw pointer)
Monitor GPU memory usage on memory-constrained mobile devices
