Workflow:Tencent Ncnn Vulkan GPU Accelerated Inference
| Knowledge Sources | |
|---|---|
| Domains | GPU_Computing, Inference, Edge_Deployment |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
End-to-end process for enabling and configuring Vulkan GPU compute to accelerate ncnn neural network inference on devices with Vulkan-capable GPUs.
Description
This workflow configures ncnn to use Vulkan compute shaders for GPU-accelerated inference. ncnn includes a built-in Vulkan driver loader (simplevk) that eliminates the need for a Vulkan SDK dependency. The workflow covers building ncnn with Vulkan support, detecting available GPU devices, configuring the inference pipeline for GPU execution, managing GPU memory allocation, and handling multi-GPU scenarios. Vulkan acceleration is supported across Intel, AMD, Nvidia, Qualcomm, Apple, and ARM GPUs on Windows, Linux, Android, macOS, and iOS.
Key outcomes:
- GPU-accelerated inference with automatic CPU fallback for unsupported layers
- Reduced inference latency on devices with capable GPUs
- Proper GPU memory management using Vulkan allocators
Usage
Use this workflow when you deploy ncnn on a device with a Vulkan-capable GPU (most GPUs from the last decade) and CPU-only inference is too slow. It is particularly beneficial on mobile devices with powerful GPUs (Qualcomm Adreno, ARM Mali, Apple GPU) and on desktop systems with discrete GPUs.
Execution Steps
Step 1: Build ncnn with Vulkan Support
Configure the CMake build with the NCNN_VULKAN=ON flag to compile ncnn with Vulkan compute support. ncnn's built-in simplevk loader discovers and loads the Vulkan driver at runtime, so no Vulkan SDK installation is required at build time.
Key considerations:
- Enable with -DNCNN_VULKAN=ON in the CMake configuration
- The Vulkan driver must be installed on the target device at runtime
- Verify GPU support using vulkaninfo on the target platform
- ncnn gracefully falls back to CPU if Vulkan is unavailable at runtime
Step 2: Detect and Select GPU Device
At runtime, query the available Vulkan-capable GPU devices using ncnn's GPU enumeration API. Select the appropriate device for inference, which is especially important on systems with multiple GPUs (e.g., integrated + discrete GPU in laptops).
Key considerations:
- Use ncnn::get_gpu_count() to enumerate available devices
- Use ncnn::get_gpu_info(device_index) to query device capabilities
- Device selection is done via net.set_vulkan_device(index) or net.opt.vulkan_device_index
- Device selection must happen before loading the model
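The enumeration and selection described in this step can be sketched as follows. This is a minimal illustration, not ncnn's own selection logic: `GpuSummary`, `pick_device`, `select_gpu`, and the `EXAMPLE_WITH_NCNN` macro are hypothetical names introduced here. The "prefer discrete over integrated" policy is an assumption. It relies on `GpuInfo::type()` reporting 0 for discrete, 1 for integrated, 2 for virtual, and 3 for CPU devices, which should be checked against the `gpu.h` of the ncnn version in use. The guarded part assumes an ncnn build configured with NCNN_VULKAN=ON.

```cpp
#include <vector>

// Hypothetical plain-data summary of one Vulkan device (not an ncnn type).
struct GpuSummary
{
    int index; // ncnn device index
    int type;  // assumed GpuInfo::type() value: 0 discrete, 1 integrated, 2 virtual, 3 cpu
};

// Returns the index of the preferred device, or -1 when the list is empty.
// Policy (an assumption of this sketch): the lowest type value wins, so a
// discrete GPU beats an integrated one; ties keep the first device seen.
int pick_device(const std::vector<GpuSummary>& gpus)
{
    int best_index = -1;
    int best_type = 0;
    for (const GpuSummary& g : gpus)
    {
        if (best_index == -1 || g.type < best_type)
        {
            best_index = g.index;
            best_type = g.type;
        }
    }
    return best_index;
}

#ifdef EXAMPLE_WITH_NCNN // hypothetical macro: define it when building against ncnn
#include "gpu.h"
#include "net.h"

// Enumerates devices through ncnn and applies the choice before any model is
// loaded. Returns false when no Vulkan device is present.
bool select_gpu(ncnn::Net& net)
{
    std::vector<GpuSummary> gpus;
    const int count = ncnn::get_gpu_count();
    for (int i = 0; i < count; i++)
        gpus.push_back({i, ncnn::get_gpu_info(i).type()});

    const int index = pick_device(gpus);
    if (index < 0)
        return false;

    net.set_vulkan_device(index);
    return true;
}
#endif
```

Keeping the policy in a plain function makes it easy to swap in a different rule (for example, preferring the integrated GPU to save power on a laptop) without touching the ncnn calls.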
Step 3: Configure Vulkan Inference Options
Enable Vulkan compute in the Net options and configure GPU memory allocation strategy. Set the use_vulkan_compute flag and optionally configure custom Vulkan allocators for fine-grained memory control.
Key considerations:
- Set net.opt.use_vulkan_compute = true before loading the model
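- For advanced usage, acquire custom blob and staging allocators from the ncnn::VulkanDevice (the ncnn device wrapper, not the raw Vulkan VkDevice handle) and assign them to the inference options
- Custom allocators enable memory pool reuse across multiple inferences
- Reclaim the allocators through the same ncnn::VulkanDevice after inference is complete to prevent memory leaks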
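One way to make the acquire/reclaim pairing hard to get wrong is a small RAII guard. The sketch below is an illustration under assumptions, not part of ncnn: `AllocatorLease`, `infer_with_lease`, the `EXAMPLE_WITH_NCNN` macro, and the blob names `"in0"` and `"out0"` are hypothetical. The guard is templated on the device type so it can be exercised without a GPU. With ncnn the device would be `ncnn::VulkanDevice`, whose `acquire_blob_allocator()`, `acquire_staging_allocator()`, and matching `reclaim_*` methods it calls. The guarded part also assumes the per-extractor allocator setters available in Vulkan-enabled builds.

```cpp
#include <utility>

// RAII lease: acquires a blob and a staging allocator on construction and
// reclaims both on destruction, so an early return cannot leak them.
template<typename Device>
class AllocatorLease
{
public:
    using Allocator = decltype(std::declval<const Device&>().acquire_blob_allocator());

    explicit AllocatorLease(const Device& dev)
        : dev_(dev),
          blob_(dev.acquire_blob_allocator()),
          staging_(dev.acquire_staging_allocator())
    {
    }

    ~AllocatorLease()
    {
        dev_.reclaim_blob_allocator(blob_);
        dev_.reclaim_staging_allocator(staging_);
    }

    // A lease owns its allocators exclusively.
    AllocatorLease(const AllocatorLease&) = delete;
    AllocatorLease& operator=(const AllocatorLease&) = delete;

    Allocator blob() const { return blob_; }
    Allocator staging() const { return staging_; }

private:
    const Device& dev_;
    Allocator blob_;
    Allocator staging_;
};

#ifdef EXAMPLE_WITH_NCNN // hypothetical macro: define it when building against ncnn
#include "gpu.h"
#include "net.h"

// Runs one inference with leased allocators on an already loaded net.
// "in0" and "out0" are placeholder blob names.
int infer_with_lease(ncnn::Net& net, const ncnn::Mat& in, ncnn::Mat& out)
{
    const ncnn::VulkanDevice* vkdev = net.vulkan_device();
    if (!vkdev)
        return -1; // Vulkan is not active for this net

    // The lease is constructed before the extractor, so it is destroyed after it.
    AllocatorLease<ncnn::VulkanDevice> lease(*vkdev);

    ncnn::Extractor ex = net.create_extractor();
    ex.set_blob_vkallocator(lease.blob());
    ex.set_workspace_vkallocator(lease.blob());
    ex.set_staging_vkallocator(lease.staging());
    ex.input("in0", in);
    return ex.extract("out0", out);
}
#endif
```

Declaring the lease before the extractor matters: locals are destroyed in reverse order, so the extractor releases its GPU blobs before the allocators go back to the device.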
Step 4: Load Model and Execute GPU Inference
Load the ncnn model normally using load_param and load_model. Create an Extractor and run inference. When Vulkan compute is enabled, ncnn automatically uploads input data to GPU, executes compute shaders for supported layers, and downloads results back to CPU memory.
Key considerations:
- The inference API is identical to CPU inference; no code changes for the forward pass
- ncnn automatically handles CPU-to-GPU and GPU-to-CPU data transfers
- Layers without Vulkan shader implementations fall back to CPU automatically
- The first inference may be slower than later ones because of Vulkan pipeline compilation and caching, so exclude it when measuring latency
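Because the first run can include one-time setup cost, latency is usually measured after a few warm-up runs. The sketch below shows one way to do that around the load-and-extract sequence from this step. `mean_latency_ms`, `benchmark`, the `EXAMPLE_WITH_NCNN` macro, the blob names `"in0"` and `"out0"`, and the warm-up and run counts are all hypothetical choices for illustration.

```cpp
#include <chrono>

// Calls `infer` `warmup` times untimed, then returns the mean wall-clock
// latency in milliseconds over `runs` timed calls (0.0 when runs <= 0).
template<typename Infer>
double mean_latency_ms(Infer&& infer, int warmup, int runs)
{
    for (int i = 0; i < warmup; i++)
        infer();

    if (runs <= 0)
        return 0.0;

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; i++)
        infer();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;

    return elapsed.count() / runs;
}

#ifdef EXAMPLE_WITH_NCNN // hypothetical macro: define it when building against ncnn
#include "net.h"

// Loads a model with Vulkan enabled and returns the mean latency in
// milliseconds, or -1.0 if loading fails.
double benchmark(const char* param_path, const char* bin_path, const ncnn::Mat& input)
{
    ncnn::Net net;
    net.opt.use_vulkan_compute = true; // must be set before loading the model

    if (net.load_param(param_path) != 0 || net.load_model(bin_path) != 0)
        return -1.0;

    // The forward pass is the same code as on CPU; ncnn handles the transfers.
    return mean_latency_ms(
        [&]() {
            ncnn::Extractor ex = net.create_extractor();
            ex.input("in0", input); // placeholder blob name
            ncnn::Mat out;
            ex.extract("out0", out); // placeholder blob name
        },
        3, 20);
}
#endif
```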
Step 5: Optimize GPU Memory for Production
For production deployment, use Vulkan pipeline caching to avoid repeated shader compilation, and use zero-copy memory access on unified memory devices (common on mobile SoCs). On devices with shared CPU-GPU memory, VkMat::mapped() exposes GPU tensor data to the CPU without an explicit download.
Key considerations:
- Pipeline caching via PipelineCache avoids recompiling Vulkan shaders across runs
- On unified memory devices (most mobile SoCs), zero-copy access eliminates transfer overhead
- Use VkMat::mapped() to get a CPU-accessible Mat view of GPU tensor data on unified memory (VkMat::mapped_ptr() returns the raw pointer)
- Monitor GPU memory usage on memory-constrained mobile devices
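The zero-copy path only applies when the tensor's backing memory is host-visible, so production code generally needs a fallback. The sketch below shows that decision in isolation. `host_view` and the `download` callback are hypothetical names, and the helper is templated so it does not depend on a particular ncnn version. It assumes that the GPU matrix type offers `mapped()` returning a host matrix with `empty()`, and that an empty result means the memory is not mappable, which is how `ncnn::VkMat::mapped()` is understood to behave.

```cpp
// Returns a host-readable matrix for `gpu`: the zero-copy mapped view when
// the backing memory is host-visible, otherwise whatever `download` produces.
// `download` is a caller-supplied callable that performs the explicit
// GPU-to-CPU transfer for the library version in use.
template<typename GpuMat, typename Download>
auto host_view(const GpuMat& gpu, Download&& download) -> decltype(gpu.mapped())
{
    auto view = gpu.mapped();
    if (!view.empty())
        return view; // zero-copy: no transfer was needed

    return download(gpu); // fall back to an explicit download
}
```

With ncnn, `GpuMat` would be `ncnn::VkMat`. Two caveats apply under the same assumptions: the mapped view should only be read after the GPU work that produces it has completed, and its contents may be in a packed or fp16 layout depending on the inference options, so it is not always a plain float array.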