Heuristic: Tencent ncnn Vulkan Pipeline Warmup
| Knowledge Sources | |
|---|---|
| Domains | GPU_Compute, Optimization |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Vulkan pipeline warm-up strategy using PipelineCache to avoid first-inference latency spikes caused by GPU shader compilation and pipeline creation.
Description
When ncnn runs GPU inference for the first time, each layer's Vulkan compute pipeline must be created, which involves compiling SPIR-V shaders and allocating GPU resources. This one-time cost can add significant latency to the first inference call. ncnn's PipelineCache system caches compiled pipelines so that subsequent inferences reuse pre-compiled pipelines. For production deployment, a warm-up inference should be run during application initialization (not during the first user-visible request) to front-load this compilation cost. The pipeline cache also supports descriptor set pooling and memory pre-allocation, further reducing per-inference overhead.
Usage
Use this heuristic when deploying ncnn with Vulkan GPU inference in production and first-inference latency matters. Run a dummy inference during application startup to warm up the pipeline cache. Also relevant when benchmarking GPU inference: always include warm-up runs before measurement.
The Insight (Rule of Thumb)
- Action: Run 1-8 warm-up inferences during application initialization, before handling real requests. ncnn's own benchmark tool (benchncnn) uses `g_warmup_loop_count = 8`.
- Value: Warm-up should use the same input shape as production inputs. PipelineCache is created per-VkDevice and shared across Extractors.
- Trade-off: Warm-up adds startup time (typically 0.5-2 seconds per model) but eliminates first-request latency spikes that can be 10-100x slower than steady-state.
- Memory: Use `use_local_pool_allocator = true` (default) and `use_shader_local_memory = true` (default) for optimal GPU memory management.
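The warm-up step above can be sketched with the public ncnn API. This is a minimal sketch, assuming your own model files and blob names; `model.param`, `model.bin`, `data`, `output`, and the 224x224x3 input shape are placeholders, not values from the source:

```cpp
#include "net.h" // ncnn public header

// Sketch: run warm-up inferences at startup so Vulkan pipeline creation
// happens before the first user-visible request. Model paths and blob
// names ("data"/"output") are placeholders for your own network.
static void warmup(ncnn::Net& net, int loops = 8)
{
    // Use the SAME input shape as production inputs, so the cached
    // pipelines match the ones real requests will use.
    ncnn::Mat in(224, 224, 3);
    in.fill(0.5f);

    for (int i = 0; i < loops; i++)
    {
        ncnn::Extractor ex = net.create_extractor();
        ex.input("data", in);
        ncnn::Mat out;
        ex.extract("output", out); // first iteration pays pipeline creation
    }
}

int main()
{
    ncnn::Net net;
    net.opt.use_vulkan_compute = true; // enable Vulkan GPU inference
    net.load_param("model.param");
    net.load_model("model.bin");

    warmup(net); // front-load shader compilation during startup

    // ... handle real requests; pipelines are now cached ...
    return 0;
}
```

Running the warm-up inside initialization (rather than lazily on first request) keeps the 0.5-2 s compilation cost out of the user-visible latency path.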
Reasoning
Vulkan compute pipelines involve GPU driver shader compilation, which is a one-time but expensive operation. The ncnn benchmark tool uses 8 warm-up iterations before timing because the first few iterations include pipeline creation overhead. The PipelineCache system stores compiled pipelines indexed by shader type and specialization constants, making cache lookups O(1). For mobile deployment, the warm-up pattern is critical because mobile GPU drivers have particularly slow shader compilation. The ncnn allocator system (VkBlobAllocator, VkStagingAllocator) also benefits from warm-up because the first allocation triggers pool initialization.
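The cache-lookup idea can be illustrated with a self-contained sketch (not ncnn's actual PipelineCache class): pipelines are stored in a map keyed by shader index plus specialization constants, so only the first miss for a given key pays the compile cost and every later lookup reuses the stored pipeline:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Illustrative stand-in for a compiled compute pipeline.
struct Pipeline
{
    int shader_index;
    std::vector<uint32_t> spec_constants;
};

// Minimal cache sketch (NOT ncnn's actual PipelineCache): keyed by
// shader index + specialization constants; "compile" only on a miss.
class PipelineCacheSketch
{
public:
    int compile_count = 0; // how many times the expensive path ran

    const Pipeline& get(int shader_index, const std::vector<uint32_t>& spec)
    {
        auto key = std::make_pair(shader_index, spec);
        auto it = cache_.find(key);
        if (it == cache_.end())
        {
            compile_count++;                // expensive path: real code
            Pipeline p{shader_index, spec}; // would compile SPIR-V here
            it = cache_.emplace(key, p).first;
        }
        return it->second;                  // cached path: cheap lookup
    }

private:
    std::map<std::pair<int, std::vector<uint32_t>>, Pipeline> cache_;
};
```

A warm-up inference walks every layer once, so after it completes every (shader, specialization) key the model needs is already resident in the cache.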
Benchmark warm-up pattern from `benchmark/benchncnn.cpp`:

```cpp
int g_warmup_loop_count = 8;       // warm-up before measurement
int g_loop_count = 4;              // actual measurement loops
bool g_enable_cooling_down = true; // 10s sleep for thermal reset
```
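The same warm-up-then-measure pattern can be reproduced in a plain C++ timing harness. This is a generic sketch following benchncnn's structure, not benchncnn itself; `benchmark_ms` and its parameters are hypothetical names:

```cpp
#include <chrono>
#include <functional>

// Generic harness sketch following benchncnn's pattern: run warm-up
// iterations first so one-time pipeline/allocator setup cost is
// excluded, then time only the steady-state iterations.
double benchmark_ms(const std::function<void()>& infer,
                    int warmup_loops = 8, int timed_loops = 4)
{
    for (int i = 0; i < warmup_loops; i++)
        infer(); // untimed: absorbs pipeline creation and pool init

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < timed_loops; i++)
        infer(); // timed: steady-state only
    auto t1 = std::chrono::high_resolution_clock::now();

    return std::chrono::duration<double, std::milli>(t1 - t0).count()
           / timed_loops; // average ms per inference
}
```

Without the warm-up loop, the first timed iteration would fold 10-100x of pipeline-creation overhead into the reported average.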
Default GPU memory options from `src/option.cpp:59-62`:

```cpp
use_local_pool_allocator = true;
use_shader_local_memory = true;
use_cooperative_matrix = true;
```