Implementation:Tencent Ncnn Benchmark Timer
| Knowledge Sources | |
|---|---|
| Domains | Performance Profiling, Cross Platform |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Implements cross-platform high-resolution timing utilities and per-layer benchmarking output for profiling ncnn inference performance.
Description
The benchmark module consists of a header (benchmark.h, 28 lines) and an implementation (benchmark.cpp, 216 lines) within the ncnn namespace.
get_current_time() returns a timestamp in milliseconds and selects one of three timing backends at compile time:
- C++11: std::chrono::high_resolution_clock when available (converting microseconds to milliseconds)
- Windows: QueryPerformanceCounter / QueryPerformanceFrequency for high-resolution timing without C++11
- POSIX: gettimeofday as a fallback on Unix systems
The C++11 path is selected via a compile-time check: __cplusplus >= 201103L or _MSVC_LANG >= 201103L, excluding RISC-V and SIMPLESTL builds.
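The selection order described above can be sketched roughly as follows. This is a minimal illustration, not the ncnn source: the function name current_time_ms and the macro SKETCH_USE_CHRONO are hypothetical, and the SIMPLESTL exclusion is left out because that configuration only exists inside an ncnn build.

```cpp
#if (__cplusplus >= 201103L || (defined(_MSVC_LANG) && _MSVC_LANG >= 201103L)) && !defined(__riscv)
#include <chrono>
#define SKETCH_USE_CHRONO 1 // hypothetical macro, used only by this sketch
#elif defined(_WIN32)
#include <windows.h>
#else
#include <sys/time.h>
#endif

// Hypothetical stand-in for ncnn::get_current_time(); returns milliseconds.
static double current_time_ms()
{
#if defined(SKETCH_USE_CHRONO)
    // C++11 path: read the clock in microseconds, then scale to milliseconds.
    auto now = std::chrono::high_resolution_clock::now();
    auto usec = std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch());
    return usec.count() / 1000.0;
#elif defined(_WIN32)
    // Windows path: counter ticks divided by ticks-per-second, scaled to ms.
    LARGE_INTEGER freq;
    LARGE_INTEGER pc;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&pc);
    return pc.QuadPart * 1000.0 / freq.QuadPart;
#else
    // POSIX fallback: seconds and microseconds combined into milliseconds.
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
#endif
}
```

Only the difference between two returned values is meaningful: the epoch of each backend differs, so absolute timestamps are not comparable across platforms.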
sleep() similarly dispatches to std::this_thread::sleep_for, Windows Sleep, POSIX usleep, or nanosleep depending on platform and C++11 availability. The default sleep duration is 1000 milliseconds.
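A comparable sketch of the sleep dispatch is shown below. The name sleep_ms and the macro SKETCH_USE_STD_THREAD are hypothetical, and for brevity the non-C++11 POSIX branch uses only nanosleep; how the real implementation chooses between usleep and nanosleep is not reproduced here.

```cpp
#if (__cplusplus >= 201103L || (defined(_MSVC_LANG) && _MSVC_LANG >= 201103L))
#include <chrono>
#include <thread>
#define SKETCH_USE_STD_THREAD 1 // hypothetical macro, used only by this sketch
#elif defined(_WIN32)
#include <windows.h>
#else
#include <time.h>
#endif

// Hypothetical stand-in for ncnn::sleep(); the 1000 ms default mirrors
// the documented signature.
static void sleep_ms(unsigned long long int milliseconds = 1000)
{
#if defined(SKETCH_USE_STD_THREAD)
    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
#elif defined(_WIN32)
    Sleep((DWORD)milliseconds);
#else
    // Split the duration into whole seconds and leftover nanoseconds.
    struct timespec ts;
    ts.tv_sec = (time_t)(milliseconds / 1000);
    ts.tv_nsec = (long)((milliseconds % 1000) * 1000000);
    nanosleep(&ts, 0);
#endif
}
```

Calling sleep_ms() with no argument blocks for one second, matching the default described above.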
The benchmark functions (compiled only when NCNN_BENCHMARK is enabled) print per-layer profiling information to stderr:
- The basic overload prints layer type, layer name, and elapsed time in milliseconds.
- The detailed overload additionally formats the input and output tensor shapes (1D through 4D, including the element packing factor). For convolution-family layers (Convolution, ConvolutionDepthWise, Deconvolution, DeconvolutionDepthWise, and their 3D variants), it also prints the kernel size and stride, which it reads by downcasting the Layer pointer to the specific layer type.
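The shape formatting in the detailed overload can be illustrated with a small sketch. BlobShape and format_shape are hypothetical names, and the format strings are an assumption modelled on the sample output shown later on this page; the actual field order and padding used by benchmark.cpp may differ.

```cpp
#include <cstdio>
#include <string>

// Hypothetical minimal stand-in for the shape-related fields of ncnn::Mat.
struct BlobShape
{
    int dims;     // number of dimensions, 1 through 4
    int w;        // width
    int h;        // height
    int d;        // depth (4D only)
    int c;        // channels
    int elempack; // elements packed together per slot
};

// Format a shape as "[dims... *elempack]" with each dimension padded to 3 columns.
static std::string format_shape(const BlobShape& s)
{
    char buf[64] = "";
    switch (s.dims)
    {
    case 1:
        std::snprintf(buf, sizeof(buf), "[%3d *%d]", s.w, s.elempack);
        break;
    case 2:
        std::snprintf(buf, sizeof(buf), "[%3d, %3d *%d]", s.w, s.h, s.elempack);
        break;
    case 3:
        std::snprintf(buf, sizeof(buf), "[%3d, %3d, %3d *%d]", s.w, s.h, s.c, s.elempack);
        break;
    case 4:
        std::snprintf(buf, sizeof(buf), "[%3d, %3d, %3d, %3d *%d]", s.w, s.h, s.d, s.c, s.elempack);
        break;
    default:
        break; // unsupported rank: return an empty string
    }
    return std::string(buf);
}
```

Padding each dimension to a fixed width keeps the columns of consecutive layer lines aligned, which makes long profiling logs easier to scan.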
Usage
Use get_current_time() to measure inference latency in any ncnn application: take a timestamp before and after the code of interest and subtract. Enable NCNN_BENCHMARK at build time to have ncnn print a timing line for every layer during inference, which shows which layers dominate the total run time.
Code Reference
Source Location
- Repository: Tencent_Ncnn
- File: src/benchmark.cpp
- File: src/benchmark.h
Signature
namespace ncnn {
// Get current timestamp in milliseconds
NCNN_EXPORT double get_current_time();
// Sleep for specified milliseconds (default 1000)
NCNN_EXPORT void sleep(unsigned long long int milliseconds = 1000);
#if NCNN_BENCHMARK
// Basic per-layer timing output
NCNN_EXPORT void benchmark(const Layer* layer, double start, double end);
// Detailed per-layer timing with shape info
NCNN_EXPORT void benchmark(const Layer* layer, const Mat& bottom_blob,
                           Mat& top_blob, double start, double end);
#endif
} // namespace ncnn
Import
#include "benchmark.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| milliseconds | unsigned long long int | For sleep | Duration to sleep in milliseconds (default 1000) |
| layer | const Layer* | For benchmark | Pointer to the layer being profiled |
| start | double | For benchmark | Timestamp (from get_current_time) before layer execution |
| end | double | For benchmark | Timestamp (from get_current_time) after layer execution |
| bottom_blob | const Mat& | For detailed benchmark | Input tensor to the layer |
| top_blob | Mat& | For detailed benchmark | Output tensor from the layer |
Outputs
| Name | Type | Description |
|---|---|---|
| return (get_current_time) | double | Current timestamp in milliseconds |
| stderr output (benchmark) | text | Formatted profiling line: layer type, name, time, shapes, and conv parameters |
Usage Examples
Measuring Inference Latency
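#include <stdio.h>
#include "benchmark.h"
#include "net.h"
ncnn::Net net;
// ... load model ...
ncnn::Mat input_mat; // ... fill with preprocessed input data ...
ncnn::Mat output_mat;
// Timed region: extractor creation, input binding, and the forward pass
double start = ncnn::get_current_time();
ncnn::Extractor ex = net.create_extractor();
ex.input("data", input_mat);
ex.extract("output", output_mat);
double end = ncnn::get_current_time();
fprintf(stderr, "Inference time: %.2f ms\n", end - start);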
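A single timed run can be noisy, so latency is often reported over several runs after a few untimed warm-up iterations. The sketch below shows one way to do that. TimingStats, time_runs, and now_ms are hypothetical names; now_ms is a std::chrono stand-in so the example compiles without ncnn, and in an ncnn application ncnn::get_current_time() would take its place.

```cpp
#include <chrono>
#include <functional>

// Stand-in clock returning milliseconds; replace with
// ncnn::get_current_time() inside an ncnn application.
static double now_ms()
{
    auto t = std::chrono::steady_clock::now().time_since_epoch();
    return std::chrono::duration<double, std::milli>(t).count();
}

// Hypothetical result type: minimum, maximum, and mean time per run.
struct TimingStats
{
    double min_ms;
    double max_ms;
    double avg_ms;
};

// Hypothetical helper: call `work` `warmup` times untimed, then time
// `loops` further calls individually.
static TimingStats time_runs(const std::function<void()>& work, int warmup, int loops)
{
    for (int i = 0; i < warmup; i++)
        work();

    TimingStats s = {1e300, 0.0, 0.0};
    double total = 0.0;
    for (int i = 0; i < loops; i++)
    {
        double start = now_ms();
        work();
        double elapsed = now_ms() - start;

        if (elapsed < s.min_ms) s.min_ms = elapsed;
        if (elapsed > s.max_ms) s.max_ms = elapsed;
        total += elapsed;
    }
    s.avg_ms = loops > 0 ? total / loops : 0.0;
    return s;
}
```

In practice, `work` would wrap the extractor creation, input, and extract calls from the example above in a lambda.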
Per-Layer Benchmarking (Automatic with NCNN_BENCHMARK)
// When built with NCNN_BENCHMARK=ON, ncnn automatically calls
// benchmark() for each layer during inference, producing output like:
//
// Convolution conv1 0.42ms | [224, 224, 3 *1] -> [112, 112, 64 *1] kernel: 7 x 7 stride: 2 x 2
// ReLU relu1 0.01ms | [112, 112, 64 *1] -> [112, 112, 64 *1]
// Pooling pool1 0.15ms | [112, 112, 64 *1] -> [ 56, 56, 64 *1]