Implementation:Tencent Ncnn Benchmark Timer

Knowledge Sources	Tencent_Ncnn
Domains	Performance Profiling, Cross Platform
Last Updated	2026-02-09 19:00 GMT

Overview

Implements cross-platform high-resolution timing utilities and per-layer benchmarking output for profiling ncnn inference performance.

Description

The benchmark module consists of a header (benchmark.h, 28 lines) and an implementation (benchmark.cpp, 216 lines) within the ncnn namespace.

get_current_time() selects one of three high-resolution timing strategies at compile time:

C++11: std::chrono::high_resolution_clock when available (converting microseconds to milliseconds)
Windows: QueryPerformanceCounter / QueryPerformanceFrequency for high-resolution timing without C++11
POSIX: gettimeofday as a fallback on Unix systems

The C++11 path is selected via a compile-time check: __cplusplus >= 201103L or _MSVC_LANG >= 201103L, excluding RISC-V and SIMPLESTL builds.
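The selection logic above can be sketched as a single function with preprocessor branches. This is a simplified illustration, not the ncnn source: the name sketch_current_time_ms and the macro SKETCH_USE_CHRONO are hypothetical, and the RISC-V and SIMPLESTL exclusions are omitted for brevity.

```cpp
// Hypothetical sketch of a three-way timer dispatch; not the ncnn source.
#if (__cplusplus >= 201103L) || (defined(_MSVC_LANG) && _MSVC_LANG >= 201103L)
#define SKETCH_USE_CHRONO 1
#include <chrono>
#elif defined(_WIN32)
#include <windows.h>
#else
#include <sys/time.h>
#endif

// Returns a timestamp in milliseconds.
// Only the difference between two calls is meaningful.
double sketch_current_time_ms()
{
#if defined(SKETCH_USE_CHRONO)
    // C++11 path: read the clock in microseconds, then scale to milliseconds.
    auto now = std::chrono::high_resolution_clock::now();
    auto usec = std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch());
    return usec.count() / 1000.0;
#elif defined(_WIN32)
    // Windows path: counter ticks divided by ticks per second.
    LARGE_INTEGER freq;
    LARGE_INTEGER pc;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&pc);
    return pc.QuadPart * 1000.0 / freq.QuadPart;
#else
    // POSIX fallback: seconds plus microseconds since the epoch.
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
#endif
}
```

Because all three branches return milliseconds as a double, callers can subtract two timestamps without caring which branch was compiled in.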

sleep() similarly dispatches to std::this_thread::sleep_for, Windows Sleep, POSIX usleep, or nanosleep depending on platform and C++11 availability. The default sleep duration is 1000 milliseconds.
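A comparable sketch for the sleep dispatch is shown below. It is an assumption-laden simplification rather than the ncnn implementation: sketch_sleep_ms and SKETCH_USE_STD_THREAD are hypothetical names, and only the nanosleep variant of the POSIX fallback is shown.

```cpp
// Hypothetical sketch of a millisecond sleep dispatch; not the ncnn source.
#if (__cplusplus >= 201103L) || (defined(_MSVC_LANG) && _MSVC_LANG >= 201103L)
#define SKETCH_USE_STD_THREAD 1
#include <chrono>
#include <thread>
#elif defined(_WIN32)
#include <windows.h>
#else
#include <time.h>
#endif

// Blocks the calling thread for at least `milliseconds` (default 1000).
void sketch_sleep_ms(unsigned long long int milliseconds = 1000)
{
#if defined(SKETCH_USE_STD_THREAD)
    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
#elif defined(_WIN32)
    Sleep((DWORD)milliseconds);
#else
    // Split into whole seconds and the remaining nanoseconds.
    struct timespec ts;
    ts.tv_sec = (time_t)(milliseconds / 1000);
    ts.tv_nsec = (long)((milliseconds % 1000) * 1000000);
    nanosleep(&ts, NULL);
#endif
}
```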

The benchmark functions (compiled only when NCNN_BENCHMARK is defined) print per-layer profiling information to stderr:

The basic overload prints layer type, layer name, and elapsed time in milliseconds.
The detailed overload additionally formats the input and output tensor shapes (1D through 4D, with element packing). For convolution-family layers (Convolution, ConvolutionDepthWise, Deconvolution, DeconvolutionDepthWise, and their 3D variants), it also prints kernel size and stride by downcasting the Layer pointer to the specific layer type.
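The shape formatting described for the detailed overload might look roughly like the sketch below. SketchShape and format_shape are hypothetical stand-ins, not ncnn types or functions, and the field order and column widths are assumptions inferred from the sample output in the Usage Examples section rather than taken from the source file.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the tensor fields used here; not ncnn::Mat.
struct SketchShape
{
    int dims;     // number of dimensions, 1 to 4
    int w;        // width
    int h;        // height
    int d;        // depth (used for 4D only)
    int c;        // channels
    int elempack; // elements packed together per slot
};

// Formats a shape as bracketed extents followed by "*elempack".
// Unsupported dims values yield an empty string.
std::string format_shape(const SketchShape& s)
{
    char buf[64] = {0};
    if (s.dims == 1)
        std::snprintf(buf, sizeof(buf), "[%3d *%d]", s.w, s.elempack);
    else if (s.dims == 2)
        std::snprintf(buf, sizeof(buf), "[%3d, %3d *%d]", s.w, s.h, s.elempack);
    else if (s.dims == 3)
        std::snprintf(buf, sizeof(buf), "[%3d, %3d, %3d *%d]", s.w, s.h, s.c, s.elempack);
    else if (s.dims == 4)
        std::snprintf(buf, sizeof(buf), "[%3d, %3d, %3d, %3d *%d]", s.w, s.h, s.d, s.c, s.elempack);
    return std::string(buf);
}
```

Fixed-width fields keep the shape columns aligned when many layers are printed one per line.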

Usage

Use get_current_time() to measure inference latency in any ncnn application. To get per-layer profiling output on stderr during inference, build ncnn with NCNN_BENCHMARK enabled; the per-layer timings show which layers dominate total inference time.

Code Reference

Source Location

Repository: Tencent_Ncnn
File: src/benchmark.cpp
File: src/benchmark.h

Signature

namespace ncnn {

// Get current timestamp in milliseconds
NCNN_EXPORT double get_current_time();

// Sleep for specified milliseconds (default 1000)
NCNN_EXPORT void sleep(unsigned long long int milliseconds = 1000);

#if NCNN_BENCHMARK
// Basic per-layer timing output
NCNN_EXPORT void benchmark(const Layer* layer, double start, double end);

// Detailed per-layer timing with shape info
NCNN_EXPORT void benchmark(const Layer* layer, const Mat& bottom_blob,
    Mat& top_blob, double start, double end);
#endif

} // namespace ncnn

Import

#include "benchmark.h"

I/O Contract

Inputs

Name	Type	Applies To	Description
milliseconds	unsigned long long int	sleep	Duration to sleep in milliseconds (default 1000)
layer	const Layer*	benchmark (both overloads)	Pointer to the layer being profiled
start	double	benchmark (both overloads)	Timestamp (from get_current_time) before layer execution
end	double	benchmark (both overloads)	Timestamp (from get_current_time) after layer execution
bottom_blob	const Mat&	benchmark (detailed overload)	Input tensor to the layer
top_blob	Mat&	benchmark (detailed overload)	Output tensor from the layer

Outputs

Name	Type	Description
return (get_current_time)	double	Current timestamp in milliseconds
stderr output (benchmark)	text	Formatted profiling line: layer type, name, and elapsed time; the detailed overload adds tensor shapes and, for convolution-family layers, kernel and stride parameters

Usage Examples

Measuring Inference Latency

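#include <stdio.h> // fprintf

#include "benchmark.h"
#include "net.h"

ncnn::Net net;
// ... load model ...

ncnn::Mat input_mat; // ... fill with preprocessed input data ...
ncnn::Mat output_mat;

// Timestamps are in milliseconds, so end - start is the latency in ms.
double start = ncnn::get_current_time();

ncnn::Extractor ex = net.create_extractor();
ex.input("data", input_mat);
ex.extract("output", output_mat);

double end = ncnn::get_current_time();
fprintf(stderr, "Inference time: %.2f ms\n", end - start);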
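A single measurement can be noisy, so one common pattern is to run a few untimed warm-up iterations and then report the minimum, maximum, and average over several timed runs. The sketch below illustrates that pattern under stated assumptions: now_ms is a stand-in for ncnn::get_current_time() so the code compiles without ncnn, and TimingStats and time_repeated are hypothetical helpers, not part of the ncnn API.

```cpp
#include <chrono>

// Stand-in for ncnn::get_current_time() so this sketch builds without ncnn.
static double now_ms()
{
    using namespace std::chrono;
    return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count() / 1000.0;
}

// Hypothetical result type: all values are in milliseconds.
struct TimingStats
{
    double min_ms;
    double max_ms;
    double avg_ms;
};

// Calls fn `warmup` times untimed, then `loops` times timed (loops must be > 0).
template<typename Fn>
TimingStats time_repeated(Fn fn, int warmup, int loops)
{
    for (int i = 0; i < warmup; i++)
        fn();

    TimingStats s = {1e30, 0.0, 0.0};
    for (int i = 0; i < loops; i++)
    {
        double start = now_ms();
        fn();
        double elapsed = now_ms() - start;

        if (elapsed < s.min_ms) s.min_ms = elapsed;
        if (elapsed > s.max_ms) s.max_ms = elapsed;
        s.avg_ms += elapsed; // accumulate the sum, divided below
    }
    s.avg_ms /= loops;
    return s;
}
```

In an ncnn application, fn would wrap the create_extractor, input, and extract calls, and now_ms would be replaced by ncnn::get_current_time().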

Per-Layer Benchmarking (Automatic with NCNN_BENCHMARK)

// When built with NCNN_BENCHMARK=ON, ncnn automatically calls
// benchmark() for each layer during inference, producing output like:
//
// Convolution       conv1              0.42ms    | [224, 224,   3 *1] -> [112, 112,  64 *1]  kernel: 7 x 7  stride: 2 x 2
// ReLU              relu1              0.01ms    | [112, 112,  64 *1] -> [112, 112,  64 *1]
// Pooling           pool1              0.15ms    | [112, 112,  64 *1] -> [ 56,  56,  64 *1]
