Implementation:Tencent Ncnn Benchmark Timer

From Leeroopedia


Knowledge Sources
Domains: Performance Profiling, Cross Platform
Last Updated: 2026-02-09 19:00 GMT

Overview

Implements cross-platform high-resolution timing utilities and per-layer benchmarking output for profiling ncnn inference performance.

Description

The benchmark module consists of a header (benchmark.h, 28 lines) and an implementation (benchmark.cpp, 216 lines) within the ncnn namespace.

get_current_time() returns a timestamp in milliseconds as a double. One of three back ends is selected at compile time:

  • C++11: std::chrono::high_resolution_clock when available (converting microseconds to milliseconds)
  • Windows: QueryPerformanceCounter / QueryPerformanceFrequency for high-resolution timing without C++11
  • POSIX: gettimeofday as a fallback on Unix systems

The C++11 path is selected via a compile-time check: __cplusplus >= 201103L or _MSVC_LANG >= 201103L, excluding RISC-V and SIMPLESTL builds.
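The three-way selection described above can be sketched as follows. This is a minimal illustration, not the ncnn source: the function name current_time_ms and the macro SKETCH_HAS_CXX11 are hypothetical, and the sketch omits the RISC-V and SIMPLESTL exclusions that the real condition applies.

```cpp
// Hypothetical macro: stands in for ncnn's compile-time C++11 check.
#if __cplusplus >= 201103L || (defined(_MSVC_LANG) && _MSVC_LANG >= 201103L)
#define SKETCH_HAS_CXX11 1
#else
#define SKETCH_HAS_CXX11 0
#endif

#if SKETCH_HAS_CXX11
#include <chrono>
#elif defined(_WIN32)
#include <windows.h>
#else
#include <stddef.h>
#include <sys/time.h>
#endif

// Hypothetical stand-in for get_current_time(): returns milliseconds.
double current_time_ms()
{
#if SKETCH_HAS_CXX11
    // C++11 path: read the clock in microseconds, then scale to milliseconds.
    std::chrono::high_resolution_clock::time_point now = std::chrono::high_resolution_clock::now();
    std::chrono::microseconds usec = std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch());
    return usec.count() / 1000.0;
#elif defined(_WIN32)
    // Windows path: counter ticks divided by ticks per second, scaled to ms.
    LARGE_INTEGER freq;
    LARGE_INTEGER counter;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&counter);
    return counter.QuadPart * 1000.0 / freq.QuadPart;
#else
    // POSIX fallback: seconds plus microseconds, both converted to ms.
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
#endif
}
```

Only differences between two timestamps are meaningful, because the epoch of the underlying clock differs between back ends.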

sleep() similarly dispatches to std::this_thread::sleep_for, Windows Sleep, POSIX usleep, or nanosleep depending on platform and C++11 availability. The default sleep duration is 1000 milliseconds.
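A comparable sketch for the sleep dispatch is shown below. The name sleep_ms and the macro SKETCH_SLEEP_CXX11 are hypothetical, the ordering of the branches is an assumption, and the nanosleep branch mentioned above is left out for brevity.

```cpp
#if __cplusplus >= 201103L || (defined(_MSVC_LANG) && _MSVC_LANG >= 201103L)
#include <chrono>
#include <thread>
#define SKETCH_SLEEP_CXX11 1 // hypothetical macro, not an ncnn name
#elif defined(_WIN32)
#include <windows.h>
#else
#include <unistd.h>
#endif

// Hypothetical stand-in for ncnn::sleep(): blocks the calling thread.
void sleep_ms(unsigned long long int milliseconds = 1000)
{
#if defined(SKETCH_SLEEP_CXX11)
    // C++11 path: the standard library handles the platform details.
    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
#elif defined(_WIN32)
    // Windows path: Sleep takes a duration in milliseconds.
    Sleep((DWORD)milliseconds);
#else
    // POSIX path: usleep takes microseconds.
    usleep(milliseconds * 1000);
#endif
}
```

Calling sleep_ms() with no argument blocks for one second, matching the default described above.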

The benchmark functions (compiled only when NCNN_BENCHMARK is enabled, as guarded by #if NCNN_BENCHMARK) print per-layer profiling information to stderr:

  • The basic overload prints layer type, layer name, and elapsed time in milliseconds.
  • The detailed overload additionally formats input and output tensor shapes (supporting 1D through 4D with element packing) and, for convolution-family layers (Convolution, ConvolutionDepthWise, Deconvolution, DeconvolutionDepthWise, and their 3D variants), prints kernel size and stride parameters by downcasting the Layer pointer to the specific layer type.
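To make the shape formatting concrete, here is a small sketch of how a 1D to 4D shape with element packing might be rendered. The ShapeInfo struct and format_shape function are hypothetical stand-ins for the relevant ncnn::Mat fields, and the format strings are an assumption inferred from the sample output on this page, not copied from benchmark.cpp.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the ncnn::Mat fields the sketch needs.
struct ShapeInfo
{
    int dims; // number of dimensions, 1 to 4
    int w;
    int h;
    int d;
    int c;
    int elempack; // elements packed per lane
};

// Render a shape as "[w, h, c *elempack]" style text, by dimensionality.
std::string format_shape(const ShapeInfo& s)
{
    char buf[64];
    switch (s.dims)
    {
    case 1:
        std::snprintf(buf, sizeof(buf), "[%3d *%d]", s.w, s.elempack);
        break;
    case 2:
        std::snprintf(buf, sizeof(buf), "[%3d, %3d *%d]", s.w, s.h, s.elempack);
        break;
    case 3:
        std::snprintf(buf, sizeof(buf), "[%3d, %3d, %3d *%d]", s.w, s.h, s.c, s.elempack);
        break;
    case 4:
        std::snprintf(buf, sizeof(buf), "[%3d, %3d, %3d, %3d *%d]", s.w, s.h, s.d, s.c, s.elempack);
        break;
    default:
        // Empty or unsupported shapes get a placeholder.
        std::snprintf(buf, sizeof(buf), "[]");
        break;
    }
    return std::string(buf);
}
```

The fixed-width fields keep the shape columns aligned when many layers are printed one after another.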

Usage

Use get_current_time() to measure inference latency in any ncnn application: take one timestamp before and one after the code of interest, then subtract. Enable NCNN_BENCHMARK at build time to have ncnn print per-layer profiling output during inference, which helps locate the layers that dominate total runtime.
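A single measurement is often noisy, so a common pattern is to discard a few warm-up runs and report minimum, maximum, and average over several timed runs. The helper below is a hypothetical sketch of that pattern, not part of ncnn: TimingStats and time_runs are illustrative names, and the clock is passed in as a parameter so that any millisecond timestamp function can be used.

```cpp
#include <algorithm>
#include <functional>
#include <limits>

// Hypothetical result type for the sketch.
struct TimingStats
{
    double min_ms;
    double max_ms;
    double avg_ms;
};

// now_ms: any function returning a timestamp in milliseconds.
// work:   the code to measure, for example one inference pass.
TimingStats time_runs(const std::function<double()>& now_ms,
                      const std::function<void()>& work,
                      int warmup, int runs)
{
    TimingStats stats = {0.0, 0.0, 0.0};
    if (runs <= 0)
        return stats;

    // Warm-up runs are executed but not timed.
    for (int i = 0; i < warmup; i++)
        work();

    stats.min_ms = std::numeric_limits<double>::max();
    double total = 0.0;
    for (int i = 0; i < runs; i++)
    {
        double start = now_ms();
        work();
        double elapsed = now_ms() - start;

        stats.min_ms = std::min(stats.min_ms, elapsed);
        stats.max_ms = std::max(stats.max_ms, elapsed);
        total += elapsed;
    }
    stats.avg_ms = total / runs;
    return stats;
}
```

In an ncnn application, ncnn::get_current_time can be passed as now_ms and a lambda that creates an extractor and calls extract can be passed as work.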

Code Reference

Source Location

Signature

namespace ncnn {

// Get current timestamp in milliseconds
NCNN_EXPORT double get_current_time();

// Sleep for specified milliseconds (default 1000)
NCNN_EXPORT void sleep(unsigned long long int milliseconds = 1000);

#if NCNN_BENCHMARK
// Basic per-layer timing output
NCNN_EXPORT void benchmark(const Layer* layer, double start, double end);

// Detailed per-layer timing with shape info
NCNN_EXPORT void benchmark(const Layer* layer, const Mat& bottom_blob,
    Mat& top_blob, double start, double end);
#endif

} // namespace ncnn

Import

#include "benchmark.h"

I/O Contract

Inputs

Name | Type | Required | Description
milliseconds | unsigned long long int | For sleep | Duration to sleep in milliseconds (default 1000)
layer | const Layer* | For benchmark | Pointer to the layer being profiled
start | double | For benchmark | Timestamp (from get_current_time) before layer execution
end | double | For benchmark | Timestamp (from get_current_time) after layer execution
bottom_blob | const Mat& | For detailed benchmark | Input tensor to the layer
top_blob | Mat& | For detailed benchmark | Output tensor from the layer

Outputs

Name | Type | Description
return (get_current_time) | double | Current timestamp in milliseconds
stderr output (benchmark) | text | Formatted profiling line: layer type, name, time, shapes, and conv parameters

Usage Examples

Measuring Inference Latency

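#include <stdio.h>

#include "benchmark.h"
#include "net.h"

int main()
{
    ncnn::Net net;
    // ... load model ...

    ncnn::Mat input_mat; // fill with preprocessed input before timing
    ncnn::Mat output_mat;

    // Timestamps are in milliseconds, so the difference is the latency in ms.
    double start = ncnn::get_current_time();

    ncnn::Extractor ex = net.create_extractor();
    ex.input("data", input_mat);
    ex.extract("output", output_mat);

    double end = ncnn::get_current_time();
    fprintf(stderr, "Inference time: %.2f ms\n", end - start);

    return 0;
}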

Per-Layer Benchmarking (Automatic with NCNN_BENCHMARK)

// When built with NCNN_BENCHMARK=ON, ncnn automatically calls
// benchmark() for each layer during inference, producing output like:
//
// Convolution       conv1              0.42ms    | [224, 224,   3 *1] -> [112, 112,  64 *1]  kernel: 7 x 7  stride: 2 x 2
// ReLU              relu1              0.01ms    | [112, 112,  64 *1] -> [112, 112,  64 *1]
// Pooling           pool1              0.15ms    | [112, 112,  64 *1] -> [ 56,  56,  64 *1]
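Once the per-layer output has been captured from stderr, it can be post-processed to rank layers by cost. The parser below is a hypothetical sketch: LayerTiming and parse_benchmark_line are illustrative names, and it assumes the line layout shown in the sample above, namely whitespace-separated type and name followed by a time token with an "ms" suffix and no spaces inside the layer name.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical record for one parsed profiling line.
struct LayerTiming
{
    std::string type;
    std::string name;
    double ms;
};

// Returns true and fills `out` if the line starts with
// "<type> <name> <number>ms"; returns false otherwise.
bool parse_benchmark_line(const std::string& line, LayerTiming& out)
{
    std::istringstream iss(line);
    std::string time_token;
    if (!(iss >> out.type >> out.name >> time_token))
        return false;

    // strtod stops at the first non-numeric character, which should be "ms".
    char* end = 0;
    out.ms = std::strtod(time_token.c_str(), &end);
    return end != time_token.c_str() && std::string(end) == "ms";
}
```

Summing the parsed ms values per layer type, or sorting the records by ms, gives a quick view of where inference time is spent.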
