
Heuristic:LaurentMazare Tch rs CuDNN Benchmark Mode

From Leeroopedia



Knowledge Sources
Domains Optimization, Deep_Learning
Last Updated 2026-02-08 13:00 GMT

Overview

Enabling cuDNN benchmark mode causes cuDNN to auto-tune convolution algorithms during the first forward passes, yielding significant performance improvements on repeated operations with consistent input sizes.

Description

cuDNN provides multiple algorithm implementations for convolutions and other operations. When benchmark mode is enabled via `Cuda::cudnn_set_benchmark(true)`, cuDNN profiles different algorithms during the first network runs and selects the fastest one for the given input dimensions. On subsequent runs, the pre-selected optimal algorithm is reused, resulting in measurably faster execution. This optimization is most effective when input tensor shapes remain constant across iterations (i.e., fixed batch size and image dimensions).

Usage

Use this heuristic when training or running inference with convolutional networks on CUDA GPUs where input dimensions are consistent across batches. It is especially beneficial for image classification models (ResNet, VGG, EfficientNet, etc.) that process fixed 224x224 inputs. Do not enable it if input sizes vary significantly between iterations: cuDNN re-profiles whenever it encounters a new shape, so the tuning overhead is paid repeatedly instead of once.

The Insight (Rule of Thumb)

  • Action: Call `tch::Cuda::cudnn_set_benchmark(true)` before the training/inference loop.
  • Value: Boolean flag; `true` enables, `false` disables.
  • Trade-off: First few iterations are slower (profiling overhead), but all subsequent iterations are faster. Memory usage may slightly increase due to algorithm selection favoring speed over memory.
  • Compatibility: Only effective when CUDA and cuDNN are available. No-op on CPU or MPS.

Reasoning

cuDNN implements multiple algorithms for each operation (e.g., FFT-based, Winograd, implicit GEMM for convolutions). The optimal algorithm depends on tensor dimensions, GPU architecture, and available memory. Benchmark mode empirically tests each algorithm and caches the winner. For training loops that execute thousands of iterations with identical shapes, the one-time profiling cost is amortized to near zero. PyTorch documentation recommends this for fixed-size inputs.
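The amortization argument can be made concrete with a back-of-envelope sketch. The per-iteration times and the profiling cost below are hypothetical numbers chosen for illustration, not measurements from the source.

```rust
// Back-of-envelope amortization: a one-time profiling cost is repaid by a
// faster per-step time across many fixed-shape iterations.
fn total_ms(profile_ms: f64, step_ms: f64, iters: u32) -> f64 {
    profile_ms + step_ms * f64::from(iters)
}

fn main() {
    // Benchmark mode off: no profiling, slower default algorithm per step.
    let baseline = total_ms(0.0, 10.0, 10_000);
    // Benchmark mode on: one-time profiling, faster tuned algorithm per step.
    let tuned = total_ms(500.0, 8.0, 10_000);
    // The 500 ms profiling cost is dwarfed by the time saved over the loop.
    assert!(tuned < baseline);
    println!("saved {:.1} s", (baseline - tuned) / 1000.0);
}
```

With shapes that change every iteration the profiling term would recur on each step, which is exactly why the heuristic is restricted to fixed-size inputs.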

Code Evidence

cuDNN benchmark API from `src/wrappers/device.rs:74-82`:

/// Sets cudnn benchmark mode.
///
/// When set, cudnn will try to optimize the algorithms during
/// the first network runs and then use the optimized architecture
/// in the following runs. This can result in significant performance
/// improvements.
pub fn cudnn_set_benchmark(b: bool) {
    unsafe_torch!(torch_sys::cuda::atc_set_benchmark_cudnn(i32::from(b)))
}

FFI declaration from `torch-sys/src/cuda.rs:28-29`:

/// Sets CUDNN benchmark mode.
pub fn atc_set_benchmark_cudnn(b: c_int);
