
Principle:Alibaba MNN Winograd Kernel Codegen

From Leeroopedia


| Property | Value |
|---|---|
| Page Type | Principle |
| Domains | Deep_Learning, Convolution, GPU_Computing |
| Date | 2026-02-10 |

Overview

Winograd Kernel Codegen describes the theoretical foundation and design rationale behind the automated generation of GPU kernel source code for Winograd-based fast convolution. In the MNN framework, dedicated code generation tools produce target-specific GPU kernels (OpenCL and GLSL) that implement the Winograd transform, allowing the inference engine to exploit the algorithm's reduced multiplication count on GPU hardware.

The core insight is that Winograd transform matrices are determined at compile time by the choice of tile size and kernel size. By generating GPU kernel source code with these matrices embedded as literal constants, the approach eliminates runtime matrix construction, enables the GPU compiler to perform constant folding and strength reduction, and produces kernels tailored to each specific F(m, r) configuration.

Knowledge Sources

| Source | Reference | Description |
|---|---|---|
| Primary Paper | Lavin & Gray, 2015. Fast Algorithms for Convolutional Neural Networks. arXiv:1509.09308 | Introduces the application of Winograd minimal filtering to deep-learning convolutions, demonstrating significant speedups for small-kernel convolutions. |
| Classical Reference | Winograd, S. Arithmetic Complexity of Computations. SIAM, 1980. | Original formulation of the minimal filtering algorithms that form the mathematical basis. |

Theoretical Basis

Winograd Minimal Filtering

The Winograd minimal filtering algorithm computes m outputs of a convolution with an r-tap filter using only

μ(F(m, r)) = m + r - 1 multiplications

compared to the m * r multiplications required by direct convolution. For the commonly used F(2, 3) configuration, this reduces 6 multiplications to 4 — a 1.5x reduction. For F(4, 3), the reduction is from 12 to 6 — a 2x improvement.

Transform Equations

The 1D Winograd convolution of input signal d with filter g is computed as:

Y = A^T [ (G * g) ⊙ (B^T * d) ]

where:

  • G is the filter transform matrix — transforms the convolution filter into the Winograd domain
  • B^T is the input (source) transform matrix — transforms input tiles via B^T * d * B (for 2D)
  • A^T is the output (destination) transform matrix — reconstructs spatial output via A^T * M * A (for 2D)
  • ⊙ denotes element-wise (Hadamard) multiplication
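The 1D pipeline can be made concrete with the standard F(2, 3) matrices (Cook-Toom construction with interpolation points 0, 1, -1, and infinity). The sketch below is an illustration in plain Python, not MNN's generator code; the generated GPU kernels instead hard-code these coefficients into unrolled arithmetic.

```python
# Illustrative 1D Winograd F(2, 3): 2 outputs, 3-tap filter, 4-sample tile.
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Standard F(2, 3) transform matrices.
BT = [[1,  0, -1,  0],
      [0,  1,  1,  0],
      [0, -1,  1,  0],
      [0,  1,  0, -1]]          # input (source) transform B^T
G  = [[1.0,  0.0, 0.0],
      [0.5,  0.5, 0.5],
      [0.5, -0.5, 0.5],
      [0.0,  0.0, 1.0]]         # filter transform G
AT = [[1, 1,  1,  0],
      [0, 1, -1, -1]]           # output (destination) transform A^T

def winograd_f23(d, g):
    """Two outputs of correlating a 4-sample input d with a 3-tap filter g."""
    U = matvec(BT, d)                      # B^T d
    V = matvec(G, g)                       # G g
    M = [u * v for u, v in zip(U, V)]      # Hadamard product: 4 multiplications
    return matvec(AT, M)                   # A^T M

# Direct correlation needs 2 * 3 = 6 multiplications; Winograd uses 4.
print(winograd_f23([1, 2, 3, 4], [1, 1, 1]))  # -> [6.0, 9.0]
```

Note that, following Lavin & Gray, the algorithm computes correlation (no filter flip), which is the convention used by deep-learning convolutions.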

For 2D convolution, the transforms become matrix-matrix operations:

| Transform | Formula | Purpose |
|---|---|---|
| Source (Input) Transform | U = B^T d B | Transform each input tile into the Winograd domain |
| Element-wise Product | M = U ⊙ V | Multiply the transformed input with the pre-transformed filter (V = G g G^T) |
| Destination (Output) Transform | Y = A^T M A | Reconstruct the spatial output from the Winograd-domain products |
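The three 2D steps can be sketched end-to-end for F(2, 3) with the same standard matrices as in the 1D case. This is a plain-Python illustration of the mathematics, not the generated kernel code, which unrolls these products with literal coefficients.

```python
# Illustrative 2D Winograd F(2, 3): U = B^T d B, V = G g G^T, M = U (.) V,
# Y = A^T M A, producing a 2x2 output tile from a 4x4 input tile.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G  = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]

def winograd_f23_2d(d, g):
    """2x2 output of correlating a 4x4 input tile d with a 3x3 filter g."""
    U = matmul(matmul(BT, d), transpose(BT))   # source transform: B^T d B
    V = matmul(matmul(G, g), transpose(G))     # filter transform: G g G^T
    M = [[u * v for u, v in zip(ur, vr)]       # element-wise product:
         for ur, vr in zip(U, V)]              # 16 multiplications
    return matmul(matmul(AT, M), transpose(AT))  # destination: A^T M A

d = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
g = [[1, 1, 1]] * 3
print(winograd_f23_2d(d, g))  # each output equals the sum of one 3x3 window of d
```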

Why Code Generation?

The transform matrices B, G, and A are fully determined by the parameters m, r, and the choice of interpolation points. This makes them ideal candidates for compile-time embedding:

  1. Constant propagation — GPU compilers can fold multiplications by known constants (especially 0, 1, -1) into additions or eliminations.
  2. Dead code elimination — Transform matrix entries that are zero produce multiply-by-zero terms that the compiler can remove entirely.
  3. Configuration-specific optimization — Each F(m, r) variant gets a dedicated kernel with no branching or runtime parameterization overhead.
  4. No runtime matrix storage — Transform matrices do not need to be allocated in GPU memory or passed as kernel arguments.
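The first two points can be pushed even earlier than the GPU compiler: a generator can fold the constants itself while emitting source. The sketch below is a hypothetical, simplified emitter (the function name, `FLOAT` type, and variable naming are assumptions for illustration); it drops multiply-by-zero terms and turns multiplies by +/-1 into additions and subtractions, in the spirit of what MNN's generator tools produce.

```python
# Hypothetical codegen sketch: emit an unrolled source transform for F(2, 3)
# with the B^T coefficients baked in as literals, folding 0 and +/-1 terms.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]

def emit_source_transform(bt, var_in="d", var_out="u"):
    """Emit one assignment per transformed value, folding trivial constants."""
    lines = []
    for i, row in enumerate(bt):
        expr = ""
        for j, c in enumerate(row):
            if c == 0:
                continue                       # multiply-by-zero: drop the term
            if c == 1:
                term = f"{var_in}{j}"          # multiply-by-one: bare operand
            elif c == -1:
                term = f"-{var_in}{j}"         # multiply-by-minus-one: negate
            else:
                term = f"{c}f*{var_in}{j}"     # general coefficient as literal
            if expr and not term.startswith("-"):
                expr += " + " + term
            elif expr:
                expr += " - " + term[1:]
            else:
                expr = term
        lines.append(f"FLOAT {var_out}{i} = {expr};")
    return lines

for line in emit_source_transform(BT):
    print(line)
# FLOAT u0 = d0 - d2;
# FLOAT u1 = d1 + d2;
# FLOAT u2 = -d1 + d2;
# FLOAT u3 = d1 - d3;
```

For F(2, 3) every B^T entry is 0 or +/-1, so the emitted transform contains no multiplications at all, only additions and subtractions.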

Supported Configurations

The code generation tools support arbitrary F(m, r) configurations via command-line parameters. Commonly generated configurations include:

| Configuration | Input Tile Size | Multiplications (Winograd) | Multiplications (Direct) | Speedup |
|---|---|---|---|---|
| F(2, 3) | 4x4 | 16 | 36 | 2.25x |
| F(4, 3) | 6x6 | 36 | 144 | 4.0x |
| F(6, 3) | 8x8 | 64 | 324 | 5.06x |

Note: 2D multiplication counts are the square of the 1D counts. Speedup ratios reflect the theoretical arithmetic advantage before considering transform overhead.
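The table's arithmetic follows directly from the F(m, r) definition and can be reproduced with a few lines (an illustration; the helper name is an assumption):

```python
# 2D tile size and multiplication counts for F(m, r) with square tiles/kernels.
def winograd_stats(m, r):
    tile = m + r - 1               # 1D input tile size
    wino = tile * tile             # 2D Winograd multiplications per output tile
    direct = (m * r) ** 2          # 2D direct multiplications for m x m outputs
    return tile, wino, direct, direct / wino

for m in (2, 4, 6):
    tile, wino, direct, speedup = winograd_stats(m, 3)
    print(f"F({m}, 3): tile {tile}x{tile}, {wino} vs {direct} mults, {speedup:.2f}x")
# F(2, 3): tile 4x4, 16 vs 36 mults, 2.25x
# F(4, 3): tile 6x6, 36 vs 144 mults, 4.00x
# F(6, 3): tile 8x8, 64 vs 324 mults, 5.06x
```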

GPU Target Considerations

The same Winograd transform mathematics must be expressed in different GPU shading languages, each with distinct memory models and execution semantics:

| Aspect | OpenCL Target | GLSL/Vulkan Target |
|---|---|---|
| File extension | .cl | .comp |
| Memory qualifiers | __global, __local | SSBO with layout(binding=...) |
| Thread model | Work-items in an NDRange | Invocations in workgroups |
| Floating-point | IEEE 754 with relaxed-math options | Implementation-defined precision |

Both targets embed the same precomputed transform coefficients; only the surrounding kernel scaffolding and memory access patterns differ.
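One way to picture this split is a generator that keeps the transform arithmetic as a shared body and swaps only the per-target scaffolding. The templates below are hypothetical and heavily simplified (kernel names, bindings, and signatures are assumptions), but they show the shared-coefficients / target-specific-wrapper structure.

```python
# Hypothetical illustration: identical baked-in transform math wrapped in
# OpenCL vs. GLSL/Vulkan scaffolding. Only the wrapper differs per target.
BODY = "u[0] = d[0] - d[2];"       # shared F(2, 3) coefficient arithmetic

TEMPLATES = {
    "opencl": ("__kernel void winograd_src(__global float* d, "
               "__global float* u) {{\n"
               "    {body}\n"
               "}}"),
    "glsl":   ("layout(binding = 0) buffer D {{ float d[]; }};\n"
               "layout(binding = 1) buffer U {{ float u[]; }};\n"
               "void main() {{\n"
               "    {body}\n"
               "}}"),
}

def emit(target):
    """Wrap the shared transform body in target-specific kernel scaffolding."""
    return TEMPLATES[target].format(body=BODY)

print(emit("opencl"))
print(emit("glsl"))
```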

Implemented By

The following implementations realize this principle in MNN:

| Implementation | Description |
|---|---|
| Alibaba_MNN_WinogradGenerateCL | Command-line tool that generates OpenCL kernel source files (.cl) containing the Winograd source and destination transforms. |
| Alibaba_MNN_WinogradGenerateGLSL | Command-line tool that generates Vulkan GLSL compute-shader source files (.comp) containing the Winograd source and destination transforms. |

Trade-offs and Limitations

  • Numerical precision — Larger tile sizes (higher m) increase the dynamic range of transform coefficients, which can cause floating-point precision loss, particularly in half-precision (FP16) execution.
  • Code size — Each F(m, r) configuration produces a separate kernel; supporting many configurations increases binary size.
  • Kernel size restriction — Winograd is most effective for small kernels (3x3, 5x5). For larger kernels, the transform overhead can exceed the multiplication savings.
  • Memory overhead — The transformed input tiles are larger than the original tiles (tile size = m + r - 1), increasing intermediate memory requirements.
