Principle:Alibaba MNN Winograd Kernel Codegen
| Property | Value |
|---|---|
| Page Type | Principle |
| Domains | Deep_Learning, Convolution, GPU_Computing |
| Date | 2026-02-10 |
Overview
Winograd Kernel Codegen describes the theoretical foundation and design rationale behind the automated generation of GPU kernel source code for Winograd-based fast convolution. In the MNN framework, dedicated code generation tools produce target-specific GPU kernels (OpenCL and GLSL) that implement the Winograd transform, allowing the inference engine to exploit the algorithm's reduced multiplication count on GPU hardware.
The core insight is that Winograd transform matrices are determined at compile time by the choice of tile size and kernel size. By generating GPU kernel source code with these matrices embedded as literal constants, the approach eliminates runtime matrix construction, enables the GPU compiler to perform constant folding and strength reduction, and produces kernels tailored to each specific F(m, r) configuration.
Knowledge Sources
| Source | Reference | Description |
|---|---|---|
| Primary Paper | Lavin & Gray, 2015. Fast Algorithms for Convolutional Neural Networks. arXiv:1509.09308 | Introduces the application of Winograd minimal filtering to deep learning convolutions, demonstrating significant speedups for small-kernel convolutions. |
| Classical Reference | Winograd, S. Arithmetic Complexity of Computations. SIAM, 1980. | Original formulation of minimal filtering algorithms that form the mathematical basis. |
Theoretical Basis
Winograd Minimal Filtering
The Winograd minimal filtering algorithm F(m, r) computes m outputs of an r-tap filter using only
μ(F(m, r)) = m + r - 1 multiplications
compared to the m * r multiplications required by direct convolution. For the commonly used F(2, 3) configuration, this reduces 6 multiplications to 4 — a 1.5x reduction. For F(4, 3), the reduction is from 12 to 6 — a 2x improvement.
Transform Equations
The 1D Winograd convolution of input signal d with filter g is computed as:
Y = A^T [ (G * g) ⊙ (B^T * d) ]
where:
- G is the filter transform matrix — transforms the convolution filter into the Winograd domain
- B^T is the input (source) transform matrix — transforms input tiles via B^T * d * B (for 2D)
- A^T is the output (destination) transform matrix — reconstructs spatial output via A^T * M * A (for 2D)
- ⊙ denotes element-wise (Hadamard) multiplication
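As a concrete check of the 1D formula, the following NumPy sketch applies the standard F(2, 3) transform matrices from Lavin & Gray (the variable names B_T, G, A_T mirror the symbols above) and compares the result against direct correlation:

```python
import numpy as np

# Standard F(2, 3) transform matrices (Lavin & Gray, 2015).
# Fixed at compile time once m and r are chosen.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile of m + r - 1 = 4 samples
g = np.array([0.5, 1.0, -1.0])       # 3-tap filter

# Y = A^T [ (G g) ⊙ (B^T d) ] -- only 4 multiplications in the ⊙ step
Y = A_T @ ((G @ g) * (B_T @ d))

# Reference: direct 2-output correlation (m * r = 6 multiplications)
ref = np.array([d[0:3] @ g, d[1:4] @ g])
print(Y, ref)   # both equal [-0.5, 0.0]
```

Note that the multiplication savings appear only in the element-wise product; the transforms themselves cost additions plus multiplications by small constants, which is exactly what the code generation strategy below exploits.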
For 2D convolution, the transforms become matrix-matrix operations:
| Transform | Formula | Purpose |
|---|---|---|
| Source (Input) Transform | U = B^T * d * B | Transform each input tile into the Winograd domain |
| Element-wise Product | M = U ⊙ V | Multiply the transformed input with the pre-transformed filter (V = G * g * G^T) |
| Destination (Output) Transform | Y = A^T * M * A | Reconstruct the spatial output from Winograd-domain products |
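A minimal NumPy sketch of the three 2D transforms for one tile, using the standard F(2, 3) matrices and a random tile/filter pair, checked against direct 3x3 correlation:

```python
import numpy as np

# Standard F(2, 3) matrices; B_T denotes B^T, A_T denotes A^T.
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]])
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))   # one input tile, size m + r - 1 = 4
g = rng.standard_normal((3, 3))   # 3x3 filter

U = B_T @ d @ B_T.T               # source transform: U = B^T d B
V = G @ g @ G.T                   # filter transform: V = G g G^T (precomputed offline)
M = U * V                         # element-wise product in the Winograd domain
Y = A_T @ M @ A_T.T               # destination transform: Y = A^T M A -> 2x2 output

# Direct 2x2 correlation over the same tile for comparison
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)])
assert np.allclose(Y, ref)
```

In practice the filter transform V is computed once per convolution layer, so only the source and destination transforms appear in the generated GPU kernels.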
Why Code Generation?
The transform matrices B, G, and A are fully determined by the parameters m, r, and the choice of interpolation points. This makes them ideal candidates for compile-time embedding:
- Constant propagation — GPU compilers can fold multiplications by known constants (especially 0, 1, -1) into additions or eliminations.
- Dead code elimination — Transform matrix entries that are zero produce multiply-by-zero terms that the compiler can remove entirely.
- Configuration-specific optimization — Each F(m, r) variant gets a dedicated kernel with no branching or runtime parameterization overhead.
- No runtime matrix storage — Transform matrices do not need to be allocated in GPU memory or passed as kernel arguments.
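The embedding idea can be illustrated with a toy generator (a sketch, not MNN's actual tool; the emitted variable names u0..u3 and d0..d3 are hypothetical): it turns the B^T rows for F(2, 3) into literal arithmetic, dropping zero terms and folding the ±1 coefficients at generation time, before the GPU compiler even sees the source.

```python
# Toy codegen sketch: bake the F(2, 3) B^T rows into kernel-style source
# as literal arithmetic, so 0 terms vanish and +-1 terms become adds.
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]

def emit_source_transform(B_T):
    lines = []
    for i, row in enumerate(B_T):
        terms = []
        for j, c in enumerate(row):
            if c == 0:
                continue                      # dead term: dropped at generation time
            elif c == 1:
                terms.append(f"d{j}")
            elif c == -1:
                terms.append(f"-d{j}")
            else:
                terms.append(f"{c}f * d{j}")  # general coefficient as a literal
        lines.append(f"float u{i} = {' + '.join(terms).replace('+ -', '- ')};")
    return "\n".join(lines)

print(emit_source_transform(B_T))
# float u0 = d0 - d2;
# float u1 = d1 + d2;
# float u2 = -d1 + d2;
# float u3 = d1 - d3;
```

The same scheme extends to the destination transform and to other F(m, r) choices, where non-trivial coefficients (e.g. 0.5) are emitted as floating-point literals.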
Supported Configurations
The code generation tools support arbitrary F(m, r) configurations via command-line parameters. Commonly generated configurations include:
| Configuration | Input Tile Size | Multiplications (Winograd) | Multiplications (Direct) | Speedup |
|---|---|---|---|---|
| F(2, 3) | 4x4 | 16 | 36 | 2.25x |
| F(4, 3) | 6x6 | 36 | 144 | 4.0x |
| F(6, 3) | 8x8 | 64 | 324 | 5.06x |
Note: 2D multiplication counts are the square of the 1D counts. Speedup ratios reflect the theoretical arithmetic advantage before considering transform overhead.
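The table's figures follow directly from the counting rule: Winograd performs (m + r - 1)^2 element-wise products per 2D tile, versus m^2 * r^2 multiplications for direct convolution. A short script reproducing the numbers:

```python
# Reproduce the 2D multiplication counts from the table above.
def winograd_speedup(m, r):
    wino = (m + r - 1) ** 2      # element-wise products per tile
    direct = (m * r) ** 2        # m^2 outputs, r^2 taps each
    return wino, direct, direct / wino

for m, r in [(2, 3), (4, 3), (6, 3)]:
    wino, direct, speedup = winograd_speedup(m, r)
    print(f"F({m}, {r}): {wino} vs {direct} -> {speedup:.2f}x")
# F(2, 3): 16 vs 36 -> 2.25x
# F(4, 3): 36 vs 144 -> 4.00x
# F(6, 3): 64 vs 324 -> 5.06x
```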
GPU Target Considerations
The same Winograd transform mathematics must be expressed in different GPU shading languages, each with distinct memory models and execution semantics:
| Aspect | OpenCL Target | GLSL/Vulkan Target |
|---|---|---|
| File extension | .cl | .comp |
| Memory qualifiers | __global, __local | SSBO with layout(binding=...) |
| Thread model | Work-items in NDRange | Invocations in workgroups |
| Floating-point | IEEE 754 with relaxed math options | Implementation-defined precision |
Both targets embed the same precomputed transform coefficients; only the surrounding kernel scaffolding and memory access patterns differ.
Implemented By
The following implementations realize this principle in MNN:
| Implementation | Description |
|---|---|
| Alibaba_MNN_WinogradGenerateCL | Command-line tool that generates OpenCL kernel source files (.cl) containing the Winograd source and destination transforms. |
| Alibaba_MNN_WinogradGenerateGLSL | Command-line tool that generates Vulkan GLSL compute shader source files (.comp) containing the Winograd source and destination transforms. |
Trade-offs and Limitations
- Numerical precision — Larger tile sizes (higher m) increase the dynamic range of transform coefficients, which can cause floating-point precision loss, particularly in half-precision (FP16) execution.
- Code size — Each F(m, r) configuration produces a separate kernel; supporting many configurations increases binary size.
- Kernel size restriction — Winograd is most effective for small kernels (3x3, 5x5). For larger kernels, the transform overhead can exceed the multiplication savings.
- Memory overhead — The transformed input tiles are larger than the original tiles (tile size = m + r - 1), increasing intermediate memory requirements.
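The memory-overhead point can be quantified: each tile stores (m + r - 1)^2 transformed values to produce m^2 outputs, so the per-output expansion factor shrinks as m grows (a sketch, using the same configurations as the table above):

```python
# Per-output intermediate storage: (m + r - 1)^2 transformed values
# are held per tile to produce m^2 spatial outputs.
def tile_expansion(m, r):
    return (m + r - 1) ** 2 / m ** 2

for m in (2, 4, 6):
    print(f"F({m}, 3): {tile_expansion(m, 3):.2f}x transformed values per output")
# F(2, 3): 4.00x transformed values per output
# F(4, 3): 2.25x transformed values per output
# F(6, 3): 1.78x transformed values per output
```

This is one reason larger tiles are attractive despite their precision cost: they amortize both the transform work and the intermediate storage over more outputs.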