Principle:Alibaba MNN Winograd Kernel Codegen
| Property | Value |
|---|---|
| Page Type | Principle |
| Domains | Deep_Learning, Convolution, GPU_Computing |
| Date | 2026-02-10 |
Overview
Winograd Kernel Codegen describes the theoretical foundation and design rationale behind the automated generation of GPU kernel source code for Winograd-based fast convolution. In the MNN framework, dedicated code generation tools produce target-specific GPU kernels (OpenCL and GLSL) that implement the Winograd transform, allowing the inference engine to exploit the algorithm's reduced multiplication count on GPU hardware.
The core insight is that Winograd transform matrices are determined at compile time by the choice of tile size and kernel size. By generating GPU kernel source code with these matrices embedded as literal constants, the approach eliminates runtime matrix construction, enables the GPU compiler to perform constant folding and strength reduction, and produces kernels tailored to each specific F(m, r) configuration.
Knowledge Sources
| Source | Reference | Description |
|---|---|---|
| Primary Paper | Lavin & Gray, 2015. Fast Algorithms for Convolutional Neural Networks. arXiv:1509.09308 | Introduces the application of Winograd minimal filtering to deep learning convolutions, demonstrating significant speedups for small-kernel convolutions. |
| Classical Reference | Winograd, S. Arithmetic Complexity of Computations. SIAM, 1980. | Original formulation of minimal filtering algorithms that form the mathematical basis. |
Theoretical Basis
Winograd Minimal Filtering
The Winograd minimal filtering algorithm F(m, r) computes m outputs of an r-tap filter using only
μ(F(m, r)) = m + r - 1 multiplications
compared to the m * r multiplications required by direct convolution. For the commonly used F(2, 3) configuration, this reduces 6 multiplications to 4 — a 1.5x reduction. For F(4, 3), the reduction is from 12 to 6 — a 2x improvement.
Transform Equations
The 1D Winograd convolution of input signal d with filter g is computed as:
Y = A^T [ (G * g) ⊙ (B^T * d) ]
where:
- G is the filter transform matrix — transforms the convolution filter into the Winograd domain
- B^T is the input (source) transform matrix — transforms input tiles via B^T * d * B (for 2D)
- A^T is the output (destination) transform matrix — reconstructs spatial output via A^T * M * A (for 2D)
- ⊙ denotes element-wise (Hadamard) multiplication
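As a concrete check of the 1D formula, the following NumPy sketch applies the standard F(2, 3) transform matrices from Lavin & Gray (the variable names B_T, G, A_T mirror the symbols above) and compares the result against direct correlation:

```python
import numpy as np

# Standard F(2, 3) transform matrices (Lavin & Gray, 2015).
# Fixed at compile time once m and r are chosen.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile of m + r - 1 = 4 samples
g = np.array([0.5, 1.0, -1.0])       # 3-tap filter

# Y = A^T [ (G g) ⊙ (B^T d) ] -- only 4 multiplications in the ⊙ step
Y = A_T @ ((G @ g) * (B_T @ d))

# Reference: direct 2-output correlation (m * r = 6 multiplications)
ref = np.array([d[0:3] @ g, d[1:4] @ g])
print(Y, ref)   # both equal [-0.5, 0.0]
```

Note that the multiplication savings appear only in the element-wise product; the transforms themselves cost additions plus multiplications by small constants, which is exactly what the code generation strategy below exploits.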
For 2D convolution, the transforms become matrix-matrix operations:
| Transform | Formula | Purpose |
|---|---|---|
| Source (Input) Transform | U = B^T * d * B | Transform each input tile into the Winograd domain |
| Element-wise Product | M = U ⊙ V | Multiply the transformed input with the pre-transformed filter (V = G * g * G^T) |
| Destination (Output) Transform | Y = A^T * M * A | Reconstruct the spatial output from Winograd-domain products |
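A minimal NumPy sketch of the three 2D transforms for one tile, using the standard F(2, 3) matrices and a random tile/filter pair, checked against direct 3x3 correlation:

```python
import numpy as np

# Standard F(2, 3) matrices; B_T denotes B^T, A_T denotes A^T.
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]])
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))   # one input tile, size m + r - 1 = 4
g = rng.standard_normal((3, 3))   # 3x3 filter

U = B_T @ d @ B_T.T               # source transform: U = B^T d B
V = G @ g @ G.T                   # filter transform: V = G g G^T (precomputed offline)
M = U * V                         # element-wise product in the Winograd domain
Y = A_T @ M @ A_T.T               # destination transform: Y = A^T M A -> 2x2 output

# Direct 2x2 correlation over the same tile for comparison
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)])
assert np.allclose(Y, ref)
```

In practice the filter transform V is computed once per convolution layer, so only the source and destination transforms appear in the generated GPU kernels.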
Why Code Generation?
The transform matrices B, G, and A are fully determined by the parameters m, r, and the choice of interpolation points. This makes them ideal candidates for compile-time embedding:
- Constant propagation — GPU compilers can fold multiplications by known constants (especially 0, 1, -1) into additions or eliminations.
- Dead code elimination — Transform matrix entries that are zero produce multiply-by-zero terms that the compiler can remove entirely.
- Configuration-specific optimization — Each F(m, r) variant gets a dedicated kernel with no branching or runtime parameterization overhead.
- No runtime matrix storage — Transform matrices do not need to be allocated in GPU memory or passed as kernel arguments.
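The embedding idea can be illustrated with a toy generator (a sketch, not MNN's actual tool; the emitted variable names u0..u3 and d0..d3 are hypothetical): it turns the B^T rows for F(2, 3) into literal arithmetic, dropping zero terms and folding the ±1 coefficients at generation time, before the GPU compiler even sees the source.

```python
# Toy codegen sketch: bake the F(2, 3) B^T rows into kernel-style source
# as literal arithmetic, so 0 terms vanish and +-1 terms become adds.
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]

def emit_source_transform(B_T):
    lines = []
    for i, row in enumerate(B_T):
        terms = []
        for j, c in enumerate(row):
            if c == 0:
                continue                      # dead term: dropped at generation time
            elif c == 1:
                terms.append(f"d{j}")
            elif c == -1:
                terms.append(f"-d{j}")
            else:
                terms.append(f"{c}f * d{j}")  # general coefficient as a literal
        lines.append(f"float u{i} = {' + '.join(terms).replace('+ -', '- ')};")
    return "\n".join(lines)

print(emit_source_transform(B_T))
# float u0 = d0 - d2;
# float u1 = d1 + d2;
# float u2 = -d1 + d2;
# float u3 = d1 - d3;
```

The same scheme extends to the destination transform and to other F(m, r) choices, where non-trivial coefficients (e.g. 0.5) are emitted as floating-point literals.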
Supported Configurations
The code generation tools support arbitrary F(m, r) configurations via command-line parameters. Commonly generated configurations include:
| Configuration | Input Tile Size | Multiplications (Winograd) | Multiplications (Direct) | Speedup |
|---|---|---|---|---|
| F(2, 3) | 4x4 | 16 | 36 | 2.25x |
| F(4, 3) | 6x6 | 36 | 144 | 4.0x |
| F(6, 3) | 8x8 | 64 | 324 | 5.06x |
Note: 2D multiplication counts are the square of the 1D counts. Speedup ratios reflect the theoretical arithmetic advantage before considering transform overhead.
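The table's figures follow directly from the counting rule: Winograd performs (m + r - 1)^2 element-wise products per 2D tile, versus m^2 * r^2 multiplications for direct convolution. A short script reproducing the numbers:

```python
# Reproduce the 2D multiplication counts from the table above.
def winograd_speedup(m, r):
    wino = (m + r - 1) ** 2      # element-wise products per tile
    direct = (m * r) ** 2        # m^2 outputs, r^2 taps each
    return wino, direct, direct / wino

for m, r in [(2, 3), (4, 3), (6, 3)]:
    wino, direct, speedup = winograd_speedup(m, r)
    print(f"F({m}, {r}): {wino} vs {direct} -> {speedup:.2f}x")
# F(2, 3): 16 vs 36 -> 2.25x
# F(4, 3): 36 vs 144 -> 4.00x
# F(6, 3): 64 vs 324 -> 5.06x
```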
GPU Target Considerations
The same Winograd transform mathematics must be expressed in different GPU shading languages, each with distinct memory models and execution semantics:
| Aspect | OpenCL Target | GLSL/Vulkan Target |
|---|---|---|
| File extension | .cl | .comp |
| Memory qualifiers | __global, __local | SSBO with layout(binding=...) |
| Thread model | Work-items in NDRange | Invocations in workgroups |
| Floating-point | IEEE 754 with relaxed math options | Implementation-defined precision |
Both targets embed the same precomputed transform coefficients; only the surrounding kernel scaffolding and memory access patterns differ.
Implemented By
The following implementations realize this principle in MNN:
| Implementation | Description |
|---|---|
| Alibaba_MNN_WinogradGenerateCL | Command-line tool that generates OpenCL kernel source files (.cl) containing the Winograd source and destination transforms. |
| Alibaba_MNN_WinogradGenerateGLSL | Command-line tool that generates Vulkan GLSL compute shader source files (.comp) containing the Winograd source and destination transforms. |
Trade-offs and Limitations
- Numerical precision — Larger tile sizes (higher m) increase the dynamic range of transform coefficients, which can cause floating-point precision loss, particularly in half-precision (FP16) execution.
- Code size — Each F(m, r) configuration produces a separate kernel; supporting many configurations increases binary size.
- Kernel size restriction — Winograd is most effective for small kernels (3x3, 5x5). For larger kernels, the transform overhead can exceed the multiplication savings.
- Memory overhead — The transformed input tiles are larger than the original tiles (tile size = m + r - 1), increasing intermediate memory requirements.
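The memory-overhead point can be quantified: each tile stores (m + r - 1)^2 transformed values to produce m^2 outputs, so the per-output expansion factor shrinks as m grows (a sketch, using the same configurations as the table above):

```python
# Per-output intermediate storage: (m + r - 1)^2 transformed values
# are held per tile to produce m^2 spatial outputs.
def tile_expansion(m, r):
    return (m + r - 1) ** 2 / m ** 2

for m in (2, 4, 6):
    print(f"F({m}, 3): {tile_expansion(m, 3):.2f}x transformed values per output")
# F(2, 3): 4.00x transformed values per output
# F(4, 3): 2.25x transformed values per output
# F(6, 3): 1.78x transformed values per output
```

This is one reason larger tiles are attractive despite their precision cost: they amortize both the transform work and the intermediate storage over more outputs.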