Heuristic: Kornia CPU/GPU Branching Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Computer_Vision |
| Last Updated | 2026-02-09 15:00 GMT |
Overview
A performance-optimization technique that branches between `torch.einsum` (CPU) and `F.conv2d` (GPU) for linear color transformations, giving optimal throughput on each device.
Description
Kornia implements device-aware branching in performance-critical operations. Empirical benchmarks revealed that `torch.einsum` is faster on CPU for 3x3 linear transformations along the channel dimension, while `F.conv2d` with a 1x1 kernel offers significant speedups on GPU/CUDA. Rather than using a single implementation, Kornia checks `image.device.type` at runtime and dispatches to the optimal code path. This pattern applies to all color space conversions that use the `_apply_linear_transformation` internal utility.
Usage
Apply this heuristic when implementing linear channel-wise operations (color conversions, channel mixing) that need to run efficiently on both CPU and GPU. If you are adding a new color transformation or channel operation, use the `_apply_linear_transformation` utility from `kornia.color.utils` which already implements this branching.
The Insight (Rule of Thumb)
- Action: Check `tensor.device.type` and use `torch.einsum` on CPU, `F.conv2d` on GPU for 3x3 channel transformations.
- Value: CPU path uses `torch.einsum("oi, ...ihw -> ...ohw", kernel, image)`. GPU path reshapes for conv2d with a `(3, 3, 1, 1)` kernel.
- Trade-off: Adds branching complexity but eliminates device-specific performance penalties. On CPU, the einsum output must be made contiguous with `.contiguous()` before returning.
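The rule of thumb above can be sketched as a standalone function. This is a minimal sketch assuming a `(..., 3, H, W)` image and a `(3, 3)` channel-mixing kernel; `apply_channel_transform` is a hypothetical name for illustration, not the actual Kornia utility:

```python
import torch
import torch.nn.functional as F

def apply_channel_transform(image: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    # Hypothetical sketch of the device-aware branching pattern.
    # image: (..., 3, H, W); kernel: (3, 3) channel-mixing matrix.
    if image.device.type == "cpu":
        # CPU path: einsum contracts the channel axis in a single call.
        out = torch.einsum("oi,...ihw->...ohw", kernel, image)
        return out.contiguous()  # einsum output may be non-contiguous
    # GPU path: flatten leading dims and run a 1x1 convolution (cuDNN-optimized).
    shape = image.shape
    flat = image.reshape(-1, 3, shape[-2], shape[-1])
    out = F.conv2d(flat, kernel.view(3, 3, 1, 1))
    return out.reshape(shape)
```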
Reasoning
The performance difference stems from how PyTorch dispatches these operations internally. On CPU, einsum can fuse the operation into a single BLAS call without reshape overhead. On GPU, conv2d leverages cuDNN's optimized kernels for spatial operations, which are faster than the generic einsum dispatch even for 1x1 convolutions. This was validated empirically by the Kornia team. Additionally, integer inputs are cast to float before processing, and kernel dtype is matched to the image dtype to propagate float64 correctly.
Code Evidence
Device-aware branching from `kornia/color/utils.py:46-69`:
```python
# Empirical benchmarks show that einsum is faster on CPU for this specific pattern,
# while conv2d offers significant speedups on GPU/CUDA.
# We branch to ensure optimal performance on both devices.
# BRANCH 1: CPU (Einsum)
if image.device.type == "cpu":
    out = torch.einsum("oi, ...ihw -> ...ohw", kernel_compute, image_compute)
    if bias is not None:
        out = out + bias.view(-1, 1, 1)
    return out.contiguous()
# BRANCH 2: GPU/Accelerators (Conv2d)
else:
    input_flat = image_compute.reshape(-1, 3, input_shape[-2], input_shape[-1])
    weight = kernel_compute.view(3, 3, 1, 1)
    out_flat = F.conv2d(input_flat, weight, bias=bias)
    out = out_flat.reshape(input_shape)
```
Integer input handling from `kornia/color/utils.py:36-43`:
```python
# Handle integer inputs by casting to float safely
if image.is_floating_point():
    image_compute = image
else:
    image_compute = image.float()

# Match kernel dtype to the image (propagates float64 if needed)
kernel_compute = kernel.to(dtype=image_compute.dtype, device=image_compute.device)
```
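The effect of these casting rules can be demonstrated with a small sketch: integer images compute in float32, while float64 images keep double precision because the kernel follows the image's dtype. The `transform` helper below is a hypothetical illustration, not the Kornia function:

```python
import torch

kernel = torch.rand(3, 3)  # float32 by default

def transform(image: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    # Cast integer images to float; match the kernel to the image dtype.
    image_compute = image if image.is_floating_point() else image.float()
    kernel_compute = kernel.to(dtype=image_compute.dtype, device=image_compute.device)
    return torch.einsum("oi,ihw->ohw", kernel_compute, image_compute)

print(transform(torch.randint(0, 256, (3, 4, 4), dtype=torch.uint8), kernel).dtype)  # torch.float32
print(transform(torch.rand(3, 4, 4, dtype=torch.float64), kernel).dtype)             # torch.float64
```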