Heuristic: Kornia CPU/GPU Branching Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Computer_Vision |
| Last Updated | 2026-02-09 15:00 GMT |
Overview
A performance-optimization technique that branches between `torch.einsum` (CPU) and `F.conv2d` (GPU) for linear color transformations, giving optimal throughput on each device.
Description
Kornia implements device-aware branching in performance-critical operations. Empirical benchmarks revealed that `torch.einsum` is faster on CPU for 3x3 linear transformations along the channel dimension, while `F.conv2d` with a 1x1 kernel offers significant speedups on GPU/CUDA. Rather than using a single implementation, Kornia checks `image.device.type` at runtime and dispatches to the optimal code path. This pattern applies to all color space conversions that use the `_apply_linear_transformation` internal utility.
Usage
Apply this heuristic when implementing linear channel-wise operations (color conversions, channel mixing) that need to run efficiently on both CPU and GPU. If you are adding a new color transformation or channel operation, use the `_apply_linear_transformation` utility from `kornia.color.utils` which already implements this branching.
The Insight (Rule of Thumb)
- Action: Check `tensor.device.type` and use `torch.einsum` on CPU, `F.conv2d` on GPU for 3x3 channel transformations.
- Value: CPU path uses `torch.einsum("oi, ...ihw -> ...ohw", kernel, image)`. GPU path reshapes for conv2d with a `(3, 3, 1, 1)` kernel.
- Trade-off: Adds branching complexity but eliminates device-specific performance penalties. On CPU, the einsum output must be made contiguous with `.contiguous()` before returning.
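The rule of thumb above can be sketched as a standalone function. This is a minimal sketch assuming a `(..., 3, H, W)` image and a `(3, 3)` channel-mixing kernel; `apply_channel_transform` is a hypothetical name for illustration, not the actual Kornia utility:

```python
import torch
import torch.nn.functional as F

def apply_channel_transform(image: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    # Hypothetical sketch of the device-aware branching pattern.
    # image: (..., 3, H, W); kernel: (3, 3) channel-mixing matrix.
    if image.device.type == "cpu":
        # CPU path: einsum contracts the channel axis in a single call.
        out = torch.einsum("oi,...ihw->...ohw", kernel, image)
        return out.contiguous()  # einsum output may be non-contiguous
    # GPU path: flatten leading dims and run a 1x1 convolution (cuDNN-optimized).
    shape = image.shape
    flat = image.reshape(-1, 3, shape[-2], shape[-1])
    out = F.conv2d(flat, kernel.view(3, 3, 1, 1))
    return out.reshape(shape)
```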
Reasoning
The performance difference stems from how PyTorch dispatches these operations internally. On CPU, einsum can fuse the operation into a single BLAS call without reshape overhead. On GPU, conv2d leverages cuDNN's optimized kernels for spatial operations, which are faster than the generic einsum dispatch even for 1x1 convolutions. This was validated empirically by the Kornia team. Additionally, integer inputs are cast to float before processing, and kernel dtype is matched to the image dtype to propagate float64 correctly.
Code Evidence
Device-aware branching from `kornia/color/utils.py:46-69`:
```python
# Empirical benchmarks show that einsum is faster on CPU for this specific pattern,
# while conv2d offers significant speedups on GPU/CUDA.
# We branch to ensure optimal performance on both devices.
# BRANCH 1: CPU (Einsum)
if image.device.type == "cpu":
    out = torch.einsum("oi, ...ihw -> ...ohw", kernel_compute, image_compute)
    if bias is not None:
        out = out + bias.view(-1, 1, 1)
    return out.contiguous()
# BRANCH 2: GPU/Accelerators (Conv2d)
else:
    input_flat = image_compute.reshape(-1, 3, input_shape[-2], input_shape[-1])
    weight = kernel_compute.view(3, 3, 1, 1)
    out_flat = F.conv2d(input_flat, weight, bias=bias)
    out = out_flat.reshape(input_shape)
```
Integer input handling from `kornia/color/utils.py:36-43`:
```python
# Handle integer inputs by casting to float safely
if image.is_floating_point():
    image_compute = image
else:
    image_compute = image.float()

# Match kernel dtype to the image (propagates float64 if needed)
kernel_compute = kernel.to(dtype=image_compute.dtype, device=image_compute.device)
```
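The effect of these casting rules can be demonstrated with a small sketch: integer images compute in float32, while float64 images keep double precision because the kernel follows the image's dtype. The `transform` helper below is a hypothetical illustration, not the Kornia function:

```python
import torch

kernel = torch.rand(3, 3)  # float32 by default

def transform(image: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    # Cast integer images to float; match the kernel to the image dtype.
    image_compute = image if image.is_floating_point() else image.float()
    kernel_compute = kernel.to(dtype=image_compute.dtype, device=image_compute.device)
    return torch.einsum("oi,ihw->ohw", kernel_compute, image_compute)

print(transform(torch.randint(0, 256, (3, 4, 4), dtype=torch.uint8), kernel).dtype)  # torch.float32
print(transform(torch.rand(3, 4, 4, dtype=torch.float64), kernel).dtype)             # torch.float64
```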