Principle:Zai org CogVideo Optical Flow Training Loss

From Leeroopedia
Revision as of 17:38, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Zai_org_CogVideo_Optical_Flow_Training_Loss.md)


Knowledge Sources
Domains: Loss_Functions, Optical_Flow, Frame_Interpolation
Last Updated: 2026-02-10 00:00 GMT

Overview

Optical flow training losses provide multi-faceted supervision for flow estimation networks by combining endpoint error, structural matching, edge preservation, and perceptual feature similarity.

Description

Training accurate optical flow and frame interpolation networks requires a diverse set of loss functions, each targeting different aspects of prediction quality. No single loss metric adequately captures all desirable properties, so practical training pipelines combine multiple complementary objectives.

Endpoint Error (EPE) is the standard metric for optical flow accuracy. It measures the Euclidean distance between predicted and ground-truth flow vectors at each pixel. A loss mask limits computation to regions with reliable ground-truth flow, excluding boundaries and occluded areas.

Census Transform Loss captures structural similarity between images in a way that is robust to illumination changes. The census transform describes each pixel's relationship to its local neighborhood: for each neighbor in the surrounding patch, it records whether that neighbor is brighter or darker than the center pixel. Because only intensity differences enter the descriptor, comparing census-transformed images captures local structural patterns without being sensitive to global brightness shifts. The descriptors are compared with a robust Hamming distance normalized as dist / (0.1 + dist), where dist is the squared difference between descriptor entries; this prevents outliers from dominating the loss.

Sobel Edge Loss ensures that predicted frames preserve sharp edges. By applying horizontal and vertical Sobel gradient operators to both prediction and target, then computing L1 differences, this loss penalizes blurred or shifted edges. Edge preservation is critical for visual quality in interpolated frames.

VGG Perceptual Loss compares images in the feature space of a pretrained classification network. Features at different layers capture different semantic levels: early layers respond to textures and edges, deeper layers capture object parts and scene layout. Weighted L1 differences across multiple layers provide a comprehensive perceptual similarity measure.

Usage

Use these loss functions in combination when training optical flow estimation or frame interpolation networks. The typical weighting strategy emphasizes reconstruction losses (Laplacian pyramid, perceptual) while using flow-specific losses (EPE) and structural losses (census, Sobel) as auxiliary supervision signals.
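The weighting strategy above can be sketched as a plain weighted sum of named loss terms. The function name total_loss, the dictionary EXAMPLE_WEIGHTS, and every numeric weight below are hypothetical placeholders chosen only to show reconstruction terms outweighing auxiliary ones; they are not taken from the CogVideo code.

```python
# Hypothetical weights: reconstruction terms dominate, auxiliary terms are small.
# These numbers are illustrative placeholders, not values from any real config.
EXAMPLE_WEIGHTS = {
    "laplacian": 1.0,   # reconstruction
    "perceptual": 1.0,  # reconstruction
    "epe": 0.01,        # flow-specific auxiliary term
    "census": 0.01,     # structural auxiliary term
    "sobel": 0.01,      # structural auxiliary term
}


def total_loss(terms, weights):
    """Weighted sum of scalar loss terms keyed by name.

    Raises KeyError if a term has no matching weight, so a forgotten
    weight is noticed instead of silently defaulting to zero.
    """
    return sum(weights[name] * value for name, value in terms.items())
```

Keeping the weights in one mapping makes it easy to ablate a single term by setting its weight to zero.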

Theoretical Basis

Endpoint Error:

EPE(F, F*) = sqrt(sum_c (F_c - F*_c)^2 + epsilon)

where F is the predicted flow, F* is the ground-truth flow, c indexes flow channels, and epsilon (1e-6) provides numerical stability.
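As a minimal sketch of the formula above, the following plain-Python code computes the per-pixel endpoint error and averages it over masked pixels. The function names epe_map and masked_epe, the [channel][row][col] nested-list layout, and the choice to average over valid pixels are assumptions made for illustration; a training pipeline would normally do the same arithmetic on tensors.

```python
import math


def epe_map(flow, flow_gt, eps=1e-6):
    """Per-pixel endpoint error for flows stored as [channel][row][col] lists."""
    channels = len(flow)
    height, width = len(flow[0]), len(flow[0][0])
    errors = []
    for y in range(height):
        row = []
        for x in range(width):
            # Sum of squared differences over the flow channels at this pixel.
            sq = sum((flow[c][y][x] - flow_gt[c][y][x]) ** 2 for c in range(channels))
            row.append(math.sqrt(sq + eps))  # eps keeps the sqrt well-behaved at 0
        errors.append(row)
    return errors


def masked_epe(flow, flow_gt, mask, eps=1e-6):
    """Average EPE over pixels where mask is 1 (assumed reduction, see lead-in)."""
    errors = epe_map(flow, flow_gt, eps)
    total = sum(e * m for row_e, row_m in zip(errors, mask) for e, m in zip(row_e, row_m))
    count = sum(m for row in mask for m in row)
    return total / max(count, 1)  # guard against an all-zero mask
```

For a 1 x 2 flow field whose first pixel is off by (3, 4) and whose second pixel is masked out, masked_epe returns approximately 5, the length of the error vector at the only valid pixel.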

Census Transform: For a pixel p and its neighborhood N(p) of size k x k:

C(p) = [sign(I(q) - I(p)) for q in N(p)]

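The sign function has zero gradient almost everywhere, so it cannot be trained through directly. The normalized ternary variant replaces it with a smooth, differentiable approximation bounded in (-1, 1):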

T(p, q) = (I(q) - I(p)) / sqrt(0.81 + (I(q) - I(p))^2)

The robust Hamming distance between two census-transformed patches t1 and t2:

H(t1, t2) = mean((t1 - t2)^2 / (0.1 + (t1 - t2)^2))

Each term lies in [0, 1) and saturates toward 1 for large differences, so no single neighborhood entry can dominate the loss, which makes it robust to outliers.
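The soft census transform and robust Hamming distance above can be sketched in plain Python as follows. The names soft_census, robust_hamming, and census_loss are hypothetical, the default patch size of 3 is an arbitrary choice for the example, and restricting the loss to interior pixels (where the full patch fits) is an assumption of this sketch rather than a documented detail.

```python
import math


def soft_census(image, patch_size=3):
    """Soft ternary census descriptors for the interior pixels of a grayscale image.

    image is a list of rows; one descriptor (list of floats) is returned per
    pixel whose patch_size x patch_size neighborhood lies fully inside the image.
    """
    height, width = len(image), len(image[0])
    r = patch_size // 2
    descriptors = []
    for y in range(r, height - r):
        for x in range(r, width - r):
            center = image[y][x]
            desc = []
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    diff = image[y + dy][x + dx] - center
                    # T(p, q) = diff / sqrt(0.81 + diff^2)
                    desc.append(diff / math.sqrt(0.81 + diff * diff))
            descriptors.append(desc)
    return descriptors


def robust_hamming(t1, t2):
    """H(t1, t2) = mean(d / (0.1 + d)) with d the squared entry-wise difference."""
    sq = [(a - b) ** 2 for a, b in zip(t1, t2)]
    return sum(d / (0.1 + d) for d in sq) / len(sq)


def census_loss(img_a, img_b, patch_size=3):
    """Mean robust Hamming distance over all interior-pixel descriptors."""
    desc_a = soft_census(img_a, patch_size)
    desc_b = soft_census(img_b, patch_size)
    dists = [robust_hamming(a, b) for a, b in zip(desc_a, desc_b)]
    return sum(dists) / len(dists)
```

Because the descriptor depends only on differences to the center pixel, adding a constant brightness offset to one image leaves the loss at zero up to floating-point rounding, while a structural change such as mirroring the image gives a positive loss below 1.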

Sobel Edge Detection: The Sobel kernels approximate the image gradient:

G_x = [[1,0,-1],[2,0,-2],[1,0,-1]] * I

G_y = [[1,2,1],[0,0,0],[-1,-2,-1]] * I

The edge loss computes:

L_edge = |G_x(P) - G_x(T)| + |G_y(P) - G_y(T)|

where P is the predicted frame and T is the target frame.
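A minimal plain-Python sketch of this edge loss is given below. The helper names filter3x3 and sobel_edge_loss, the zero padding at the image border, and the mean over pixels are assumptions for illustration. The filter is applied as cross-correlation (no kernel flip), as deep learning frameworks typically do; for the Sobel kernels this only flips the sign of the response, which the absolute difference cancels.

```python
SOBEL_X = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]


def filter3x3(image, kernel):
    """Cross-correlate a grayscale image (list of rows) with a 3x3 kernel.

    Pixels outside the image are treated as 0 (zero padding), so the output
    has the same size as the input.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    qy, qx = y + ky - 1, x + kx - 1
                    if 0 <= qy < height and 0 <= qx < width:
                        acc += kernel[ky][kx] * image[qy][qx]
            out[y][x] = acc
    return out


def sobel_edge_loss(pred, target):
    """Mean over pixels of |G_x(P) - G_x(T)| + |G_y(P) - G_y(T)|."""
    total = 0.0
    for kernel in (SOBEL_X, SOBEL_Y):
        grad_p = filter3x3(pred, kernel)
        grad_t = filter3x3(target, kernel)
        total += sum(abs(a - b) for rp, rt in zip(grad_p, grad_t) for a, b in zip(rp, rt))
    return total / (len(pred) * len(pred[0]))
```

Comparing a sharp vertical step edge against a blurred version of the same edge yields a positive loss, while identical images yield exactly zero.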

VGG Perceptual Loss: For a pretrained network phi with feature extraction layers l_1, ..., l_K:

L_perceptual = sum_k w_k * ||phi_l_k(P) - phi_l_k(T)||_1

where w_k are per-layer weights that balance the contribution of different semantic levels. Input images are normalized with the ImageNet per-channel mean and standard deviation before feature extraction, matching the statistics the pretrained network was trained on.
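The structure of this loss can be sketched without loading a real network. In the code below, extract_features is a hypothetical callable standing in for the pretrained network phi, and toy_features is a hand-made placeholder used only to make the example runnable; neither reflects actual VGG layers. The per-layer term is a mean absolute difference, which differs from the L1 norm only by a constant factor that can be absorbed into w_k. The mean and standard deviation constants are the commonly used ImageNet values.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Normalize one RGB pixel in [0, 1] with ImageNet mean and std."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def perceptual_loss(pred, target, extract_features, layer_weights):
    """Weighted sum over layers of the mean absolute feature difference.

    pred, target: images given as flat lists of RGB pixels in [0, 1].
    extract_features: hypothetical callable returning one flat list of
    feature values per layer (a stand-in for phi_l_1, ..., phi_l_K).
    """
    feats_p = extract_features([normalize_pixel(p) for p in pred])
    feats_t = extract_features([normalize_pixel(p) for p in target])
    loss = 0.0
    for weight, fp, ft in zip(layer_weights, feats_p, feats_t):
        loss += weight * sum(abs(a - b) for a, b in zip(fp, ft)) / len(fp)
    return loss


def toy_features(pixels):
    """Placeholder extractor with two hand-made 'layers' (not a real network)."""
    layer1 = [c for px in pixels for c in px]   # raw per-channel values
    layer2 = [sum(px) / 3.0 for px in pixels]   # per-pixel channel mean
    return [layer1, layer2]
```

In practice extract_features would wrap a frozen pretrained network and return activations from the chosen layers; the weighting loop stays the same.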
