Principle: Zai org CogVideo Optical Flow Estimation
| Knowledge Sources | |
|---|---|
| Domains | Optical_Flow, Frame_Interpolation, Computer_Vision |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Optical flow estimation computes dense per-pixel displacement vectors between two consecutive video frames, representing the apparent motion of objects across time.
Description
Optical flow is a fundamental computer vision task that establishes pixel-level correspondences between two images. For each pixel in the reference frame, the optical flow field specifies a 2D displacement vector (dx, dy) indicating where that pixel has moved in the target frame. In the context of frame interpolation, intermediate flow estimation computes bidirectional flows from the target intermediate frame to both input frames, enabling direct synthesis of the in-between frame.
Modern neural network approaches to optical flow use coarse-to-fine architectures that estimate flow at multiple spatial scales. The input images are first processed at a coarse (low-resolution) scale to capture large displacements, and the flow is progressively refined at finer scales to recover small motions and sharp boundaries. Each refinement stage takes the current flow estimate, warps the input images accordingly, and predicts a residual correction.
Teacher-student distillation improves flow quality during training. A teacher network receives the ground-truth intermediate frame as additional input, producing a privileged flow estimate. The student network, which only sees the two input frames, is trained to match the teacher's flow through a distillation loss. At inference time, only the student network is used.
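The distillation objective described above can be sketched as a simple per-pixel regression toward the teacher's privileged estimate. This is a minimal numpy sketch, not the actual training code; the function name and the L1 form of the loss are illustrative assumptions.

```python
import numpy as np

def distillation_loss(student_flow, teacher_flow, weight=1.0):
    """L1 distillation loss between student and teacher flow fields.

    Both inputs are (H, W, 2) displacement fields. In a real training
    framework the teacher flow would be detached from the computation
    graph, so gradients only update the student.
    """
    return weight * np.mean(np.abs(student_flow - teacher_flow))
```

At inference only the student runs, so this term costs nothing at test time.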
Blending masks accompany flow estimates to handle occlusions and disocclusions. A learned mask determines how to weight the contributions of the two warped input frames when synthesizing the intermediate frame, with values near 1 favoring the frame warped from the first input and values near 0 favoring the frame warped from the second input.
Usage
Use optical flow estimation when synthesizing intermediate frames for video frame rate upsampling, slow-motion generation, temporal super-resolution, or any task requiring dense motion correspondence between video frames.
Theoretical Basis
The classical optical flow constraint assumes brightness constancy:
I(x, y, t) = I(x + dx, y + dy, t + dt)
Taking a first-order Taylor expansion yields the optical flow equation:
Ix * u + Iy * v + It = 0
where Ix, Iy are spatial gradients, It is the temporal gradient, and (u, v) is the flow vector.
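A single instance of this equation is under-determined (one equation, two unknowns: the aperture problem), so classical methods gather constraints over a local window and solve in the least-squares sense. The sketch below follows the Lucas-Kanade approach under that windowing assumption; it is illustrative, not a reference implementation.

```python
import numpy as np

def lucas_kanade_flow(Ix, Iy, It):
    """Solve Ix*u + Iy*v + It = 0 over a window in the least-squares sense.

    Ix, Iy, It are arrays of spatial/temporal gradients sampled over a
    local window. Stacking one equation per pixel gives the
    over-determined system A @ [u, v] = -It, solved by least squares.
    """
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)  # (N, 2)
    b = -It.ravel()                                 # (N,)
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

With gradients from a window containing texture in more than one direction, A has rank 2 and the flow is uniquely recovered; on an edge or flat region the system stays degenerate, which is exactly the aperture problem.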
In the intermediate flow formulation for frame interpolation at timestep t in [0, 1], the bidirectional intermediate flows F_t->0 and F_t->1 relate to the forward and backward flows F_0->1 and F_1->0 approximately as:
F_t->0 = -(1-t) * t * F_0->1 + t^2 * F_1->0
F_t->1 = (1-t)^2 * F_0->1 - t * (1-t) * F_1->0
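The two approximations above translate directly into code. This is a minimal sketch of that linear combination; it assumes locally linear motion between the input frames, and the function name is ours.

```python
import numpy as np

def intermediate_flows(F01, F10, t):
    """Approximate F_t->0 and F_t->1 from the frame-to-frame flows.

    F01, F10: (H, W, 2) flow fields F_0->1 and F_1->0.
    t: interpolation timestep in [0, 1].
    """
    Ft0 = -(1.0 - t) * t * F01 + t * t * F10
    Ft1 = (1.0 - t) ** 2 * F01 - t * (1.0 - t) * F10
    return Ft0, Ft1
```

A useful sanity check: for perfectly linear motion, F_1->0 = -F_0->1, and the formulas collapse to F_t->0 = -t * F_0->1 and F_t->1 = (1-t) * F_0->1, i.e. the flow simply scales with the temporal distance to each endpoint.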
The intermediate frame is synthesized by warping and blending:
I_t = M * warp(I_0, F_t->0) + (1 - M) * warp(I_1, F_t->1)
where M is the learned blending mask and warp denotes backward warping using bilinear interpolation.
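The warp-and-blend synthesis can be sketched directly with numpy. This is a single-channel, clamped-border sketch of backward warping with bilinear interpolation, not a production resampler; in practice frameworks provide this primitive (e.g. a grid-sample operation).

```python
import numpy as np

def backward_warp(img, flow):
    """Backward-warp img by flow with bilinear interpolation.

    img: (H, W) image. flow: (H, W, 2) displacements in (dx, dy) order.
    Each output pixel (y, x) samples img at (x + dx, y + dy),
    with coordinates clamped to the image border.
    """
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0] + wy * wx * img[y1, x1])

def synthesize_intermediate(I0, I1, Ft0, Ft1, M):
    """I_t = M * warp(I_0, F_t->0) + (1 - M) * warp(I_1, F_t->1)."""
    return M * backward_warp(I0, Ft0) + (1 - M) * backward_warp(I1, Ft1)
```

Note that both terms use backward warping: the intermediate frame's pixel grid is the sampling grid, which avoids the holes that forward warping would leave.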
The multi-scale refinement formulates flow estimation as iterative residual learning:
F^(s) = F^(s-1) + delta_F^(s)
where each stage s predicts a residual flow correction delta_F^(s) conditioned on the warped images at the current flow estimate.
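The coarse-to-fine recursion can be sketched as a loop over pyramid levels. The residual predictor is abstracted as a callback standing in for the per-scale network stage, and the nearest-neighbour 2x upsampling with doubled flow magnitudes is an assumed (common, but not universal) choice.

```python
import numpy as np

def upsample2x(flow):
    """Nearest-neighbour 2x upsampling of a (h, w, 2) flow field.

    Flow values are doubled because displacements are measured in
    pixels, and the pixel grid is twice as fine at the next scale.
    """
    return 2.0 * flow.repeat(2, axis=0).repeat(2, axis=1)

def coarse_to_fine_flow(predict_residual, num_levels, H, W):
    """Iterate F^(s) = up(F^(s-1)) + delta_F^(s) from coarse to fine.

    predict_residual(s, flow) is a hypothetical callback for stage s:
    in a real model it would warp the inputs by `flow` and run a
    network to predict the residual correction delta_F^(s).
    """
    scale = 2 ** (num_levels - 1)
    flow = np.zeros((H // scale, W // scale, 2))  # F^(0) = 0 at coarsest scale
    for s in range(num_levels):
        if s > 0:
            flow = upsample2x(flow)
        flow = flow + predict_residual(s, flow)
    return flow
```

Starting from zero flow at the coarsest scale keeps each stage's job small: it only has to explain the motion left over after warping by the current estimate.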