Implementation:Zai org CogVideo IFNet HDv3

Knowledge Sources	Zai_org_CogVideo
Domains	Video_Generation, Optical_Flow, Frame_Interpolation
Last Updated	2026-02-10 00:00 GMT

Overview

IFNet HDv3 is a lightweight variant of the Intermediate Flow Network that uses symmetric bidirectional flow averaging and omits Contextnet/Unet refinement for faster HD video frame interpolation.

Description

IFNet HDv3 implements the RIFE HD v3 architecture, a streamlined flow estimation network optimized for high-definition video. Unlike the original IFNet, this variant estimates flow symmetrically: at each scale it runs the IFBlock twice, once per temporal direction, and averages the two results. This symmetric processing enforces temporal consistency without requiring a separate teacher network.

Each IFBlock uses four residual blocks, each a pair of convolutions (instead of eight sequential blocks in the original), followed by two separate transposed-convolution heads: conv1 produces a 4-channel optical flow (two 2-channel fields, one per temporal direction) and conv2 produces a 1-channel blending mask. The block takes the concatenated warped frames and mask, plus the accumulated flow, as input. It resizes them to the working scale with bilinear interpolation, passing recompute_scale_factor=False explicitly, and outputs residual corrections to the flow and mask.
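The resizing steps described above determine which input sizes pass through a block unchanged. The sketch below traces one spatial dimension through a block using plain integer arithmetic. It assumes, as in upstream RIFE and not stated on this page, that the block's stem halves the resolution twice with stride-2 3x3 convolutions and that each head restores it with two stride-2 transposed convolutions (kernel 4, padding 1). The function names are illustrative and do not exist in the repository.

```python
import math

def conv_out(n, kernel=3, stride=2, padding=1):
    # Standard output-size formula for a strided convolution
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out(n, kernel=4, stride=2, padding=1):
    # Standard output-size formula for a transposed convolution
    return (n - 1) * stride - 2 * padding + kernel

def ifblock_output_size(n, scale):
    """Trace one spatial dimension through a block (assumed layout)."""
    n = math.floor(n * (1.0 / scale))   # bilinear downscale to the working scale
    n = conv_out(conv_out(n))           # assumed stem: two stride-2 convolutions
    n = deconv_out(deconv_out(n))       # assumed head: two stride-2 transposed convs
    return math.floor(n * scale)        # bilinear upscale back to full resolution

print(ifblock_output_size(720, 4))  # 720: size preserved
print(ifblock_output_size(724, 4))  # 736: would no longer match the 724-pixel input
```

Under these assumptions, a dimension survives the round trip only when it is divisible by 4 times the scale, which is 16 for the default scale_list.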

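The IFNet forward pass initializes flow and mask as zero tensors, then iterates through three scales. At each scale, the block is called twice: once with the frames in their original order (giving f0) and once with the frame order swapped and the mask negated (giving f1). The two results are averaged, where swap reorders the channels of f1 so that its two directional flow fields line up with those of f0: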

flow = flow + (f0 + swap(f1)) / 2
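The update rule above can be traced on a single pixel without any tensor library. In this minimal sketch a flow value is a 4-element list (two channels per direction), and swap exchanges its two halves; on tensors the same reordering could be written as torch.cat((f1[:, 2:4], f1[:, :2]), 1). The mask update shown here, which subtracts the second pass's mask residual because that pass saw the negated mask, is an assumption based on upstream RIFE and is not stated on this page. All function names are illustrative.

```python
def swap(flow4):
    # Exchange the two 2-channel directional halves of a 4-channel flow value
    return flow4[2:4] + flow4[0:2]

def symmetric_update(flow, mask, f0, m0, f1, m1):
    """Average the forward pass (f0, m0) with the swapped reverse pass (f1, m1)."""
    f1_aligned = swap(f1)
    new_flow = [a + (b + c) / 2 for a, b, c in zip(flow, f0, f1_aligned)]
    # Assumed mask rule: the reverse pass predicts the negated mask residual
    new_mask = mask + (m0 - m1) / 2
    return new_flow, new_mask

# If the reverse pass is exactly the swapped forward pass, averaging changes nothing
flow, mask = symmetric_update([0.0] * 4, 0.0,
                              [1.0, 2.0, 3.0, 4.0], 0.4,
                              [3.0, 4.0, 1.0, 2.0], -0.4)
print(flow)  # [1.0, 2.0, 3.0, 4.0]
```

When the two passes disagree, the average pulls the estimate toward a flow that is consistent in both temporal directions.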

The Contextnet and Unet refinement modules are commented out, so the output relies purely on flow-based warping and mask blending.
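With no refinement stage, the interpolated pixel is simply a mask-weighted mix of the two warped frames. The sketch below shows that blend for a single pixel value. It assumes the sigmoid-activated mask weights the first warped frame and its complement weights the second, as in upstream RIFE; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def blend(warped0, warped1, mask_logit):
    """Mix two warped pixel values using the sigmoid of the raw mask value."""
    m = sigmoid(mask_logit)
    return m * warped0 + (1.0 - m) * warped1

# A zero mask value weights both warped frames equally
print(blend(0.2, 0.8, 0.0))  # 0.5
```

A large positive mask value drives the result toward the first warped frame, and a large negative one toward the second.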

Usage

Use IFNet HDv3 as the flow estimation backbone for the RIFE HD v3 pipeline. This is the variant actually employed in the Gradio composite demo for video frame interpolation, offering faster inference compared to teacher-distillation variants.

Code Reference

Source Location

Repository: Zai_org_CogVideo
File: inference/gradio_composite_demo/rife/IFNet_HDv3.py

Signature

class IFBlock(nn.Module):
    def __init__(self, in_planes, c=64)
    def forward(self, x, flow, scale=1) -> Tuple[torch.Tensor, torch.Tensor]

class IFNet(nn.Module):
    def __init__(self)
    def forward(self, x, scale_list=[4, 2, 1], training=False) -> Tuple[list, torch.Tensor, list]

Import

from inference.gradio_composite_demo.rife.IFNet_HDv3 import IFNet, IFBlock

I/O Contract

Inputs

IFNet.forward:

Name	Type	Required	Description
x	torch.Tensor	Yes	Concatenated input frames along the channel dimension. Shape: (B, 2*C, H, W), where C is the number of image channels. The tensor is split in half to obtain img0 and img1
scale_list	list[int]	No	Multi-scale factors for the three IFBlock stages, default [4, 2, 1]
training	bool	No	Training mode flag, default False. When False, x is split into two equal halves along the channel dimension to obtain img0 and img1

IFBlock.forward:

Name	Type	Required	Description
x	torch.Tensor	Yes	Warped frames and current mask, concatenated along the channel dimension
flow	torch.Tensor	Yes	Accumulated optical flow tensor of shape (B, 4, H, W)
scale	int	No	Current scale factor for bilinear interpolation, default 1

Outputs

IFNet.forward:

Name	Type	Description
flow_list	list[torch.Tensor]	Optical flow fields at each scale, each of shape (B, 4, H, W)
mask	torch.Tensor	Final sigmoid-activated blending mask of shape (B, 1, H, W)
merged	list[torch.Tensor]	Blended interpolated frames at each scale, each of shape (B, C, H, W)

Usage Examples

import torch
from inference.gradio_composite_demo.rife.IFNet_HDv3 import IFNet

# Weights are randomly initialized here; load a trained checkpoint for real use
model = IFNet()
model.eval()

# Concatenate two input frames along the channel dimension
# (random tensors stand in for real frames)
img0 = torch.randn(1, 3, 720, 1280)  # HD frame
img1 = torch.randn(1, 3, 720, 1280)
x = torch.cat((img0, img1), dim=1)   # (1, 6, 720, 1280)

with torch.no_grad():
    flow_list, mask, merged = model(x, scale_list=[4, 2, 1])
    interpolated_frame = merged[2]  # Final scale result: (1, 3, 720, 1280)
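The 720 x 1280 frames in the example above happen to be divisible by 16 in both dimensions. For other resolutions, a common approach is to pad the frames before inference and crop the result afterwards. The helper below is a hypothetical sketch that computes the padded size. It assumes each block reduces resolution by a further factor of 4 internally (stem_factor, an illustrative parameter based on upstream RIFE), so the required multiple is the largest scale times 4.

```python
def padded_size(height, width, scale_list=(4, 2, 1), stem_factor=4):
    """Round a frame size up to the nearest multiple the network can handle.

    stem_factor is an assumed internal downsampling factor, not a repository setting.
    """
    multiple = max(scale_list) * stem_factor
    pad_h = (-height) % multiple   # rows to add at the bottom
    pad_w = (-width) % multiple    # columns to add on the right
    return height + pad_h, width + pad_w

print(padded_size(720, 1280))   # (720, 1280): already aligned
print(padded_size(1080, 1920))  # (1088, 1920): 8 rows of padding needed
```

The padding itself could then be applied with torch.nn.functional.pad, and the interpolated frame cropped back to the original height and width.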

Related Pages

Principle:Zai_org_CogVideo_Optical_Flow_Estimation
