Implementation:Zai org CogVideo LPIPS

Knowledge Sources	Zai_org_CogVideo
Domains	Video_Generation, Perceptual_Loss
Last Updated	2026-02-10 00:00 GMT

Overview

Implements the LPIPS (Learned Perceptual Image Patch Similarity) metric using a pretrained VGG16 backbone to measure perceptual distance between images in a way that correlates with human judgment.

Description

This module provides the core LPIPS implementation, adapted from the original LPIPS reference implementation. It consists of four classes and two utility functions:

LPIPS -- The main metric class. It extracts features from five VGG16 layers (relu1_2, relu2_2, relu3_3, relu4_3, relu5_3), normalizes them along the channel dimension, computes squared differences between the feature maps of the two input images, applies a learned linear weighting layer (NetLinLayer) per VGG layer, averages each result over the spatial dimensions, and sums across layers to produce one perceptual distance per sample. All parameters are frozen after loading pretrained weights.
ScalingLayer -- Shifts and scales each input channel to the VGG-expected distribution using fixed shift and scale buffers. In the original LPIPS implementation these constants assume inputs in the [-1, 1] range, not [0, 1].
NetLinLayer -- A 1x1 convolution (optionally preceded by dropout) that learns to weight the contribution of each VGG feature layer to the final perceptual distance.
vgg16 -- A wrapper around torchvision.models.vgg16 that exposes five intermediate feature maps as named outputs (relu1_2, relu2_2, relu3_3, relu4_3, relu5_3) by slicing the pretrained feature extractor.
normalize_tensor() -- L2-normalizes a tensor along the channel dimension.
spatial_average() -- Computes the mean over spatial dimensions (H, W).
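
The pipeline described above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the repository's code: the helper bodies are written from the descriptions in this list, random tensors stand in for VGG16 feature maps, and `layer_distance`, `lpips_like_distance`, and the untrained 1x1 convolutions are hypothetical names introduced here.

```python
import torch
import torch.nn as nn


def normalize_tensor(x, eps=1e-10):
    # L2-normalize along the channel dimension (dim=1).
    norm = torch.sqrt(torch.sum(x ** 2, dim=1, keepdim=True))
    return x / (norm + eps)


def spatial_average(x, keepdim=True):
    # Mean over the spatial dimensions (H, W).
    return x.mean(dim=(2, 3), keepdim=keepdim)


def layer_distance(feat_a, feat_b, lin):
    # Hypothetical helper: squared difference of unit-normalized features,
    # weighted per channel by a 1x1 convolution, then averaged over H and W.
    diff = (normalize_tensor(feat_a) - normalize_tensor(feat_b)) ** 2
    return spatial_average(lin(diff), keepdim=True)


def lpips_like_distance(feats_a, feats_b, lins):
    # Hypothetical helper: sum per-layer distances into one value per sample.
    total = 0
    for fa, fb, lin in zip(feats_a, feats_b, lins):
        total = total + layer_distance(fa, fb, lin)
    return total


# Random stand-ins for the five feature maps (channel counts assumed for a
# standard VGG16; spatial sizes chosen arbitrarily).
torch.manual_seed(0)
channels = [64, 128, 256, 512, 512]
sizes = [32, 16, 8, 4, 2]
feats_a = [torch.randn(2, c, s, s) for c, s in zip(channels, sizes)]
feats_b = [torch.randn(2, c, s, s) for c, s in zip(channels, sizes)]

# Untrained 1x1 convolutions stand in for the learned linear layers.
lins = [nn.Conv2d(c, 1, kernel_size=1, bias=False) for c in channels]

with torch.no_grad():
    distance = lpips_like_distance(feats_a, feats_b, lins)
    self_distance = lpips_like_distance(feats_a, feats_a, lins)

print(distance.shape)  # torch.Size([2, 1, 1, 1])
```

Because the features are normalized before the difference is taken, identical inputs give a distance of exactly zero regardless of the linear weights.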

Usage

Used as the perceptual loss function throughout the autoencoding module. Referenced by GeneralLPIPSWithDiscriminator for adversarial autoencoder training and by LatentLPIPS for latent-space perceptual loss computation. Can also be used standalone as a perceptual similarity metric for evaluation.

Code Reference

Source Location

Repository: Zai_org_CogVideo
File: sat/sgm/modules/autoencoding/lpips/loss/lpips.py

Signature

class LPIPS(nn.Module):
    def __init__(self, use_dropout=True)
    def load_from_pretrained(self, name="vgg_lpips")
    @classmethod
    def from_pretrained(cls, name="vgg_lpips") -> "LPIPS"
    def forward(self, input, target) -> torch.Tensor

class ScalingLayer(nn.Module):
    def __init__(self)
    def forward(self, inp) -> torch.Tensor

class NetLinLayer(nn.Module):
    def __init__(self, chn_in, chn_out=1, use_dropout=False)

class vgg16(nn.Module):
    def __init__(self, requires_grad=False, pretrained=True)
    def forward(self, X) -> VggOutputs

def normalize_tensor(x, eps=1e-10) -> torch.Tensor
def spatial_average(x, keepdim=True) -> torch.Tensor
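
To make the two small layers concrete, the sketch below shows one way classes with these signatures are commonly written. It is an illustration under stated assumptions, not a copy of the file: the `Sketch` class names are hypothetical, and the shift and scale constants are those of the original LPIPS reference implementation, which may differ from what this repository ships.

```python
import torch
import torch.nn as nn


class ScalingLayerSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Constants assumed from the original LPIPS implementation.
        # Buffers move with the module but are not trainable parameters.
        shift = torch.tensor([-0.030, -0.088, -0.188])[None, :, None, None]
        scale = torch.tensor([0.458, 0.448, 0.450])[None, :, None, None]
        self.register_buffer("shift", shift)
        self.register_buffer("scale", scale)

    def forward(self, inp):
        return (inp - self.shift) / self.scale


class NetLinLayerSketch(nn.Module):
    def __init__(self, chn_in, chn_out=1, use_dropout=False):
        super().__init__()
        layers = [nn.Dropout()] if use_dropout else []
        layers.append(
            nn.Conv2d(chn_in, chn_out, kernel_size=1, stride=1, padding=0, bias=False)
        )
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        # Added for convenience in this sketch; the signature above lists
        # only __init__ for NetLinLayer.
        return self.model(x)


scaling = ScalingLayerSketch()
lin = NetLinLayerSketch(64, use_dropout=True).eval()  # eval() disables dropout

images = torch.rand(2, 3, 16, 16) * 2 - 1  # placeholder images in [-1, 1]
scaled = scaling(images)
weighted = lin(torch.rand(2, 64, 8, 8))
print(scaled.shape, weighted.shape)
```

Registering the constants as buffers keeps them out of the optimizer while still letting them follow `.to(device)` calls.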

Import

from sat.sgm.modules.autoencoding.lpips.loss.lpips import LPIPS

I/O Contract

Inputs (LPIPS.forward)

Name	Type	Required	Description
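input	`torch.Tensor`	Yes	First image tensor of shape `[B, 3, H, W]`; values expected in [-1, 1], the range assumed by the ScalingLayer constants in the original LPIPS implementation (rescale [0, 1] images first)
target	`torch.Tensor`	Yes	Second image tensor of shape `[B, 3, H, W]`, same range and shape as input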

Outputs (LPIPS.forward)

Name	Type	Description
distance	`torch.Tensor`	Perceptual distance of shape `[B, 1, 1, 1]`; lower values indicate higher similarity
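
A short sketch of how a caller might satisfy this contract: rescaling images stored in [0, 1] to [-1, 1], the range the original LPIPS constants assume, and reducing the `[B, 1, 1, 1]` output to a scalar loss. The helper names `to_signed_range` and `reduce_distance` are hypothetical, and a simple mean-squared stand-in replaces the real metric so the snippet runs without pretrained weights.

```python
import torch


def to_signed_range(images_01):
    # Hypothetical helper: map values from [0, 1] to [-1, 1].
    return images_01 * 2.0 - 1.0


def reduce_distance(distance):
    # Hypothetical helper: collapse [B, 1, 1, 1] to a scalar by averaging.
    return distance.mean()


def stand_in_metric(input, target):
    # Placeholder with the same output shape as LPIPS.forward: [B, 1, 1, 1].
    # It is NOT a perceptual metric; it only mimics the shape contract.
    return ((input - target) ** 2).mean(dim=(1, 2, 3), keepdim=True)


images_a = torch.rand(4, 3, 64, 64)  # placeholder images in [0, 1]
images_b = torch.rand(4, 3, 64, 64)

distance = stand_in_metric(to_signed_range(images_a), to_signed_range(images_b))
loss = reduce_distance(distance)
print(distance.shape, loss.shape)
```

Swapping `stand_in_metric` for an `LPIPS` instance should leave the surrounding code unchanged, since both return one value per sample in a `[B, 1, 1, 1]` tensor.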

Usage Examples

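import torch
import torch.nn.functional as F
from sat.sgm.modules.autoencoding.lpips.loss.lpips import LPIPS

# Initialize frozen LPIPS metric
lpips = LPIPS(use_dropout=True).eval()

# Placeholder inputs: two RGB batches of shape [B, 3, H, W], scaled to [-1, 1]
image_a = torch.rand(4, 3, 256, 256) * 2 - 1
image_b = torch.rand(4, 3, 256, 256) * 2 - 1

# Compute perceptual distance between two images
distance = lpips(image_a, image_b)  # shape: [B, 1, 1, 1]
print(f"Perceptual distance: {distance.mean().item():.4f}")

# Use as a loss function during training
perceptual_loss = distance.mean()
reconstruction_loss = F.l1_loss(image_a, image_b)  # placeholder pixel-space loss
total_loss = reconstruction_loss + 0.5 * perceptual_loss  # 0.5 is an illustrative weight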

Related Pages

Principle:Zai_org_CogVideo_Learned_Perceptual_Similarity
