Implementation:Zai org CogVideo LPIPS

From Leeroopedia


Knowledge Sources
Domains: Video_Generation, Perceptual_Loss
Last Updated: 2026-02-10 00:00 GMT

Overview

Implements the LPIPS (Learned Perceptual Image Patch Similarity) metric using a pretrained VGG16 backbone to measure perceptual distance between images in a way that correlates with human judgment.

Description

This module provides the core LPIPS implementation, adapted from the original repository. It consists of several classes and utility functions:

  • LPIPS -- The main metric class. For each of five VGG16 layers (relu1_2 through relu5_3) it extracts feature maps for both input images, L2-normalizes them along the channel dimension, and takes the squared difference. A learned linear weighting layer (NetLinLayer) per VGG layer reduces each difference map to one channel; the results are averaged spatially and summed across layers to give one perceptual distance per image pair. All parameters are frozen after the pretrained weights are loaded.
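  • ScalingLayer -- Shifts and scales input images to the VGG-expected distribution using fixed shift and scale buffers. The constants inherited from the original LPIPS implementation assume inputs in the [-1, 1] range, not [0, 1].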
  • NetLinLayer -- A 1x1 convolution (optionally preceded by dropout) that applies learned per-channel weights to one VGG layer's squared-difference map and collapses it to chn_out channels (1 by default). One NetLinLayer is used per VGG feature layer.
  • vgg16 -- A wrapper around torchvision.models.vgg16 that exposes five intermediate feature maps as named outputs (relu1_2, relu2_2, relu3_3, relu4_3, relu5_3) by slicing the pretrained feature extractor.
  • normalize_tensor() -- L2-normalizes a tensor along the channel dimension.
  • spatial_average() -- Computes the mean over spatial dimensions (H, W).
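
The arithmetic described in the list above can be sketched outside PyTorch. The following is a minimal NumPy illustration, not the module's own code: the feature maps and weights are random stand-ins, layer_distance is a hypothetical helper name, and the 1x1 convolution is assumed to have a single output channel and no bias, so it reduces to a weighted sum over channels.

```python
import numpy as np

def normalize_tensor(x, eps=1e-10):
    """L2-normalize a [B, C, H, W] array along the channel axis."""
    norm = np.sqrt(np.sum(x ** 2, axis=1, keepdims=True))
    return x / (norm + eps)

def spatial_average(x, keepdims=True):
    """Mean over the spatial axes (H, W)."""
    return x.mean(axis=(2, 3), keepdims=keepdims)

def layer_distance(feat_a, feat_b, weights):
    """Per-layer term: weighted squared difference of unit-normalized features.

    `weights` has shape [C] and stands in for a NetLinLayer 1x1 convolution
    (assumed here: one output channel, no bias).
    """
    diff = (normalize_tensor(feat_a) - normalize_tensor(feat_b)) ** 2
    weighted = np.sum(diff * weights[None, :, None, None], axis=1, keepdims=True)
    return spatial_average(weighted)

rng = np.random.default_rng(0)
# Hypothetical feature maps for two layers (64 and 128 channels, small spatial size)
shapes = [(2, 64, 8, 8), (2, 128, 4, 4)]
feats_a = [rng.standard_normal(s) for s in shapes]
feats_b = [rng.standard_normal(s) for s in shapes]
weights = [rng.random(s[1]) for s in shapes]  # random stand-ins for learned weights

# Sum the per-layer terms to get one distance per image pair
distance = sum(layer_distance(a, b, w) for a, b, w in zip(feats_a, feats_b, weights))
print(distance.shape)  # (2, 1, 1, 1)
```

Because each feature vector is normalized to unit length before differencing, the per-layer term depends on the direction of the features rather than their magnitude, and identical inputs give a distance of zero.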

Usage

LPIPS serves as the perceptual loss function throughout the autoencoding module. It is referenced by GeneralLPIPSWithDiscriminator for adversarial autoencoder training and by LatentLPIPS for latent-space perceptual loss computation. It can also be used standalone as a perceptual similarity metric for evaluation.

Code Reference

Source Location

  • Repository: Zai_org_CogVideo
  • File: sat/sgm/modules/autoencoding/lpips/loss/lpips.py

Signature

class LPIPS(nn.Module):
    def __init__(self, use_dropout=True)
    def load_from_pretrained(self, name="vgg_lpips")
    @classmethod
    def from_pretrained(cls, name="vgg_lpips") -> "LPIPS"
    def forward(self, input, target) -> torch.Tensor

class ScalingLayer(nn.Module):
    def __init__(self)
    def forward(self, inp) -> torch.Tensor

class NetLinLayer(nn.Module):
    def __init__(self, chn_in, chn_out=1, use_dropout=False)

class vgg16(nn.Module):
    def __init__(self, requires_grad=False, pretrained=True)
    def forward(self, X) -> VggOutputs

def normalize_tensor(x, eps=1e-10) -> torch.Tensor
def spatial_average(x, keepdim=True) -> torch.Tensor

Import

from sat.sgm.modules.autoencoding.lpips.loss.lpips import LPIPS

I/O Contract

Inputs (LPIPS.forward)

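Name | Type | Required | Description
input | torch.Tensor | Yes | First image tensor of shape [B, 3, H, W]. Values are expected in [-1, 1], the range the ScalingLayer constants assume; rescale [0, 1] images with 2x - 1 first
target | torch.Tensor | Yes | Second image tensor of shape [B, 3, H, W], same range as input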
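
The input range matters because of the fixed constants in ScalingLayer. The sketch below checks the arithmetic in plain Python, assuming the shift and scale values of the original LPIPS release (this page does not list them, so treat them as an assumption) and assuming ScalingLayer computes (x - shift) / scale. The helper names are hypothetical. Under those assumptions, rescaling a [0, 1] pixel to [-1, 1] and applying the layer matches conventional ImageNet normalization.

```python
# Assumed constants from the original LPIPS release (not listed on this page).
SHIFT = (-0.030, -0.088, -0.188)
SCALE = (0.458, 0.448, 0.450)
# Standard ImageNet per-channel statistics for images in [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def to_signed_range(v):
    """Map a pixel value from [0, 1] to [-1, 1]."""
    return 2.0 * v - 1.0

def scaling_layer(v, channel):
    """Apply (v - shift) / scale for one channel (assumed ScalingLayer behavior)."""
    return (v - SHIFT[channel]) / SCALE[channel]

def imagenet_normalize(v, channel):
    """Conventional ImageNet normalization of a [0, 1] pixel value."""
    return (v - IMAGENET_MEAN[channel]) / IMAGENET_STD[channel]

# The two paths agree for every channel and pixel value tested
for c in range(3):
    for v in (0.0, 0.25, 1.0):
        a = scaling_layer(to_signed_range(v), c)
        b = imagenet_normalize(v, c)
        assert abs(a - b) < 1e-9
```

If [0, 1] images were passed directly, the layer would still run, but the VGG backbone would see inputs with the wrong mean and contrast, so the resulting distances would not match those of the pretrained metric.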

Outputs (LPIPS.forward)

Name | Type | Description
distance | torch.Tensor | Perceptual distance of shape [B, 1, 1, 1]; lower values indicate higher similarity

Usage Examples

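import torch
import torch.nn.functional as F

from sat.sgm.modules.autoencoding.lpips.loss.lpips import LPIPS

# Initialize the frozen LPIPS metric; eval() disables dropout
lpips = LPIPS(use_dropout=True).eval()

# Placeholder inputs: two image batches of shape [B, 3, H, W] with values in [-1, 1]
image_a = torch.rand(4, 3, 256, 256) * 2 - 1
image_b = torch.rand(4, 3, 256, 256) * 2 - 1

# Compute perceptual distance between the two batches
distance = lpips(image_a, image_b)  # shape: [B, 1, 1, 1]
print(f"Perceptual distance: {distance.mean().item():.4f}")

# Use as a loss term during training; here image_b stands in for a
# reconstruction of image_a, and the 0.5 weight is illustrative
reconstruction_loss = F.l1_loss(image_b, image_a)
perceptual_loss = distance.mean()
total_loss = reconstruction_loss + 0.5 * perceptual_loss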
