Implementation:Zai org CogVideo LPIPS
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Perceptual_Loss |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements the LPIPS (Learned Perceptual Image Patch Similarity) metric using a pretrained VGG16 backbone to measure perceptual distance between images in a way that correlates with human judgment.
Description
This module provides the core LPIPS implementation, adapted from the original repository. It consists of several classes and utility functions:
- LPIPS -- The main metric class. It extracts features from five VGG16 layers (relu1_2, relu2_2, relu3_3, relu4_3, relu5_3), normalizes them along the channel dimension, computes squared differences between the feature maps of the two input images, applies learned linear weighting layers (NetLinLayer), averages each result over the spatial dimensions, and sums over layers to produce one perceptual distance per image pair. All parameters are frozen after loading pretrained weights.
- ScalingLayer -- Shifts and scales input images from the [-1, 1] range to the VGG-expected distribution using fixed shift and scale buffers.
- NetLinLayer -- A 1x1 convolution (optionally preceded by dropout) that learns to weight the contribution of each VGG feature layer to the final perceptual distance.
- vgg16 -- A wrapper around torchvision.models.vgg16 that exposes five intermediate feature maps as named outputs (relu1_2, relu2_2, relu3_3, relu4_3, relu5_3) by slicing the pretrained feature extractor.
- normalize_tensor() -- L2-normalizes a tensor along the channel dimension.
- spatial_average() -- Computes the mean over the spatial dimensions (H, W).
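The computation described for LPIPS can be sketched on stand-in tensors. This is a minimal illustration under stated assumptions, not the module's code: random feature maps replace the VGG16 activations, the 1x1 convolutions are freshly initialised rather than pretrained, and lpips_from_features is a hypothetical helper name. The bodies of normalize_tensor and spatial_average are written from the descriptions above and are assumed to match the source.

```python
import torch
import torch.nn as nn


def normalize_tensor(x, eps=1e-10):
    # L2-normalize along the channel dimension (dim=1).
    norm = torch.sqrt(torch.sum(x ** 2, dim=1, keepdim=True))
    return x / (norm + eps)


def spatial_average(x, keepdim=True):
    # Mean over the spatial dimensions H and W.
    return x.mean([2, 3], keepdim=keepdim)


def lpips_from_features(feats_a, feats_b, lin_layers):
    # Hypothetical helper: sums the weighted, spatially averaged squared
    # differences of channel-normalized features over all layers.
    total = 0
    for fa, fb, lin in zip(feats_a, feats_b, lin_layers):
        diff = (normalize_tensor(fa) - normalize_tensor(fb)) ** 2
        total = total + spatial_average(lin(diff), keepdim=True)
    return total


torch.manual_seed(0)
channels = [64, 128, 256, 512, 512]   # VGG16 widths at relu1_2 ... relu5_3
sizes = [32, 16, 8, 4, 2]             # stand-in spatial resolutions
batch = 2

# Random tensors stand in for the VGG16 activations of two images.
feats_a = [torch.rand(batch, c, s, s) for c, s in zip(channels, sizes)]
feats_b = [torch.rand(batch, c, s, s) for c, s in zip(channels, sizes)]

# Untrained 1x1 convolutions stand in for the learned NetLinLayer weights.
lin_layers = [nn.Conv2d(c, 1, kernel_size=1, bias=False) for c in channels]

with torch.no_grad():
    distance = lpips_from_features(feats_a, feats_b, lin_layers)
    self_distance = lpips_from_features(feats_a, feats_a, lin_layers)

print(distance.shape)  # torch.Size([2, 1, 1, 1])
```

Because the weights here are untrained, the printed values carry no perceptual meaning and may even be negative; the sketch only shows the shape of the computation. Comparing a set of features with itself gives exactly zero, since every squared difference vanishes.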
Usage
Used as the perceptual loss function throughout the autoencoding module. Referenced by GeneralLPIPSWithDiscriminator for adversarial autoencoder training and by LatentLPIPS for latent-space perceptual loss computation. Can also be used standalone as a perceptual similarity metric for evaluation.
Code Reference
Source Location
- Repository: Zai_org_CogVideo
- File: sat/sgm/modules/autoencoding/lpips/loss/lpips.py
Signature
class LPIPS(nn.Module):
    def __init__(self, use_dropout=True)
    def load_from_pretrained(self, name="vgg_lpips")
    @classmethod
    def from_pretrained(cls, name="vgg_lpips") -> "LPIPS"
    def forward(self, input, target) -> torch.Tensor

class ScalingLayer(nn.Module):
    def __init__(self)
    def forward(self, inp) -> torch.Tensor

class NetLinLayer(nn.Module):
    def __init__(self, chn_in, chn_out=1, use_dropout=False)

class vgg16(nn.Module):
    def __init__(self, requires_grad=False, pretrained=True)
    def forward(self, X) -> VggOutputs

def normalize_tensor(x, eps=1e-10) -> torch.Tensor
def spatial_average(x, keepdim=True) -> torch.Tensor
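The two small layers in the signature list can be sketched as follows. The class names carry a Sketch suffix to make clear they are illustrations, not the module's classes. The shift and scale constants are those of the original LPIPS repository and are assumed to be unchanged in this file; the forward method on the NetLinLayer sketch is added for convenience and is not part of the listed signature.

```python
import torch
import torch.nn as nn


class ScalingLayerSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Constants as in the original LPIPS repository (assumed unchanged here).
        self.register_buffer("shift", torch.tensor([-0.030, -0.088, -0.188])[None, :, None, None])
        self.register_buffer("scale", torch.tensor([0.458, 0.448, 0.450])[None, :, None, None])

    def forward(self, inp):
        return (inp - self.shift) / self.scale


class NetLinLayerSketch(nn.Module):
    def __init__(self, chn_in, chn_out=1, use_dropout=False):
        super().__init__()
        layers = [nn.Dropout()] if use_dropout else []
        # 1x1 convolution without bias: a learned per-channel weighting.
        layers.append(nn.Conv2d(chn_in, chn_out, kernel_size=1, stride=1, padding=0, bias=False))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)


# The buffers equal the ImageNet mean/std re-expressed for inputs in [-1, 1].
mean = torch.tensor([0.485, 0.456, 0.406])[None, :, None, None]
std = torch.tensor([0.229, 0.224, 0.225])[None, :, None, None]

scaling = ScalingLayerSketch()
x = torch.rand(1, 3, 8, 8) * 2 - 1            # stand-in image in [-1, 1]
scaled = scaling(x)
imagenet_style = ((x + 1) / 2 - mean) / std   # usual normalization of a [0, 1] image

lin = NetLinLayerSketch(chn_in=64).eval()
weighted = lin(torch.rand(1, 64, 8, 8))       # one output channel per spatial position
```

Registering shift and scale as buffers rather than parameters keeps them out of the optimizer while still moving them with the module across devices.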
Import
from sat.sgm.modules.autoencoding.lpips.loss.lpips import LPIPS
I/O Contract
Inputs (LPIPS.forward)
| Name | Type | Required | Description |
|---|---|---|---|
| input | torch.Tensor | Yes | First image tensor of shape [B, 3, H, W]; values expected in [-1, 1] |
| target | torch.Tensor | Yes | Second image tensor of shape [B, 3, H, W]; same range as input |
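The fixed shift and scale constants of the original LPIPS implementation assume inputs in [-1, 1]. If images are stored in [0, 1], one option is to rescale them before calling the metric. The helper below is a hypothetical sketch (to_lpips_range is not part of the module) that also checks the expected [B, 3, H, W] layout.

```python
import torch


def to_lpips_range(images):
    # Hypothetical helper: validate the layout and map [0, 1] images to [-1, 1].
    if images.dim() != 4 or images.shape[1] != 3:
        raise ValueError(f"expected [B, 3, H, W], got {tuple(images.shape)}")
    return images * 2.0 - 1.0


images_01 = torch.rand(4, 3, 64, 64)     # stand-in batch with values in [0, 1]
images_pm1 = to_lpips_range(images_01)   # same batch with values in [-1, 1]
```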
Outputs (LPIPS.forward)
| Name | Type | Description |
|---|---|---|
| distance | torch.Tensor | Perceptual distance of shape [B, 1, 1, 1]; lower values indicate higher similarity |
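Since the output keeps singleton channel and spatial dimensions, callers usually reduce it before logging or backpropagation. A small sketch, using a random tensor as a stand-in for the return value of LPIPS.forward:

```python
import torch

# Stand-in for the output of LPIPS.forward on a batch of 4 image pairs.
distance = torch.rand(4, 1, 1, 1)

per_sample = distance.flatten()   # shape [4]: one distance per image pair
batch_loss = distance.mean()      # 0-dim tensor, usable as a loss term
```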
Usage Examples
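import torch
from sat.sgm.modules.autoencoding.lpips.loss.lpips import LPIPS

# Initialize frozen LPIPS metric
lpips = LPIPS(use_dropout=True).eval()
# Placeholder inputs: two batches of RGB images scaled to [-1, 1]
image_a = torch.rand(4, 3, 256, 256) * 2 - 1
image_b = torch.rand(4, 3, 256, 256) * 2 - 1
# Compute perceptual distance between two images
distance = lpips(image_a, image_b)  # shape: [B, 1, 1, 1]
print(f"Perceptual distance: {distance.mean().item():.4f}")
# Use as a loss function during training
# (an L1 reconstruction term is used here purely as a placeholder)
reconstruction_loss = torch.nn.functional.l1_loss(image_a, image_b)
perceptual_loss = distance.mean()
total_loss = reconstruction_loss + 0.5 * perceptual_loss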