Principle:Zai org CogVideo Learned Perceptual Similarity

Knowledge Sources	The Unreasonable Effectiveness of Deep Features as a Perceptual Metric
Domains	Perceptual_Loss, Image_Quality_Assessment
Last Updated	2026-02-10 00:00 GMT

Overview

Learned Perceptual Image Patch Similarity (LPIPS) measures the perceptual distance between two images by comparing their deep feature representations through a pretrained network with learned per-layer weighting, producing scores that align closely with human visual judgment.

Description

Traditional pixel-level metrics such as MSE, PSNR, and even structural similarity (SSIM) often fail to capture the perceptual differences that humans notice between images. Two images can have low pixel-level distance yet appear very different to a human observer (e.g., a slight spatial shift), or have high pixel-level distance yet appear nearly identical (e.g., imperceptible texture changes).

LPIPS addresses this by leveraging the internal representations of deep neural networks trained on image classification (typically VGG or AlexNet). The key insight is that intermediate layers of these networks encode progressively more abstract visual features -- from edges and textures in early layers to shapes and semantic content in deeper layers. By comparing images in this multi-scale feature space rather than pixel space, the metric captures perceptual differences at multiple levels of abstraction.

The "learned" aspect is critical: rather than simply computing Euclidean distance in feature space (which already improves over pixel metrics), LPIPS trains a set of linear weights that optimally combine distances from different network layers. These weights are calibrated against a large dataset of human perceptual judgments (the BAPPS dataset) using a two-alternative forced choice paradigm.

Usage

Use LPIPS whenever a loss function or evaluation metric must correlate with human visual perception. Common applications include:

Training generative models -- as a perceptual loss term to encourage visually pleasing outputs
Evaluating image reconstruction -- as a quality metric that better reflects human judgment than PSNR/SSIM
Image retrieval and comparison -- for finding visually similar images in a perceptually meaningful way

Theoretical Basis

Given two images $x, x_{0}$ , the LPIPS distance is computed as follows:

Step 1: Feature extraction. Pass both images through a pretrained network $ℱ$ (e.g., VGG16) and extract activations from $L$ selected layers:

${\hat{y}}^{l}, {\hat{y}}_{0}^{l} = ℱ^{l} (x), ℱ^{l} (x_{0}) for l = 1, \dots, L$

Step 2: Channel normalization. Unit-normalize the feature activations along the channel dimension:

${\hat{y}}_{norm}^{l} (h, w) = \frac{{\hat{y}}^{l} (h, w)}{‖ {\hat{y}}^{l} (h, w) ‖_{2} + ϵ}$

Step 3: Squared difference. Compute the element-wise squared difference:

$d^{l} (h, w) = ({\hat{y}}_{norm}^{l} (h, w) - {\hat{y}}_{0, norm}^{l} (h, w))^{2}$

Step 4: Learned linear weighting. Apply learned per-layer weights $w_{l}$ (implemented as 1x1 convolutions) and spatially average:

$LPIPS (x, x_{0}) = \sum_{l = 1}^{L} \frac{1}{H_{l} W_{l}} \sum_{h, w} w_{l}^{T} \cdot d^{l} (h, w)$

The weights $w_{l}$ are trained to minimize a loss that aligns the metric's ordering of image pairs with human perceptual judgments. After training, all network parameters (both the backbone and the linear weights) are frozen for use as a fixed metric.

For VGG16 specifically, the five layers correspond to activations after relu1_2, relu2_2, relu3_3, relu4_3, and relu5_3, with channel dimensions [64, 128, 256, 512, 512].

Related Pages

Implementation:Zai_org_CogVideo_LPIPS

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment