Principle:Zai org CogVideo Structural Similarity
| Knowledge Sources | |
|---|---|
| Domains | Image_Quality_Assessment, Loss_Functions, Computer_Vision |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Structural Similarity (SSIM) quantifies image quality by comparing local patterns of luminance, contrast, and structure, providing a perceptually meaningful alternative to pixel-wise error metrics.
Description
The Structural Similarity Index Measure (SSIM) was designed to model the human visual system's sensitivity to structural information in images. Unlike simple metrics such as Mean Squared Error (MSE) or Peak Signal-to-Noise Ratio (PSNR), which treat all pixel differences equally, SSIM evaluates three perceptual components independently:
- Luminance: Compares the mean intensities of local patches, capturing overall brightness similarity.
- Contrast: Compares the standard deviations of local patches, capturing the dynamic range of local intensity variations.
- Structure: Compares the normalized patterns (after removing mean and scaling by standard deviation), capturing the correlation of local structures.
These components are combined multiplicatively into a local score between -1 and 1, where 1 is reached only when the two patches are identical. Local statistics are computed using a Gaussian-weighted window (typically 11x11 with sigma=1.5) that smoothly weights contributions from neighboring pixels, and the resulting map of local scores is averaged to give a single value per image pair.
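The window described above can be sketched as follows. This is a minimal, dependency-free illustration; the function names `gaussian_window_1d` and `gaussian_window_2d` are hypothetical and not taken from any particular library.

```python
import math

def gaussian_window_1d(size=11, sigma=1.5):
    """Return a 1D Gaussian kernel of length `size`, normalized to sum to 1."""
    center = (size - 1) / 2.0
    raw = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(raw)
    return [v / total for v in raw]

def gaussian_window_2d(size=11, sigma=1.5):
    """Outer product of the 1D kernel with itself; the 2D weights also sum to 1."""
    g = gaussian_window_1d(size, sigma)
    return [[gi * gj for gj in g] for gi in g]

# The typical 11x11, sigma=1.5 window mentioned in the text.
window = gaussian_window_2d()
```

Because the 2D Gaussian is separable, practical implementations usually apply the 1D kernel along rows and then columns rather than building the full 2D window.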
Multi-Scale SSIM (MS-SSIM) extends single-scale SSIM by evaluating structural similarity at multiple resolutions. The image pair is repeatedly low-pass filtered and downsampled by a factor of 2, and the contrast and structure terms (the CS values) are computed at each scale. Finer scales capture high-frequency detail, while coarser scales capture low-frequency structural patterns; luminance is compared only at the coarsest scale. The final MS-SSIM score is a weighted product across scales, using empirically determined weights that reflect the relative importance of each resolution.
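The repeated downsampling can be illustrated with a small image pyramid. The sketch below assumes 2x2 average pooling as the filter-and-downsample step, which is one common choice and not something this page specifies; `downsample_2x` and `build_pyramid` are hypothetical helper names.

```python
def downsample_2x(img):
    """Halve height and width by averaging non-overlapping 2x2 blocks.

    `img` is a list of rows of floats; a trailing odd row or column is dropped.
    """
    h = len(img) // 2 * 2
    w = len(img[0]) // 2 * 2
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]

def build_pyramid(img, num_scales=5):
    """Return [scale 1 (original resolution), ..., scale M (coarsest)]."""
    pyramid = [img]
    for _ in range(num_scales - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid

# Toy 16x16 image with three scales: 16x16, 8x8, 4x4.
image = [[float((r * 16 + c) % 7) for c in range(16)] for r in range(16)]
pyramid = build_pyramid(image, num_scales=3)
sizes = [(len(p), len(p[0])) for p in pyramid]
```

With five scales and an 11x11 window, the coarsest level must still be larger than the window, so the input needs to be at least a few hundred pixels on its shorter side.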
Usage
Use SSIM and MS-SSIM for evaluating image and video generation quality, comparing frame interpolation results against ground truth, and as a perceptually motivated training loss. MS-SSIM is preferred when images contain content at varying scales of detail.
Theoretical Basis
For two image patches x and y, the SSIM components are defined as:
Luminance comparison:
l(x, y) = (2 * mu_x * mu_y + C1) / (mu_x^2 + mu_y^2 + C1)
Contrast comparison:
c(x, y) = (2 * sigma_x * sigma_y + C2) / (sigma_x^2 + sigma_y^2 + C2)
Structure comparison:
s(x, y) = (sigma_xy + C3) / (sigma_x * sigma_y + C3)
where:
- mu_x, mu_y are local weighted means
- sigma_x, sigma_y are local weighted standard deviations
- sigma_xy is the local weighted cross-covariance
- C1 = (K1 * L)^2, C2 = (K2 * L)^2, C3 = C2 / 2 are stability constants
- K1 = 0.01, K2 = 0.03, and L is the dynamic range of the pixel values (for example, 255 for 8-bit images or 1 for images normalized to [0, 1])
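The three comparison functions translate directly into code once the local statistics are known. The sketch below takes those statistics as plain numbers; the function name `ssim_components` and the default `dynamic_range=1.0` (images scaled to [0, 1]) are assumptions made for illustration.

```python
def ssim_components(mu_x, mu_y, sigma_x, sigma_y, sigma_xy,
                    dynamic_range=1.0, k1=0.01, k2=0.03):
    """Return (luminance, contrast, structure) for one pair of patches."""
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    c3 = c2 / 2.0
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sigma_x * sigma_y + c2) / (sigma_x ** 2 + sigma_y ** 2 + c2)
    structure = (sigma_xy + c3) / (sigma_x * sigma_y + c3)
    return luminance, contrast, structure

# Identical statistics: every component equals 1 (up to rounding).
same = ssim_components(0.5, 0.5, 0.2, 0.2, 0.04)

# A brighter second patch lowers only the luminance term.
brighter = ssim_components(0.5, 0.7, 0.2, 0.2, 0.04)

# Anti-correlated patches drive the structure term negative.
inverted = ssim_components(0.5, 0.5, 0.2, 0.2, -0.04)
```

The constants matter mainly in flat or dark regions, where the means or variances approach zero and the ratios would otherwise be unstable.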
The full SSIM combines these as:
SSIM(x, y) = l(x, y) * c(x, y) * s(x, y)
which simplifies to:
SSIM(x, y) = (2*mu_x*mu_y + C1)(2*sigma_xy + C2) / ((mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2))
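The simplification depends on the choice C3 = C2 / 2, which lets the denominator of the structure term cancel against the numerator of the contrast term. Written out for the contrast and structure factors (same symbols as above, in LaTeX notation):

```latex
\begin{align*}
c(x,y)\,s(x,y)
  &= \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
     \cdot \frac{\sigma_{xy} + C_2/2}{\sigma_x\sigma_y + C_2/2} \\
  &= \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
     \cdot \frac{2\sigma_{xy} + C_2}{2\sigma_x\sigma_y + C_2} \\
  &= \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
\end{align*}
```

Multiplying this result by l(x, y) gives the two-factor form, which needs only means, variances, and the cross-covariance and avoids computing square roots for sigma_x and sigma_y.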
Multi-Scale SSIM at M scales with weights w_1, ..., w_M, where scale 1 is the original resolution and scale M is the coarsest:
MS-SSIM(x, y) = l_M(x, y)^(w_M) * prod_{j=1}^{M} [c_j(x, y) * s_j(x, y)]^(w_j)
The standard 5-scale weights, listed from the finest scale (j = 1) to the coarsest (j = 5), are [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]. They were calibrated in a psychophysical experiment on the relative perceptual importance of each scale.
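The weighted product can be sketched as below, taking the per-scale contrast-structure values and the coarsest-scale luminance as given inputs. The function name `combine_ms_ssim` is hypothetical, and clamping negative values to zero before exponentiation is an assumption (a fractional power of a negative number is not real-valued), not something this page prescribes.

```python
MS_SSIM_WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

def combine_ms_ssim(cs_per_scale, luminance_coarsest, weights=MS_SSIM_WEIGHTS):
    """Combine per-scale values into one MS-SSIM score.

    cs_per_scale[j] holds c * s at scale j + 1 (finest first, coarsest last);
    luminance_coarsest is l at the coarsest scale.
    """
    if len(cs_per_scale) != len(weights):
        raise ValueError("need one contrast-structure value per weight")
    # Assumption: clamp to zero so fractional exponents stay real-valued.
    score = max(luminance_coarsest, 0.0) ** weights[-1]
    for cs, w in zip(cs_per_scale, weights):
        score *= max(cs, 0.0) ** w
    return score

# Identical images: every term is 1, so the product is 1.
perfect = combine_ms_ssim([1.0] * 5, 1.0)

# Illustrative (made-up) per-scale values for a mildly degraded image.
degraded = combine_ms_ssim([0.80, 0.90, 0.95, 0.97, 0.99], 0.98)
```

Because the middle scales carry the largest weights, degradation at those resolutions lowers the score more than the same degradation at the finest scale.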
Local statistics are computed via convolution with a Gaussian window w whose weights are normalized to sum to 1, with each sum taken over the pixels under the window:
mu_x = sum(w * x)
sigma_x^2 = sum(w * x^2) - mu_x^2
sigma_xy = sum(w * x * y) - mu_x * mu_y
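For a single window position, these sums and the simplified SSIM expression can be sketched as follows. The 3x3 binomial window is a small stand-in for the usual 11x11 Gaussian, and the names `weighted_stats` and `ssim_patch` are hypothetical.

```python
def weighted_stats(x, y, w):
    """Weighted means, variances, and cross-covariance of two patches.

    x, y, w are same-sized lists of rows; the weights in w must sum to 1.
    """
    mu_x = mu_y = e_xx = e_yy = e_xy = 0.0
    for row_x, row_y, row_w in zip(x, y, w):
        for xv, yv, wv in zip(row_x, row_y, row_w):
            mu_x += wv * xv
            mu_y += wv * yv
            e_xx += wv * xv * xv
            e_yy += wv * yv * yv
            e_xy += wv * xv * yv
    var_x = e_xx - mu_x ** 2
    var_y = e_yy - mu_y ** 2
    cov_xy = e_xy - mu_x * mu_y
    return mu_x, mu_y, var_x, var_y, cov_xy

def ssim_patch(x, y, w, dynamic_range=1.0, k1=0.01, k2=0.03):
    """SSIM for one window position, using the simplified two-factor form."""
    mu_x, mu_y, var_x, var_y, cov_xy = weighted_stats(x, y, w)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# 3x3 binomial weights (sum to 1), standing in for a Gaussian window.
w = [[v / 16.0 for v in row] for row in ([1, 2, 1], [2, 4, 2], [1, 2, 1])]
x = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
y = [[1.0 - v for v in row] for row in x]  # intensity-inverted copy of x

same_score = ssim_patch(x, x, w)
inverted_score = ssim_patch(x, y, w)
```

Sliding the window over every pixel position, which is what the convolution does, turns these per-patch values into the SSIM map that is then averaged.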
For use as a training loss, the dissimilarity form is preferred:
L_SSIM = (1 - SSIM) / 2
This maps the SSIM range [-1, 1] to [0, 1], with 0 representing perfect similarity.
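As a minimal sketch, the loss can be computed from a map of local SSIM values by averaging first and then applying the mapping. The function name `ssim_loss` is hypothetical; in a real training setup the same arithmetic would be done on tensors in an autodiff framework so that gradients flow back to the model.

```python
def ssim_loss(ssim_map):
    """Average a 2D map of local SSIM values and map the mean to [0, 1]."""
    values = [v for row in ssim_map for v in row]
    mean_ssim = sum(values) / len(values)
    return (1.0 - mean_ssim) / 2.0

# Perfect match everywhere gives 0; the worst case gives 1.
best = ssim_loss([[1.0, 1.0], [1.0, 1.0]])
worst = ssim_loss([[-1.0, -1.0], [-1.0, -1.0]])
mixed = ssim_loss([[1.0, 0.0], [0.5, 0.5]])
```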