Principle:Zai org CogVideo Structural Similarity

Knowledge Sources	Image Quality Assessment: From Error Visibility to Structural Similarity; Multi-Scale Structural Similarity for Image Quality Assessment
Domains	Image_Quality_Assessment, Loss_Functions, Computer_Vision
Last Updated	2026-02-10 00:00 GMT

Overview

Structural Similarity (SSIM) quantifies image quality by comparing local patterns of luminance, contrast, and structure, providing a perceptually meaningful alternative to pixel-wise error metrics.

Description

The Structural Similarity Index Measure (SSIM) was designed to model the human visual system's sensitivity to structural information in images. Unlike simple metrics such as Mean Squared Error (MSE) or Peak Signal-to-Noise Ratio (PSNR), which treat all pixel differences equally, SSIM evaluates three perceptual components independently:

Luminance: Compares the mean intensities of local patches, capturing overall brightness similarity.
Contrast: Compares the standard deviations of local patches, capturing the dynamic range of local intensity variations.
Structure: Compares the normalized patterns (after removing mean and scaling by standard deviation), capturing the correlation of local structures.

These components are combined multiplicatively into a single scalar between -1 and 1, where 1 indicates identical images. Local statistics are computed using a Gaussian-weighted window (typically 11x11 with sigma=1.5) that smoothly weights contributions from neighboring pixels.
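As a minimal sketch of the window described above, the following builds an 11x11 Gaussian window with sigma=1.5 using only the standard library. The function name gaussian_window is hypothetical, and normalizing the window to sum to 1 is an assumption (it is the usual convention, so that the weighted mean of a constant patch equals that constant).

```python
import math

def gaussian_window(size=11, sigma=1.5):
    """Return a size x size Gaussian window normalized to sum to 1."""
    center = (size - 1) / 2.0
    # 1-D Gaussian profile; the 2-D window is its outer product (separable).
    g = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(g)
    g = [v / total for v in g]
    return [[gi * gj for gj in g] for gi in g]

# Default parameters match the typical 11x11, sigma=1.5 setting.
window = gaussian_window()
```

The largest weight sits at the center of the window and the weights fall off smoothly toward the corners, which is what lets neighboring pixels contribute gradually rather than through a hard patch boundary.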

Multi-Scale SSIM (MS-SSIM) extends single-scale SSIM by evaluating structural similarity at multiple resolutions. The original image pair is repeatedly downsampled by a factor of 2, and the SSIM components are computed at each scale. The contrast and structure terms, often combined into a single contrast-structure (CS) value, are evaluated at every scale, while the luminance term is evaluated only at the coarsest scale. The final MS-SSIM score is a weighted product across scales, using empirically determined weights that reflect the relative importance of each resolution.

Usage

Use SSIM and MS-SSIM for evaluating image and video generation quality, comparing frame interpolation results against ground truth, and as a perceptually-motivated training loss. MS-SSIM is preferred when images contain content at varying scales of detail.

Theoretical Basis

For two image patches x and y, the SSIM components are defined as:

Luminance comparison: l(x, y) = (2 * mu_x * mu_y + C1) / (mu_x^2 + mu_y^2 + C1)

Contrast comparison: c(x, y) = (2 * sigma_x * sigma_y + C2) / (sigma_x^2 + sigma_y^2 + C2)

Structure comparison: s(x, y) = (sigma_xy + C3) / (sigma_x * sigma_y + C3)

where:

mu_x, mu_y are local weighted means
sigma_x, sigma_y are local weighted standard deviations
sigma_xy is the local weighted cross-covariance
C1 = (K1 * L)^2, C2 = (K2 * L)^2, C3 = C2 / 2 are stability constants
K1 = 0.01, K2 = 0.03, and L is the dynamic range
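
As a small worked example of the constants above, the sketch below evaluates C1, C2, and C3 for two common dynamic ranges. The helper name ssim_constants is hypothetical, and the two ranges (255 for 8-bit images, 1.0 for floats in [0, 1]) are illustrative assumptions.

```python
K1, K2 = 0.01, 0.03

def ssim_constants(dynamic_range):
    """Return (C1, C2, C3) for a given dynamic range L."""
    c1 = (K1 * dynamic_range) ** 2
    c2 = (K2 * dynamic_range) ** 2
    c3 = c2 / 2.0
    return c1, c2, c3

c1_8bit, c2_8bit, c3_8bit = ssim_constants(255.0)  # 8-bit images
c1_unit, c2_unit, c3_unit = ssim_constants(1.0)    # floats in [0, 1]
```

For L = 255 this gives roughly C1 = 6.5025, C2 = 58.5225, and C3 = 29.26125. Because the constants scale with L^2, L has to match the actual value range of the inputs, otherwise the stabilization is either negligible or dominant.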

The full SSIM combines these as:

SSIM(x, y) = l(x, y) * c(x, y) * s(x, y)

which, because C3 = C2 / 2 lets the contrast and structure terms collapse into one factor, simplifies to:

SSIM(x, y) = [(2*mu_x*mu_y + C1) * (2*sigma_xy + C2)] / [(mu_x^2 + mu_y^2 + C1) * (sigma_x^2 + sigma_y^2 + C2)]
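
The equivalence of the product form and the simplified form can be checked numerically. The sketch below is illustrative only: the function name ssim_forms and the sample patch values are hypothetical, and it assumes uniform weights over a flattened patch instead of a Gaussian window.

```python
import math

def ssim_forms(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Return (product form, simplified form) of SSIM for two flat patches.

    Assumption: uniform weights over the patch, not a Gaussian window.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    c3 = c2 / 2.0

    lum = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    con = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    struct = (cov_xy + c3) / (sd_x * sd_y + c3)
    product = lum * con * struct

    simplified = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return product, simplified

# Hypothetical 8-pixel patches: y is a mildly perturbed copy of x.
patch_x = [52, 55, 61, 59, 79, 61, 76, 61]
patch_y = [50, 58, 60, 62, 75, 66, 70, 64]
product, simplified = ssim_forms(patch_x, patch_y)
```

The two values agree up to floating-point rounding. The simplified form is usually the one implemented, since it avoids the square roots needed for sigma_x and sigma_y.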

Multi-Scale SSIM at M scales with weights w_1, ..., w_M:

MS-SSIM(x, y) = l_M(x, y)^(w_M) * prod_{j=1}^{M} [ c_j(x, y)^(w_j) * s_j(x, y)^(w_j) ]

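The standard 5-scale weights, ordered from the finest scale (j = 1) to the coarsest (j = 5), are [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]. They were obtained empirically in the MS-SSIM paper from a subjective experiment that calibrated the relative perceptual importance of each scale.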

Local statistics are computed via convolution with a Gaussian window w:

mu_x = sum(w * x)
sigma_x^2 = sum(w * x^2) - mu_x^2
sigma_xy = sum(w * x * y) - mu_x * mu_y
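
For a single window position, these weighted statistics reduce to a few sums. The sketch below evaluates them on one flattened 5x5 patch pair. The function names, the small window size, and the patch values are hypothetical choices for illustration; a full implementation would slide the window over the image, typically as a convolution.

```python
import math

def gaussian_weights(size, sigma):
    """Flattened (row-major) size x size Gaussian weights that sum to 1."""
    center = (size - 1) / 2.0
    g = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(g)
    g = [v / total for v in g]
    return [gi * gj for gi in g for gj in g]

def local_stats(px, py, w):
    """Weighted means, variances, and covariance at one window position."""
    mu_x = sum(wi * a for wi, a in zip(w, px))
    mu_y = sum(wi * b for wi, b in zip(w, py))
    var_x = sum(wi * a * a for wi, a in zip(w, px)) - mu_x ** 2
    var_y = sum(wi * b * b for wi, b in zip(w, py)) - mu_y ** 2
    cov_xy = sum(wi * a * b for wi, a, b in zip(w, px, py)) - mu_x * mu_y
    return mu_x, mu_y, var_x, var_y, cov_xy

w = gaussian_weights(5, 1.5)
patch_x = [float((i * 7) % 23 + 100) for i in range(25)]  # hypothetical values
patch_y = [v + float(i % 3) for i, v in enumerate(patch_x)]
mu_x, mu_y, var_x, var_y, cov_xy = local_stats(patch_x, patch_y, w)
```

Because the variance is computed as a difference of two similar quantities, it can come out very slightly negative on flat patches due to rounding. The constants C1 and C2 keep the SSIM denominators positive in that case.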

For use as a training loss, the dissimilarity form is preferred:

L_SSIM = (1 - SSIM) / 2

This maps the SSIM range [-1, 1] to [0, 1], with 0 representing perfect similarity.
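
The mapping is a one-line function. The name ssim_loss is hypothetical; in a training setup the same expression would be applied to a differentiable SSIM value from whatever framework is in use.

```python
def ssim_loss(ssim_value):
    """Map an SSIM value in [-1, 1] to a dissimilarity in [0, 1]."""
    return (1.0 - ssim_value) / 2.0

loss_identical = ssim_loss(1.0)   # 0.0: perfect similarity
loss_opposite = ssim_loss(-1.0)   # 1.0: maximal dissimilarity
```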

Related Pages

Implementation:Zai_org_CogVideo_SSIM_MS_SSIM
