Principle:Zai org CogVideo Structural Similarity

From Leeroopedia


Knowledge Sources
Domains: Image_Quality_Assessment, Loss_Functions, Computer_Vision
Last Updated: 2026-02-10 00:00 GMT

Overview

Structural Similarity (SSIM) quantifies image quality by comparing local patterns of luminance, contrast, and structure, providing a perceptually meaningful alternative to pixel-wise error metrics.

Description

The Structural Similarity Index Measure (SSIM) was designed to model the human visual system's sensitivity to structural information in images. Unlike simple metrics such as Mean Squared Error (MSE) or Peak Signal-to-Noise Ratio (PSNR), which treat all pixel differences equally, SSIM evaluates three perceptual components independently:

  • Luminance: Compares the mean intensities of local patches, capturing overall brightness similarity.
  • Contrast: Compares the standard deviations of local patches, capturing the dynamic range of local intensity variations.
  • Structure: Compares the normalized patterns (after removing mean and scaling by standard deviation), capturing the correlation of local structures.

These components are combined multiplicatively into a single scalar between -1 and 1, where 1 indicates identical images. Local statistics are computed using a Gaussian-weighted window (typically 11x11 with sigma=1.5) that smoothly weights contributions from neighboring pixels.
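The Gaussian window described above can be sketched in a few lines. This is a minimal pure-Python illustration, assuming the window is built as the outer product of a normalized 1-D Gaussian with itself; the function name `gaussian_window` is hypothetical and not taken from any particular library.

```python
import math

def gaussian_window(size=11, sigma=1.5):
    """Return a size x size Gaussian window whose weights sum to 1."""
    half = (size - 1) / 2.0
    # 1-D Gaussian profile centred on the middle of the window.
    g = [math.exp(-((i - half) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(g)
    g = [v / total for v in g]  # normalize so the 1-D weights sum to 1
    # The 2-D window is the outer product of the 1-D profile with itself.
    return [[gi * gj for gj in g] for gi in g]

window = gaussian_window()  # the typical 11x11, sigma=1.5 configuration
```

Because the weights sum to 1, a weighted sum of pixel values under this window is a proper local mean, with the centre pixel contributing most and the corners least.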

Multi-Scale SSIM (MS-SSIM) extends single-scale SSIM by evaluating structural similarity at multiple resolutions. The image pair is repeatedly low-pass filtered and downsampled by a factor of 2, and the contrast and structure terms are computed at every scale. Their product, often abbreviated CS, captures fine detail at the original resolution and progressively lower-frequency patterns at coarser scales. The luminance term is evaluated only at the coarsest scale, so the full SSIM (luminance included) appears only there. The final MS-SSIM score is a weighted product across scales, using empirically determined weights that reflect the relative importance of each resolution.
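The repeated downsampling can be illustrated with a small image pyramid. This sketch assumes 2x2 average pooling as the combined low-pass filter and downsampler, which is a common simplification rather than the only option; `downsample_by_2` and `pyramid` are hypothetical helper names.

```python
def downsample_by_2(img):
    """Halve each dimension by averaging non-overlapping 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * i][2 * j] + img[2 * i][2 * j + 1]
             + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]

def pyramid(img, scales=5):
    """Return [original, 1/2, 1/4, ...] with `scales` levels in total."""
    levels = [img]
    for _ in range(scales - 1):
        levels.append(downsample_by_2(levels[-1]))
    return levels

# Synthetic 32x32 test image (values are arbitrary).
image = [[float((r + c) % 7) for c in range(32)] for r in range(32)]
levels = pyramid(image)
sizes = [(len(level), len(level[0])) for level in levels]
```

In an MS-SSIM computation, both images of the pair would be passed through the same pyramid and compared level by level. Note that the smallest level must still be large enough to hold the chosen window, which this toy 32x32 input does not satisfy for an 11x11 window.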

Usage

Use SSIM and MS-SSIM to evaluate image and video generation quality, to compare frame interpolation results against ground truth, and as a perceptually motivated training loss. MS-SSIM is preferred when images contain detail at varying scales.

Theoretical Basis

For two image patches x and y, the SSIM components are defined as:

Luminance comparison: l(x, y) = (2 * mu_x * mu_y + C1) / (mu_x^2 + mu_y^2 + C1)

Contrast comparison: c(x, y) = (2 * sigma_x * sigma_y + C2) / (sigma_x^2 + sigma_y^2 + C2)

Structure comparison: s(x, y) = (sigma_xy + C3) / (sigma_x * sigma_y + C3)

where:

  • mu_x, mu_y are local weighted means
  • sigma_x, sigma_y are local weighted standard deviations
  • sigma_xy is the local weighted cross-covariance
  • C1 = (K1 * L)^2, C2 = (K2 * L)^2, C3 = C2 / 2 are stability constants
  • K1 = 0.01, K2 = 0.03, and L is the dynamic range
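The three comparison functions can be evaluated directly from the patch statistics. The following is a minimal sketch, assuming intensities scaled to [0, 1] (so L = 1.0 by default) and taking variances rather than standard deviations as inputs; the function name `ssim_components` is hypothetical.

```python
import math

def ssim_components(mu_x, mu_y, var_x, var_y, cov_xy, L=1.0, K1=0.01, K2=0.03):
    """Return (luminance, contrast, structure) for one pair of patches."""
    C1 = (K1 * L) ** 2
    C2 = (K2 * L) ** 2
    C3 = C2 / 2.0
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    l = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    c = (2 * sd_x * sd_y + C2) / (var_x + var_y + C2)
    s = (cov_xy + C3) / (sd_x * sd_y + C3)
    return l, c, s

# Identical statistics: every component is (numerically) 1.
same = ssim_components(0.5, 0.5, 0.04, 0.04, 0.04)
# Same contrast and structure, but y is brighter: only luminance drops.
brighter = ssim_components(0.5, 0.6, 0.04, 0.04, 0.04)
```

Separating the components this way makes it easy to see which aspect of a distortion (brightness shift, contrast change, or structural change) is responsible for a lower score.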

The full SSIM combines these as:

SSIM(x, y) = l(x, y) * c(x, y) * s(x, y)

With C3 = C2 / 2, the contrast and structure terms share a common factor that cancels, so the product simplifies to:

SSIM(x, y) = (2*mu_x*mu_y + C1)(2*sigma_xy + C2) / ((mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2))
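The equivalence of the product form and the simplified form can be checked numerically. This is an illustrative sketch with arbitrary patch statistics and L = 1; `ssim_product` and `ssim_simplified` are hypothetical names.

```python
import math

def ssim_product(mu_x, mu_y, var_x, var_y, cov_xy, C1, C2):
    """SSIM as l * c * s, with C3 = C2 / 2."""
    C3 = C2 / 2.0
    sd_prod = math.sqrt(var_x) * math.sqrt(var_y)
    l = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    c = (2 * sd_prod + C2) / (var_x + var_y + C2)
    s = (cov_xy + C3) / (sd_prod + C3)
    return l * c * s

def ssim_simplified(mu_x, mu_y, var_x, var_y, cov_xy, C1, C2):
    """SSIM in the two-factor simplified form."""
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    )

C1, C2 = 0.01 ** 2, 0.03 ** 2            # K1 = 0.01, K2 = 0.03, L = 1
stats = (0.4, 0.55, 0.02, 0.03, 0.015)   # arbitrary but valid statistics
a = ssim_product(*stats, C1, C2)
b = ssim_simplified(*stats, C1, C2)
```

The two values agree up to floating-point rounding. The simplified form is usually preferred in practice because it avoids the square roots and needs only means, variances, and the cross-covariance.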

Multi-Scale SSIM at M scales with weights w_1, ..., w_M, where scale 1 is the original resolution and scale M is the coarsest:

MS-SSIM(x, y) = l_M(x, y)^(w_M) * prod_{j=1}^{M} [c_j(x, y) * s_j(x, y)]^(w_j)

The standard 5-scale weights, ordered from the finest scale to the coarsest, are: [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]. They were determined empirically in a psychophysical experiment that calibrated the relative perceptual importance of each scale.
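Given per-scale values, the weighted product can be sketched as follows. The inputs are assumed to be already averaged over each scale, and negative contrast-structure values are clamped to zero before exponentiation; that clamp is an assumption of this sketch (it avoids complex results from fractional powers) and is not part of the formula itself. The name `ms_ssim_from_scales` is hypothetical.

```python
WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]  # finest to coarsest

def ms_ssim_from_scales(cs_per_scale, luminance_coarsest, weights=WEIGHTS):
    """Combine per-scale c_j * s_j values and the coarsest-scale luminance.

    cs_per_scale[j] holds c * s at scale j + 1; the last entry is the
    coarsest scale, which is also where the luminance term is taken.
    """
    if len(cs_per_scale) != len(weights):
        raise ValueError("need one contrast-structure value per weight")
    score = luminance_coarsest ** weights[-1]
    for cs, w in zip(cs_per_scale, weights):
        # Clamp to zero (sketch assumption) so the fractional power is real.
        score *= max(cs, 0.0) ** w
    return score

score_same = ms_ssim_from_scales([1.0] * 5, 1.0)
score_degraded = ms_ssim_from_scales([0.90, 0.95, 0.97, 0.99, 0.99], 0.98)
```

Because the middle scales carry the largest weights, a loss of similarity there lowers the score more than the same loss at the finest scale.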

Local statistics are computed via convolution with a Gaussian window w:

mu_x = sum(w * x)

sigma_x^2 = sum(w * x^2) - mu_x^2

sigma_xy = sum(w * x * y) - mu_x * mu_y
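For a single window position, these weighted sums can be written out directly. The sketch below uses a tiny 2x2 patch with uniform weights purely for readability; a real computation would slide a Gaussian window over the whole image. The helper name `weighted_stats` is hypothetical.

```python
def weighted_stats(x, y, w):
    """Weighted means, variances, and cross-covariance of two patches.

    x, y, and w are 2-D lists of equal shape; the weights in w sum to 1.
    """
    fx = [v for row in x for v in row]
    fy = [v for row in y for v in row]
    fw = [v for row in w for v in row]
    mu_x = sum(wi * xi for wi, xi in zip(fw, fx))
    mu_y = sum(wi * yi for wi, yi in zip(fw, fy))
    var_x = sum(wi * xi * xi for wi, xi in zip(fw, fx)) - mu_x ** 2
    var_y = sum(wi * yi * yi for wi, yi in zip(fw, fy)) - mu_y ** 2
    cov_xy = sum(wi * xi * yi for wi, xi, yi in zip(fw, fx, fy)) - mu_x * mu_y
    return mu_x, mu_y, var_x, var_y, cov_xy

x = [[0.0, 1.0], [1.0, 0.0]]
y = [[0.0, 0.5], [0.5, 0.0]]      # same pattern as x at half the contrast
w = [[0.25, 0.25], [0.25, 0.25]]  # uniform weights, for illustration only
mu_x, mu_y, var_x, var_y, cov_xy = weighted_stats(x, y, w)
```

Here y is a scaled copy of x, so the cross-covariance equals the product of the two standard deviations: the patches are perfectly correlated and differ only in mean and contrast.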

For use as a training loss, the dissimilarity form is preferred:

L_SSIM = (1 - SSIM) / 2

This maps the SSIM range [-1, 1] to [0, 1], with 0 representing perfect similarity.
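As a small illustration, the loss can be computed from a collection of local SSIM scores. This sketch assumes the local scores are averaged first and the mean is then mapped to [0, 1]; the function name `ssim_loss` is hypothetical.

```python
def ssim_loss(ssim_values):
    """Average local SSIM scores, then map the mean from [-1, 1] to [0, 1]."""
    mean_ssim = sum(ssim_values) / len(ssim_values)
    return (1.0 - mean_ssim) / 2.0

perfect = ssim_loss([1.0, 1.0, 1.0])  # identical images
worst = ssim_loss([-1.0, -1.0])       # fully anti-correlated patches
mixed = ssim_loss([1.0, 0.0])         # mean SSIM of 0.5
```

Since the mapping is affine, averaging before or after the mapping gives the same result. In a training setting the same arithmetic would be applied to tensors from an automatic-differentiation framework so that gradients can flow through the SSIM computation.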
