Principle:Zai org CogVideo Laplacian Pyramid Loss

Knowledge Sources	The Laplacian Pyramid as a Compact Image Code; Loss Functions for Image Restoration with Neural Networks
Domains	Loss_Functions, Image_Processing, Multi_Scale_Analysis
Last Updated	2026-02-10 00:00 GMT

Overview

Laplacian pyramid loss computes image reconstruction error at multiple spatial frequency bands by decomposing images into a multi-scale pyramid and summing per-level differences.

Description

The Laplacian pyramid is a multi-resolution image representation that decomposes an image into a set of bandpass-filtered components at different scales. Each level of the pyramid captures detail at a particular spatial frequency band: the finest level captures high-frequency edges and textures, while coarser levels capture progressively lower-frequency structural information.

As a loss function, the Laplacian pyramid loss compares predicted and target images at each decomposition level independently. This provides several advantages over a simple pixel-wise loss:

Frequency-aware supervision: Errors in fine detail (high frequency) and coarse structure (low frequency) are penalized in separate terms, which discourages the network from trading one off against the other.
Perceptual relevance: The multi-scale decomposition loosely mirrors the human visual system's sensitivity to different spatial frequencies.
Gradient balance: Because each scale contributes its own term, gradients tend to be spread more evenly across frequency bands, which mitigates the tendency of pixel-wise losses to be dominated by low-frequency content.

The construction process for one level involves: (1) Gaussian smoothing of the current image with a fixed kernel, (2) downsampling the smoothed image by a factor of 2, (3) upsampling the result back to the current image's resolution, and (4) subtracting the upsampled version from the current image to obtain the detail band. The downsampled image then becomes the input to the next level, and the process repeats to build multiple levels.
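The four steps can be sketched in NumPy for a single-channel image. This is a minimal illustration rather than the CogVideo implementation: the helper names (conv5, upsample, decompose_once) and the reflect padding at the image borders are assumptions made for the example.

```python
import numpy as np

# 5x5 binomial approximation of a Gaussian; the weights sum to 1.
H = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
KERNEL = np.outer(H, H)


def conv5(img, kernel):
    """Same-size filtering with a 5x5 kernel (reflect padding is an assumption)."""
    rows, cols = img.shape
    padded = np.pad(img, 2, mode="reflect")
    out = np.zeros((rows, cols), dtype=float)
    for dy in range(5):
        for dx in range(5):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out


def upsample(img, shape):
    """Insert zeros between pixels, then interpolate with 4 * KERNEL."""
    up = np.zeros(shape, dtype=float)
    up[::2, ::2] = img
    return conv5(up, 4.0 * KERNEL)


def decompose_once(img):
    """Return (detail band, downsampled image) for one pyramid level."""
    smoothed = conv5(img, KERNEL)      # (1) Gaussian smoothing
    down = smoothed[::2, ::2]          # (2) downsample by a factor of 2
    up = upsample(down, img.shape)     # (3) back to this level's resolution
    detail = img - up                  # (4) subtract to get the detail band
    return detail, down


rng = np.random.default_rng(0)
image = rng.random((16, 16))
detail, down = decompose_once(image)
print(detail.shape, down.shape)  # (16, 16) (8, 8)
```

Calling decompose_once again on down would produce the next, coarser level.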

Usage

Use Laplacian pyramid loss when training image or video generation models where preserving both structural coherence and fine detail is important. It is particularly effective for frame interpolation, super-resolution, and image-to-image translation tasks.

Theoretical Basis

The Gaussian pyramid G is defined recursively:

G_0 = I (original image)
G_k = downsample(smooth(G_{k-1}))

where smooth applies a Gaussian filter and downsample reduces resolution by 2x.

The Laplacian pyramid L is the difference between successive Gaussian levels:

L_k = G_k - upsample(G_{k+1})

Each L_k captures the detail lost during the smoothing and downsampling from level k to k+1, representing a specific spatial frequency band.

The Laplacian pyramid loss between predicted image P and target image T is:

L_lap(P, T) = sum_{k=0}^{K-1} ||L_k(P) - L_k(T)||_1

where K is the number of pyramid levels (typically 3-5).
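A minimal NumPy sketch of this sum for single-channel images follows. The function names (laplacian_pyramid, lap_loss), the reflect padding, and the default of three levels are assumptions chosen for illustration, and the L1 norm is taken as a plain sum of absolute differences.

```python
import numpy as np

H = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
KERNEL = np.outer(H, H)  # 5x5 binomial kernel, sums to 1


def conv5(img, kernel):
    """Same-size filtering with a 5x5 kernel (reflect padding is an assumption)."""
    rows, cols = img.shape
    padded = np.pad(img, 2, mode="reflect")
    out = np.zeros((rows, cols), dtype=float)
    for dy in range(5):
        for dx in range(5):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out


def laplacian_pyramid(img, levels):
    """Return the detail bands L_0 ... L_{levels-1} of a 2-D image."""
    bands, current = [], img.astype(float)
    for _ in range(levels):
        down = conv5(current, KERNEL)[::2, ::2]
        up = np.zeros(current.shape)
        up[::2, ::2] = down                      # zero insertion
        bands.append(current - conv5(up, 4.0 * KERNEL))
        current = down
    return bands


def lap_loss(pred, target, levels=3):
    """Sum over levels of the L1 norm of the band differences."""
    pairs = zip(laplacian_pyramid(pred, levels), laplacian_pyramid(target, levels))
    return float(sum(np.abs(p - t).sum() for p, t in pairs))


rng = np.random.default_rng(0)
target = rng.random((32, 32))
pred = target + 0.05 * rng.standard_normal((32, 32))
print(lap_loss(target, target))       # 0.0
print(lap_loss(pred, target) > 0.0)   # True
```

In this sketch, adding a constant offset to an image leaves every detail band unchanged, so lap_loss(target + 0.5, target) is zero up to rounding: the sum covers only the band-pass levels, not the coarsest low-pass image.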

The Gaussian kernel used is derived from the binomial coefficients of order 4, yielding a 5x5 separable kernel:

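h = [1, 4, 6, 4, 1] / 16
K_gauss = h^T * h

Since h already sums to 1, so does K_gauss and no further scaling is needed; written with integer weights, K_gauss = [1, 4, 6, 4, 1]^T * [1, 4, 6, 4, 1] / 256.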
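The normalization is easy to check numerically. The short NumPy sketch below (variable names are illustrative) builds the kernel from the binomial taps and confirms that, once the 1-D taps are divided by 16, their outer product sums to 1 without any additional factor.

```python
import numpy as np

taps = np.array([1, 4, 6, 4, 1], dtype=float)  # binomial coefficients C(4, i)
h = taps / taps.sum()                          # taps.sum() is 16
k_gauss = np.outer(h, h)                       # 5x5 kernel, separable by construction

print(k_gauss.sum())                    # 1.0
print(k_gauss[2, 2] * 256)              # 36.0
print(np.linalg.matrix_rank(k_gauss))   # 1
```

A rank of 1 reflects separability: filtering with k_gauss gives the same result as filtering rows and then columns with h.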

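The upsample operation inserts zeros between pixels and convolves with 4 * K_gauss to interpolate the missing values. The factor 4 compensates for the three out of every four samples that are zero after insertion, so the mean intensity is preserved. Because [1, 4, 6, 4, 1] is the convolution of [1, 2, 1] with itself, this is equivalent (away from the borders) to bilinear interpolation followed by a [1, 2, 1] / 4 smoothing along each axis.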
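The zero-insertion upsample can be sketched as follows. This is an illustrative NumPy version for a single-channel image; the function name and the reflect padding at the borders are assumptions, not details confirmed by the source.

```python
import numpy as np

H = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
KERNEL = np.outer(H, H)  # 5x5 binomial kernel, sums to 1


def upsample(img):
    """Double each dimension: zero insertion, then filtering with 4 * KERNEL."""
    rows, cols = img.shape
    zeros = np.zeros((2 * rows, 2 * cols), dtype=float)
    zeros[::2, ::2] = img                        # originals land on even indices
    padded = np.pad(zeros, 2, mode="reflect")    # border handling is an assumption
    out = np.zeros_like(zeros)
    for dy in range(5):
        for dx in range(5):
            out += 4.0 * KERNEL[dy, dx] * padded[dy:dy + 2 * rows, dx:dx + 2 * cols]
    return out


flat = np.full((4, 4), 3.0)
print(np.allclose(upsample(flat), 3.0))  # True
```

A constant image stays constant after upsampling, which is what the factor 4 is for: without it, the output would be a quarter as bright, since only one in four input samples to the filter is non-zero.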

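This decomposition is invertible as long as the coarsest Gaussian level G_K is kept as a low-pass residual: the original image can be exactly reconstructed by starting from G_K and iteratively upsampling and adding each Laplacian level. The detail bands and the residual together therefore account for every frequency component of the image. Note that the sum in L_lap covers only the detail bands L_0 to L_{K-1}; the residual G_K is compared only if it is added as an extra term.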
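The round trip can be verified with a small NumPy sketch. The reconstruction needs the coarsest Gaussian level as its starting point, so the hypothetical build_pyramid below returns it alongside the detail bands; helper names and reflect padding are assumptions for the example.

```python
import numpy as np

H = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
KERNEL = np.outer(H, H)  # 5x5 binomial kernel, sums to 1


def conv5(img, kernel):
    """Same-size filtering with a 5x5 kernel (reflect padding is an assumption)."""
    rows, cols = img.shape
    padded = np.pad(img, 2, mode="reflect")
    out = np.zeros((rows, cols), dtype=float)
    for dy in range(5):
        for dx in range(5):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out


def upsample(img, shape):
    """Insert zeros between pixels, then interpolate with 4 * KERNEL."""
    up = np.zeros(shape, dtype=float)
    up[::2, ::2] = img
    return conv5(up, 4.0 * KERNEL)


def build_pyramid(img, levels):
    """Return (detail bands, coarsest Gaussian level)."""
    bands, current = [], img.astype(float)
    for _ in range(levels):
        down = conv5(current, KERNEL)[::2, ::2]
        bands.append(current - upsample(down, current.shape))
        current = down
    return bands, current


def reconstruct(bands, residual):
    """Start from the residual; upsample and add each band, coarse to fine."""
    current = residual
    for band in reversed(bands):
        current = upsample(current, band.shape) + band
    return current


rng = np.random.default_rng(0)
image = rng.random((32, 24))
bands, residual = build_pyramid(image, levels=3)
print(np.allclose(reconstruct(bands, residual), image))  # True
```

The reconstruction is exact (up to floating-point rounding) for any choice of smoothing kernel, because each band stores precisely what the upsampled coarser level fails to reproduce.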

Related Pages

Implementation:Zai_org_CogVideo_LapLoss
