Principle:Zai org CogVideo Latent Perceptual Loss
| Knowledge Sources | |
|---|---|
| Domains | Perceptual_Loss, Latent_Space_Modeling |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Latent perceptual loss combines a reconstruction objective in latent space with a perceptual quality objective in decoded image space, allowing models to train efficiently in compressed representations while preserving human-perceived visual quality.
Description
When training generative models in latent space (as in latent diffusion models), the primary training loss operates on compressed latent representations rather than raw pixels. While this dramatically reduces computational cost, a pure latent-space loss (such as L2) does not account for how differences in latent space map to perceptual differences in image space. Two predictions at the same L2 distance from a target latent may decode to images that differ greatly in perceived quality.
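Latent perceptual loss addresses this gap by introducing a secondary loss term computed in decoded image space. A frozen decoder $D$ maps the predicted latent $\hat{z}$ and the target latent $z$ back to images, and a perceptual similarity metric (typically LPIPS) measures the distance between the two decoded images. The total loss is a weighted sum with non-negative weights $\lambda_{\text{latent}}$ and $\lambda_{\text{perc}}$:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{latent}} \, \mathcal{L}_{\text{latent}}(\hat{z}, z) + \lambda_{\text{perc}} \, \mathcal{L}_{\text{perc}}\big(D(\hat{z}), D(z)\big)$$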
An optional third term compares decoded predictions against original input images when the training setup involves resolution changes or other transformations that make direct latent comparison insufficient.
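The weighted sum and the optional third term can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the CogVideo implementation: `decode` is a hypothetical stand-in for a frozen decoder, `perceptual_distance` is a toy substitute for LPIPS, and the weights `w_perc` and `w_input` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen decoder weights: 8-dim latent -> 64-dim "image". Never updated.
W_DEC = rng.normal(size=(64, 8))

def decode(z):
    """Stand-in for a frozen decoder (assumption, not a real model)."""
    return np.tanh(z @ W_DEC.T)

def perceptual_distance(x, y, eps=1e-10):
    """Toy substitute for LPIPS: squared distance between unit-normalized vectors."""
    fx = x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)
    fy = y / (np.linalg.norm(y, axis=-1, keepdims=True) + eps)
    return np.mean(np.sum((fx - fy) ** 2, axis=-1))

def latent_perceptual_loss(z_pred, z_target, x_input=None, w_perc=0.5, w_input=0.25):
    """Return each loss term and their weighted sum."""
    x_pred = decode(z_pred)
    terms = {
        "latent": np.mean((z_pred - z_target) ** 2),                  # L2 in latent space
        "perceptual": perceptual_distance(x_pred, decode(z_target)),  # decoded space
    }
    total = terms["latent"] + w_perc * terms["perceptual"]
    if x_input is not None:
        # Optional third term: decoded prediction vs. the original input image.
        terms["input"] = perceptual_distance(x_pred, x_input)
        total = total + w_input * terms["input"]
    terms["total"] = total
    return terms

z_target = rng.normal(size=(4, 8))
z_pred = z_target + 0.1 * rng.normal(size=(4, 8))
terms = latent_perceptual_loss(z_pred, z_target)
terms_with_input = latent_perceptual_loss(z_pred, z_target, x_input=decode(z_target))
print(terms)
```

Returning the individual terms alongside the total makes it easier to log each component and to tune the weights separately.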
Usage
Use latent perceptual loss when training models that operate in a compressed latent space but must produce outputs that are perceptually faithful when decoded. This is particularly relevant for:
- Fine-tuning latent diffusion model encoders or decoders
- Training two-stage autoencoders where the second stage refines latent representations
- Any scenario where latent-space metrics alone do not capture perceptual quality
Theoretical Basis
The theoretical foundation rests on two observations:
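1. Latent space is not perceptually uniform. The mapping from latent space to image space through a decoder $D$ is generally nonlinear, so its Jacobian $J_D(z) = \partial D / \partial z$ varies across latent space. To first order, a small perturbation $\delta$ of fixed size therefore changes the decoded image by an amount that depends on where $z$ lies:

$$\| D(z + \delta) - D(z) \| \approx \| J_D(z) \, \delta \|$$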
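2. Perceptual metrics capture human judgment. LPIPS computes a weighted combination of feature-level distances from a deep network:

$$\text{LPIPS}(x, y) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \phi_l(x)_{hw} - \phi_l(y)_{hw} \right) \right\|_2^2$$

where $\phi_l$ extracts channel-normalized features from layer $l$, $H_l \times W_l$ is the spatial size of that layer's feature map, and $w_l$ are learned per-channel weights calibrated against human perceptual judgments.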
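The LPIPS-style computation in observation 2 (normalize each feature vector along the channel axis, scale by per-channel weights, take the squared difference, average over spatial positions, sum over layers) can be sketched as follows. Real LPIPS takes its features from a pretrained network and uses learned weights; here random arrays stand in for both, so the numbers carry no perceptual meaning and only the arithmetic is illustrated.

```python
import numpy as np

def unit_normalize(feat, eps=1e-10):
    """Scale the channel vector at every spatial position to unit length.

    feat has shape (C, H, W).
    """
    norm = np.sqrt(np.sum(feat ** 2, axis=0, keepdims=True))
    return feat / (norm + eps)

def lpips_style_distance(feats_x, feats_y, weights):
    """Sum over layers of the spatially averaged, channel-weighted squared distance.

    feats_x, feats_y: lists of (C_l, H_l, W_l) arrays, one per layer.
    weights: list of (C_l,) arrays, one per layer.
    """
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        diff = unit_normalize(fx) - unit_normalize(fy)             # (C, H, W)
        per_position = np.sum((w[:, None, None] * diff) ** 2, 0)   # (H, W)
        total += per_position.mean()                               # average over h, w
    return float(total)

rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8)]   # two hypothetical layers
feats_x = [rng.normal(size=s) for s in shapes]
feats_y = [rng.normal(size=s) for s in shapes]
weights = [np.abs(rng.normal(size=s[0])) for s in shapes]   # stand-ins for learned weights

d_xy = lpips_style_distance(feats_x, feats_y, weights)
d_xx = lpips_style_distance(feats_x, feats_x, weights)
print(d_xy, d_xx)
```

Because of the channel normalization, rescaling one image's features by a positive constant leaves the distance essentially unchanged; the metric responds to the direction of the feature vectors, not their magnitude.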
The composite loss encourages the model to optimize for both accurate latent-space reconstruction (stability and convergence) and perceptual quality (visual fidelity). The decoder is frozen during training: its parameters receive no updates, and the gradient of the perceptual term passes through it only to reach the predicted latents, giving the latent-space model a perceptually informed gradient signal.
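Both the non-uniformity in observation 1 and the gradient path through the frozen decoder come down to the decoder Jacobian. The sketch below uses a hypothetical toy decoder `tanh(W z)` with fixed weights and a plain squared error in decoded space as a stand-in for a perceptual metric. It compares the effect of one latent perturbation at two points, then takes a few gradient steps that update only the predicted latent.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 4))   # weights of a hypothetical toy decoder; never updated

def decode(z):
    """Toy frozen decoder: 4-dim latent -> 32-dim 'image'."""
    return np.tanh(W @ z)

def jacobian(z):
    """J_D(z) for D(z) = tanh(W z), i.e. diag(1 - tanh^2(W z)) @ W."""
    return (1.0 - decode(z) ** 2)[:, None] * W

def image_loss(z_pred, z_target):
    """Squared error in decoded space (stand-in for a perceptual metric)."""
    return np.sum((decode(z_pred) - decode(z_target)) ** 2)

def grad_wrt_pred(z_pred, z_target):
    """Gradient of image_loss with respect to z_pred only: J_D(z_pred)^T @ dL/dx."""
    residual = 2.0 * (decode(z_pred) - decode(z_target))
    return jacobian(z_pred).T @ residual

# Observation 1: the same latent perturbation, applied at two different points.
delta = np.array([0.01, 0.0, 0.0, 0.0])
z_a = np.zeros(4)          # tanh is close to linear here
z_b = 3.0 * np.ones(4)     # many tanh units are saturated here
change_a = np.linalg.norm(decode(z_a + delta) - decode(z_a))
change_b = np.linalg.norm(decode(z_b + delta) - decode(z_b))

# Gradient path: descent steps move the predicted latent; W stays fixed.
z_target = rng.normal(size=4)
z_start = z_target + 0.1 * rng.normal(size=4)
z_pred = z_start.copy()
for _ in range(100):
    z_pred = z_pred - 0.002 * grad_wrt_pred(z_pred, z_target)

print("image change at z_a vs z_b:", change_a, change_b)
print("loss before vs after:", image_loss(z_start, z_target), image_loss(z_pred, z_target))
```

In an autograd framework the same effect is obtained by disabling gradients for the decoder parameters while leaving the computation graph from the predicted latent to the loss intact.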