Principle:NVIDIA NeMo Curator Aesthetic Quality Filtering
| Metadata | |
|---|---|
| Knowledge Sources | Paper: LAION-5B |
| Domains | Data_Curation, Image_Processing, Quality_Assessment |
| Last Updated | 2026-02-14 |
Overview
Aesthetic Quality Filtering is a technique for scoring and filtering images based on visual aesthetic quality using a learned MLP predictor applied to CLIP embeddings.
Description
Aesthetic Quality Filtering in NeMo Curator uses the LAION aesthetic predictor, a multi-layer perceptron (MLP) trained on top of CLIP embeddings, to score the visual appeal of images. Each image's CLIP embedding vector is passed through the aesthetic predictor, which outputs a scalar aesthetic quality score. Images are then filtered based on a configurable score threshold, retaining only those images that meet or exceed the desired aesthetic quality level. This approach enables efficient large-scale filtering because it operates on pre-computed CLIP embeddings rather than raw pixel data, making the scoring process lightweight and fast.
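The score-then-threshold flow described above can be sketched in a few lines of plain Python. This is a minimal illustration, not NeMo Curator's API: `filter_by_aesthetic_score`, `toy_predictor`, the three-dimensional embeddings, and the weights are all hypothetical stand-ins chosen so the arithmetic is easy to follow.

```python
from typing import Callable, Dict, Sequence

Embedding = Sequence[float]


def filter_by_aesthetic_score(
    embeddings: Dict[str, Embedding],
    predictor: Callable[[Embedding], float],
    threshold: float,
) -> Dict[str, float]:
    """Score each pre-computed embedding and keep images at or above the threshold."""
    kept = {}
    for image_id, embedding in embeddings.items():
        score = predictor(embedding)  # scalar aesthetic score
        if score >= threshold:  # "meet or exceed" the threshold
            kept[image_id] = score
    return kept


def toy_predictor(embedding: Embedding) -> float:
    """Hypothetical stand-in: a fixed linear map, NOT the LAION predictor."""
    weights = [2.0, -1.0, 0.5]
    return 5.0 + sum(w * x for w, x in zip(weights, embedding))


# Toy 3-dimensional "embeddings"; real CLIP embeddings have hundreds of dimensions.
toy_embeddings = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.0, 0.0, 1.0],
}

kept = filter_by_aesthetic_score(toy_embeddings, toy_predictor, threshold=5.5)
print(kept)
```

Because the function only touches embedding vectors, no image decoding happens at filter time, which is the source of the efficiency noted above.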
Usage
Use Aesthetic Quality Filtering after the CLIP Embedding stage to remove low-quality or visually unappealing images from the dataset. This stage is particularly useful when curating datasets for generative image models, where training on aesthetically pleasing images tends to improve output quality. Choose the score threshold according to the quality-quantity tradeoff of the specific use case: a higher threshold keeps fewer images with higher scores, while a lower threshold keeps more of the dataset.
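One way to reason about the quality-quantity tradeoff is to check how much of the dataset survives at several candidate thresholds before committing to one. The sketch below assumes the scores have already been computed; `retention_by_threshold` is a hypothetical helper and the score values are made up for illustration.

```python
def retention_by_threshold(scores, thresholds):
    """Return the fraction of images kept (score >= threshold) for each threshold."""
    total = len(scores)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(s >= t for s in scores) / total for t in thresholds}


# Illustrative scores only; not taken from any real dataset.
sample_scores = [3.2, 4.1, 4.8, 5.0, 5.6, 6.3, 6.9, 7.4]

retention = retention_by_threshold(sample_scores, [4.0, 5.0, 6.0])
for threshold, fraction in retention.items():
    print(f"threshold {threshold}: keeps {fraction:.0%} of images")
```

Running such a sweep on a sample of the data gives a rough estimate of the final dataset size for each threshold.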
Theoretical Basis
Aesthetic Quality Filtering is based on the principle that visual aesthetic quality can be predicted from learned image representations. The LAION aesthetic predictor is a lightweight head (a linear probe or shallow MLP) trained on human aesthetic ratings collected from various sources. It takes CLIP image embeddings as input and produces a scalar score that correlates with human judgments of visual appeal. This approach relies on the semantic information captured by CLIP embeddings, which encode both content and style information relevant to aesthetic perception.
The training data for the aesthetic predictor consists of images rated by humans on scales of visual quality, composition, and appeal. By learning a mapping from CLIP embedding space to aesthetic scores, the predictor generalizes across diverse image content and styles, enabling automated aesthetic filtering at scale without human review of individual images.
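The mapping from embedding space to a scalar score can be illustrated with a toy forward pass. The sketch below is not the LAION architecture: the single hidden layer, the ReLU activation, the L2 normalization of the input, and all weights are assumptions chosen for readability. In practice the weights would be learned by regressing on human ratings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (assumed preprocessing step)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def linear(x, weights, bias):
    """Dense layer: one output per weight row, computed as row . x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def shallow_mlp_score(embedding, w1, b1, w2, b2):
    """Map an embedding to a scalar score with one hidden ReLU layer."""
    hidden = [max(0.0, h) for h in linear(l2_normalize(embedding), w1, b1)]
    return linear(hidden, w2, b2)[0]


# Hypothetical 2-dimensional embedding and hand-picked weights.
embedding = [3.0, 4.0]
w1, b1 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]
w2, b2 = [[5.0, 1.0]], [2.0]

score = shallow_mlp_score(embedding, w1, b1, w2, b2)
print(score)
```

Dropping the hidden layer and activation reduces this to a linear probe, the simpler of the two predictor forms mentioned above.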