Principle:NVIDIA NeMo Curator Aesthetic Quality Filtering

Metadata
Knowledge Sources	Paper: LAION-5B
Domains	Data_Curation, Image_Processing, Quality_Assessment
Last Updated	2026-02-14

Overview

Aesthetic Quality Filtering is a technique for scoring and filtering images based on visual aesthetic quality using a learned MLP predictor applied to CLIP embeddings.

Description

Aesthetic Quality Filtering in NeMo Curator uses the LAION aesthetic predictor, a multi-layer perceptron (MLP) trained on top of CLIP embeddings, to score the visual appeal of images. Each image's CLIP embedding vector is passed through the aesthetic predictor, which outputs a scalar aesthetic quality score. Images are then filtered based on a configurable score threshold, retaining only those images that meet or exceed the desired aesthetic quality level. This approach enables efficient large-scale filtering because it operates on pre-computed CLIP embeddings rather than raw pixel data, making the scoring process lightweight and fast.
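The score-then-threshold flow described above can be sketched in a few lines. This is a minimal illustration, not NeMo Curator's implementation: the function names (`aesthetic_score`, `filter_by_score`), the L2 normalization step, the ReLU activation, the toy 4-dimensional embeddings, and the hand-picked weights are all assumptions made for the example. A real predictor would load trained weights and consume full-size CLIP embeddings.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Layer = Tuple[List[List[float]], List[float]]  # (weights[out][in], bias[out])


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dense(x: Vector, layer: Layer) -> List[float]:
    """One fully connected layer: y = W x + b."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def aesthetic_score(embedding: Vector, hidden: Layer, output: Layer) -> float:
    """Map an embedding to a scalar score with a one-hidden-layer MLP (assumed shape)."""
    h = [max(0.0, a) for a in dense(l2_normalize(embedding), hidden)]  # ReLU
    return dense(h, output)[0]


def filter_by_score(items, hidden: Layer, output: Layer, threshold: float):
    """Keep (image_id, score) pairs whose score meets or exceeds the threshold."""
    scored = [(image_id, aesthetic_score(emb, hidden, output)) for image_id, emb in items]
    return [(image_id, score) for image_id, score in scored if score >= threshold]


# Toy weights and embeddings, chosen by hand for illustration only.
HIDDEN: Layer = ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0, 0.0])
OUTPUT: Layer = ([[10.0, 0.0]], [0.0])

images = [
    ("a.jpg", [3.0, 0.0, 4.0, 0.0]),
    ("b.jpg", [0.0, 2.0, 0.0, 0.0]),
    ("c.jpg", [1.0, 0.0, 0.0, 0.0]),
]
kept = filter_by_score(images, HIDDEN, OUTPUT, threshold=5.0)
```

Because only the embeddings are touched, the per-image cost is a handful of small matrix-vector products, which is what makes this stage cheap relative to re-reading pixel data.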

Usage

Use Aesthetic Quality Filtering after the CLIP Embedding stage, since the predictor consumes the embeddings that stage produces, to remove low-quality or visually unappealing images from the dataset. This stage is particularly useful when curating datasets for generative image models, where training on aesthetically pleasing images is generally expected to improve output quality. Adjust the score threshold to balance quality against quantity for the specific use case: a higher threshold keeps fewer, higher-scoring images, while a lower threshold retains more of the dataset.
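One way to choose a threshold is to check how much of the dataset survives at several candidate values before committing to one. The sketch below assumes scores have already been computed; `retention_by_threshold` is a hypothetical helper and the score values are made up for illustration.

```python
from typing import Dict, Sequence


def retention_by_threshold(
    scores: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, float]:
    """Return the fraction of scores that meet or exceed each threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    total = len(scores)
    return {t: sum(s >= t for s in scores) / total for t in thresholds}


# Hypothetical aesthetic scores for eight images.
example_scores = [3.1, 4.2, 4.8, 5.0, 5.5, 6.1, 6.7, 7.3]
retention = retention_by_threshold(example_scores, [4.0, 5.0, 6.0])

for threshold, kept_fraction in retention.items():
    print(f"threshold {threshold:.1f}: keep {kept_fraction:.1%}")
```

On a real dataset the same sweep can be run over a random sample of scores, which is usually enough to estimate the retained fraction for each candidate threshold.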

Theoretical Basis

Aesthetic Quality Filtering is based on the principle that visual aesthetic quality can be predicted from learned image representations. The LAION aesthetic predictor is a linear probe or shallow MLP trained on human aesthetic ratings collected from various sources. The predictor takes CLIP image embeddings as input and produces a scalar score that correlates with human judgments of visual appeal. This approach leverages the rich semantic information captured by CLIP embeddings, which encode both content and style information relevant to aesthetic perception. The training data for the aesthetic predictor consists of images rated by humans on scales of visual quality, composition, and appeal. By learning a mapping from CLIP embedding space to aesthetic scores, the predictor generalizes across diverse image content and styles, enabling automated aesthetic filtering at scale without requiring human review of individual images.
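The idea of learning a mapping from embedding space to ratings can be illustrated with the simplest case mentioned above, a linear probe fitted by gradient descent on mean squared error. Everything here is an assumption for the sake of the sketch: the 3-dimensional "embeddings" and the ratings are synthetic, and `fit_linear_probe` is a hypothetical name. It does not reproduce how the LAION predictor was actually trained.

```python
import random
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Linear probe: score = w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def fit_linear_probe(
    embeddings: Sequence[Sequence[float]],
    ratings: Sequence[float],
    lr: float = 0.1,
    epochs: int = 1000,
) -> Tuple[List[float], float]:
    """Fit weights and bias by full-batch gradient descent on mean squared error."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, ratings):
            err = predict(weights, bias, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Gradient of mean((pred - y)^2) is 2 * mean(err * x).
        weights = [w - lr * 2.0 * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * 2.0 * grad_b / n
    return weights, bias


# Synthetic data: ratings follow a known linear rule so the fit can be checked.
rng = random.Random(0)
TRUE_WEIGHTS = [2.0, -1.0, 0.0]
TRUE_BIAS = 5.0
train_x = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(50)]
train_y = [predict(TRUE_WEIGHTS, TRUE_BIAS, x) for x in train_x]

learned_w, learned_b = fit_linear_probe(train_x, train_y)
```

With real human ratings the fit is noisy rather than exact, and a shallow MLP replaces the single linear layer when the relationship between embedding and rating is not well captured by a linear map.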

Related Pages

Implementation:NVIDIA_NeMo_Curator_ImageAestheticFilterStage
