Implementation:Datajuicer Data juicer ImageAestheticsFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete Data-Juicer operator that filters data samples by their image aesthetics scores.
Description
ImageAestheticsFilter is a filter operator that keeps samples whose image aesthetics scores fall between min_score and max_score. It predicts the scores with a HuggingFace model; when hf_scorer_model is left empty, it uses shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE. If the model name includes 'shunk031/aesthetics-predictor', the raw scores are divided by 10 to normalize them. For samples with multiple images, the 'any' strategy keeps the sample if at least one image is in range, and 'all' requires every image to be in range. The operator supports CUDA acceleration. It extends the Filter base class and follows the two-phase compute_stats/process pattern: compute_stats caches the key metric image_aesthetics_scores in the sample's stats field, and process uses it to decide whether to keep the sample.
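The normalization and the 'any'/'all' decision described above can be sketched in plain Python. This is a minimal illustration and not the operator's source: the helper names `normalize_scores` and `keep_sample` are hypothetical, the range check is assumed to be inclusive, and keeping samples that contain no images is an assumption as well.

```python
from typing import List, Sequence

# Substring named in the description; matching models have scores divided by 10.
SHUNK031_PREFIX = "shunk031/aesthetics-predictor"


def normalize_scores(raw_scores: Sequence[float], model_name: str) -> List[float]:
    """Divide raw scores by 10 when the model name includes the shunk031 prefix."""
    if SHUNK031_PREFIX in model_name:
        return [score / 10.0 for score in raw_scores]
    return list(raw_scores)


def keep_sample(
    scores: Sequence[float],
    min_score: float = 0.5,
    max_score: float = 1.0,
    any_or_all: str = "any",
) -> bool:
    """Decide whether a sample is kept, given one score per image."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")
    if not scores:
        # Assumption: a sample without images is kept.
        return True
    # Assumption: both bounds are inclusive.
    in_range = [min_score <= score <= max_score for score in scores]
    return any(in_range) if any_or_all == "any" else all(in_range)


model = "shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE"
scores = normalize_scores([7.0, 3.0], model)   # one raw score per image
print(keep_sample(scores, any_or_all="any"))   # True: the first image is in range
print(keep_sample(scores, any_or_all="all"))   # False: the second image is below min_score
```

With 'any', a single good image is enough to keep a multi-image sample; 'all' is the stricter choice when every image in a sample must meet the threshold.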
Usage
Import this operator when you need to filter dataset samples based on the aesthetic quality of images. Configure it in your Data-Juicer YAML config or instantiate directly.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/image_aesthetics_filter.py
- Lines: 1-128
Signature
@OPERATORS.register_module("image_aesthetics_filter")
@LOADED_IMAGES.register_module("image_aesthetics_filter")
class ImageAestheticsFilter(Filter):
    def __init__(
        self,
        hf_scorer_model: str = "",
        trust_remote_code: bool = False,
        min_score: float = 0.5,
        max_score: float = 1.0,
        any_or_all: str = "any",
        *args,
        **kwargs,
    ):
        ...
Import
from data_juicer.ops.filter.image_aesthetics_filter import ImageAestheticsFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hf_scorer_model | str | No | HuggingFace model name for aesthetics prediction. Default: "" (empty string), which selects "shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE" |
| trust_remote_code | bool | No | Whether to trust remote code of HF models. Default: False |
| min_score | float | No | Minimum aesthetics score to keep samples. Default: 0.5 |
| max_score | float | No | Maximum aesthetics score to keep samples. Default: 1.0 |
| any_or_all | str | No | Keep strategy: 'any' or 'all' across images. Default: "any" |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (image_aesthetics_scores) |
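To make the contract concrete, the toy class below mimics the two-phase pattern with a stubbed scorer. It is a sketch under stated assumptions and does not use Data-Juicer itself: `ToyAestheticsFilter`, `stub_scorer`, and the plain "stats" and "images" keys are hypothetical stand-ins, and the real operator stores its stats under Data-Juicer's own field names.

```python
from typing import Callable, Dict, List

STATS_KEY = "stats"                        # hypothetical stand-in for the stats field
SCORES_KEY = "image_aesthetics_scores"     # metric name from the table above


class ToyAestheticsFilter:
    """Toy two-phase filter: compute_stats caches scores, process decides."""

    def __init__(self, scorer: Callable[[str], float],
                 min_score: float = 0.5, max_score: float = 1.0,
                 any_or_all: str = "any"):
        self.scorer = scorer
        self.min_score = min_score
        self.max_score = max_score
        self.any = any_or_all == "any"

    def compute_stats(self, sample: Dict) -> Dict:
        stats = sample.setdefault(STATS_KEY, {})
        if SCORES_KEY in stats:
            return sample  # already cached, so the scorer is not called again
        stats[SCORES_KEY] = [self.scorer(img) for img in sample.get("images", [])]
        return sample

    def process(self, sample: Dict) -> bool:
        scores: List[float] = sample[STATS_KEY][SCORES_KEY]
        if not scores:
            return True  # assumption: samples without images are kept
        flags = [self.min_score <= s <= self.max_score for s in scores]
        return any(flags) if self.any else all(flags)


calls: List[str] = []


def stub_scorer(image: str) -> float:
    """Hypothetical scorer returning fixed, already-normalized scores."""
    calls.append(image)
    return {"a.jpg": 0.8, "b.jpg": 0.2}[image]


op = ToyAestheticsFilter(stub_scorer)
sample = op.compute_stats({"images": ["a.jpg", "b.jpg"]})
op.compute_stats(sample)          # second call reuses the cached scores
print(sample[STATS_KEY])          # {'image_aesthetics_scores': [0.8, 0.2]}
print(len(calls))                 # 2: each image was scored once
print(op.process(sample))         # True under the default 'any' strategy
```

Separating the two phases means the expensive model inference runs once per sample, while the cheap threshold check can be rerun with different min_score, max_score, or any_or_all settings against the cached scores.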
Usage Examples
YAML Configuration
process:
  - image_aesthetics_filter:
      min_score: 0.5
      max_score: 1.0
      any_or_all: "any"
Python API
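from data_juicer.ops.filter.image_aesthetics_filter import ImageAestheticsFilter

# hf_scorer_model is left at its default, so the default scorer model is used
op = ImageAestheticsFilter(min_score=0.5, max_score=1.0, any_or_all="any")

# Apply to a dataset; `dataset` is assumed to be an already-loaded Data-Juicer dataset
result = dataset.process(op)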