Implementation:Datajuicer Data juicer ImagePairSimilarityFilter

From Leeroopedia
Knowledge Sources
Domains Data_Quality, Filtering
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for filtering data samples based on the cosine similarity between the two images in each sample.

Description

ImagePairSimilarityFilter is a filter operator that keeps samples whose image-pair similarity falls within a specified range, bounded by min_score and max_score. It uses a HuggingFace CLIP model (default: openai/clip-vit-base-patch32) to compute the cosine similarity between the two images in each sample. Each sample must include exactly two distinct images. The similarity scores are cached under the image_pair_similarity stats key. The operator supports CUDA acceleration and the 'any'/'all' keep strategies. It extends the Filter base class and implements the two-phase compute_stats/process pattern: compute_stats calculates and caches the score, and process decides whether the sample is kept.
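The two-phase pattern described above can be sketched in plain Python. This is a minimal illustration, not Data-Juicer's implementation: the class ToyPairSimilarityFilter, the "embeddings" field, and the "stats" field name are hypothetical, and precomputed vectors stand in for CLIP image embeddings. Only the image_pair_similarity key comes from this page.

```python
import math

# Hypothetical field name for this sketch; the page only confirms that scores
# are cached under the "image_pair_similarity" stats key.
STATS_FIELD = "stats"
SIM_KEY = "image_pair_similarity"


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyPairSimilarityFilter:
    """Simplified stand-in that takes precomputed embeddings instead of running CLIP."""

    def __init__(self, min_score=0.1, max_score=1.0):
        self.min_score = min_score
        self.max_score = max_score

    def compute_stats(self, sample):
        # Phase 1: compute the score once and cache it on the sample.
        stats = sample.setdefault(STATS_FIELD, {})
        if SIM_KEY in stats:  # already cached, skip recomputation
            return sample
        emb_a, emb_b = sample["embeddings"]  # unpacking fails unless there are exactly two
        stats[SIM_KEY] = cosine_similarity(emb_a, emb_b)
        return sample

    def process(self, sample):
        # Phase 2: decide from the cached score whether to keep the sample.
        score = sample[STATS_FIELD][SIM_KEY]
        return self.min_score <= score <= self.max_score


op = ToyPairSimilarityFilter(min_score=0.5, max_score=1.0)
similar = op.compute_stats({"embeddings": [[1.0, 0.0], [1.0, 0.0]]})
different = op.compute_stats({"embeddings": [[1.0, 0.0], [0.0, 1.0]]})
kept = [op.process(similar), op.process(different)]
```

Separating the two phases means the expensive step (the model forward pass in the real operator) runs once per sample, while the cheap threshold check can be repeated with different bounds against the cached score.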

Usage

Import this operator when you need to filter dataset samples based on the visual similarity between the two images in each sample. Configure it in your Data-Juicer YAML config or instantiate it directly in Python.

Code Reference

Source Location

Signature

@OPERATORS.register_module("image_pair_similarity_filter")
@LOADED_IMAGES.register_module("image_pair_similarity_filter")
class ImagePairSimilarityFilter(Filter):
    def __init__(
        self,
        hf_clip="openai/clip-vit-base-patch32",
        trust_remote_code=False,
        min_score: ClosedUnitInterval = 0.1,
        max_score: ClosedUnitInterval = 1.0,
        any_or_all: str = "any",
        *args,
        **kwargs,
    ):
        ...

Import

from data_juicer.ops.filter.image_pair_similarity_filter import ImagePairSimilarityFilter

I/O Contract

Inputs

Name | Type | Required | Description
hf_clip | str | No | CLIP model name on HuggingFace for computing similarity. Default: "openai/clip-vit-base-patch32"
trust_remote_code | bool | No | Whether to trust remote code of HF models. Default: False
min_score | ClosedUnitInterval | No | The minimum similarity score to keep samples. Default: 0.1
max_score | ClosedUnitInterval | No | The maximum similarity score to keep samples. Default: 1.0
any_or_all | str | No | Keep strategy: 'any' or 'all' across image pairs. Default: "any"
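How min_score, max_score, and any_or_all might combine into a keep decision can be sketched as follows. The helper keep_sample is hypothetical and written only for illustration; it assumes the bounds are inclusive and that a sample can carry a list of similarity scores.

```python
def keep_sample(scores, min_score=0.1, max_score=1.0, any_or_all="any"):
    """Return True if the sample should be kept (illustrative helper, not library code)."""
    if any_or_all not in ("any", "all"):
        raise ValueError(f"any_or_all must be 'any' or 'all', got {any_or_all!r}")
    # One boolean per score: does it fall inside the accepted range?
    in_range = [min_score <= s <= max_score for s in scores]
    return any(in_range) if any_or_all == "any" else all(in_range)


mixed = [0.05, 0.4]  # one score below min_score, one inside the range
keep_any = keep_sample(mixed, any_or_all="any")
keep_all = keep_sample(mixed, any_or_all="all")
```

When a sample yields a single score, as with exactly two images, the two strategies return the same result.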

Outputs

Name | Type | Description
samples | Dict | Filtered samples with stats field updated (image_pair_similarity)

Usage Examples

YAML Configuration

process:
  - image_pair_similarity_filter:
      hf_clip: "openai/clip-vit-base-patch32"
      min_score: 0.1
      max_score: 1.0
      any_or_all: "any"

Python API

from data_juicer.ops.filter.image_pair_similarity_filter import ImagePairSimilarityFilter

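# Keep samples whose image-pair similarity lies between 0.1 and 1.0
op = ImagePairSimilarityFilter(min_score=0.1, max_score=1.0)
# Apply to a Data-Juicer dataset; `dataset` is assumed to be loaded elsewhere
result = dataset.process(op)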
