Implementation:Datajuicer Data juicer ImagePairSimilarityFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples based on the cosine similarity between the two images in each sample.
Description
ImagePairSimilarityFilter is a filter operator that keeps samples whose image-pair similarity falls within a specified range, bounded by min_score and max_score. It uses a HuggingFace CLIP model (default: openai/clip-vit-base-patch32) to compute the cosine similarity between the two images in each sample. Each sample must include exactly two distinct images. The similarity scores are cached under the image_pair_similarity stats key. The operator supports CUDA acceleration and the 'any' and 'all' keep strategies. It extends the Filter base class and implements the two-phase compute_stats/process pattern: compute_stats computes and caches the score, and process decides from that score whether to keep the sample.
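To make the scoring step concrete, here is a minimal sketch of cosine similarity between two embedding vectors, followed by a range check. It is not the operator's own code: the helper names cosine_similarity and in_range are hypothetical, the three-dimensional vectors stand in for real CLIP embeddings, and the bounds are assumed to be inclusive.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def in_range(score: float, min_score: float = 0.1, max_score: float = 1.0) -> bool:
    """True when score lies in [min_score, max_score] (inclusive bounds assumed)."""
    return min_score <= score <= max_score


# Hypothetical 3-dimensional "embeddings"; real CLIP embeddings are much longer.
emb_a = [1.0, 0.0, 0.0]
emb_b = [1.0, 1.0, 0.0]
emb_c = [0.0, 0.0, 1.0]

sim_ab = cosine_similarity(emb_a, emb_b)  # 45-degree angle, about 0.707
sim_ac = cosine_similarity(emb_a, emb_c)  # orthogonal vectors, 0.0
```

With the default bounds, sim_ab passes the range check and sim_ac does not, so a pair of unrelated images would be dropped while a moderately similar pair would be kept.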
Usage
Import this operator when you need to filter dataset samples based on the visual similarity between pairs of images. Configure it in your Data-Juicer YAML config or instantiate it directly in Python.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/image_pair_similarity_filter.py
- Lines: 1-113
Signature
@OPERATORS.register_module("image_pair_similarity_filter")
@LOADED_IMAGES.register_module("image_pair_similarity_filter")
class ImagePairSimilarityFilter(Filter):
    def __init__(
        self,
        hf_clip="openai/clip-vit-base-patch32",
        trust_remote_code=False,
        min_score: ClosedUnitInterval = 0.1,
        max_score: ClosedUnitInterval = 1.0,
        any_or_all: str = "any",
        *args,
        **kwargs,
    ):
        ...
Import
from data_juicer.ops.filter.image_pair_similarity_filter import ImagePairSimilarityFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hf_clip | str | No | CLIP model name on HuggingFace for computing similarity. Default: "openai/clip-vit-base-patch32" |
| trust_remote_code | bool | No | Whether to trust remote code of HF models. Default: False |
| min_score | ClosedUnitInterval | No | The minimum similarity score to keep samples. Default: 0.1 |
| max_score | ClosedUnitInterval | No | The maximum similarity score to keep samples. Default: 1.0 |
| any_or_all | str | No | Keep strategy: 'any' or 'all' across image pairs. Default: "any" |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (image_pair_similarity) |
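The two-phase compute_stats/process contract can be sketched with a toy filter. This is an illustration only, not the Data-Juicer class: ToyPairFilter, the "images" and "stats" sample keys, the one-element list layout of the cached score, and the made-up scores in FAKE_SCORES are all assumptions chosen to keep the example self-contained and free of model downloads.

```python
from typing import Callable, Dict, List, Tuple

STATS_KEY = "image_pair_similarity"  # stats key named on this page


class ToyPairFilter:
    """Hypothetical stand-in for illustration; not the Data-Juicer class."""

    def __init__(
        self,
        similarity_fn: Callable[[str, str], float],
        min_score: float = 0.1,
        max_score: float = 1.0,
        any_or_all: str = "any",
    ):
        if any_or_all not in ("any", "all"):
            raise ValueError("any_or_all must be 'any' or 'all'")
        self.similarity_fn = similarity_fn
        self.min_score = min_score
        self.max_score = max_score
        self.any_or_all = any_or_all

    def compute_stats(self, sample: Dict) -> Dict:
        """Phase 1: compute the score once and cache it in the sample."""
        stats = sample.setdefault("stats", {})
        if STATS_KEY in stats:
            return sample  # cached by an earlier run, nothing to do
        images = sample["images"]
        if len(images) != 2:
            raise ValueError("each sample must include exactly two images")
        # Stored as a one-element list (an assumption) so 'any'/'all' can aggregate.
        stats[STATS_KEY] = [self.similarity_fn(images[0], images[1])]
        return sample

    def process(self, sample: Dict) -> bool:
        """Phase 2: decide from the cached score alone; True means keep."""
        scores = sample["stats"][STATS_KEY]
        flags = [self.min_score <= s <= self.max_score for s in scores]
        return any(flags) if self.any_or_all == "any" else all(flags)


# Made-up scores standing in for a CLIP model.
FAKE_SCORES: Dict[Tuple[str, str], float] = {
    ("a.jpg", "b.jpg"): 0.82,
    ("c.jpg", "d.jpg"): 0.03,
}

op = ToyPairFilter(lambda x, y: FAKE_SCORES[(x, y)])
samples: List[Dict] = [{"images": ["a.jpg", "b.jpg"]}, {"images": ["c.jpg", "d.jpg"]}]
kept = [s for s in samples if op.process(op.compute_stats(s))]
```

Separating the two phases means the expensive scoring runs once per sample, while the keep/drop decision can be re-evaluated against different thresholds using only the cached value.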
Usage Examples
YAML Configuration
process:
  - image_pair_similarity_filter:
      hf_clip: "openai/clip-vit-base-patch32"
      min_score: 0.1
      max_score: 1.0
      any_or_all: "any"
Python API
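from data_juicer.ops.filter.image_pair_similarity_filter import ImagePairSimilarityFilter

# Keep samples whose image-pair similarity lies between min_score and max_score
op = ImagePairSimilarityFilter(min_score=0.1, max_score=1.0)

# Apply to a dataset; `dataset` is assumed to be an already-loaded Data-Juicer dataset
result = dataset.process(op)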