Implementation:Datajuicer Data juicer ImageDeduplicator
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Deduplication |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool for document-level deduplication based on perceptual image hashing provided by Data-Juicer.
Description
ImageDeduplicator extends Deduplicator and uses the imagededup library to compute perceptual hashes of images. It supports four hash methods: phash, dhash, whash, and ahash. The compute_hash method loads all images in a sample, hashes each one with the selected hasher, and concatenates the results into a single hash string stored under 'imagehash'. When consider_text is True, the image hash is combined with the text MD5 hash (computed via DocumentDeduplicator) to form a composite deduplication key, so two samples count as duplicates only if both their images and their text match. The process method then filters duplicates with a hash set, keeping only the first occurrence of each key. The operator is also registered with LOADED_IMAGES to support operator fusion.
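The flow described above (per-image hash, concatenation, optional text MD5, first-occurrence filtering) can be illustrated with a minimal, dependency-free sketch. It is not the operator's actual code: `toy_average_hash`, `dedup_key`, and `deduplicate` are hypothetical names, images are modelled as small grayscale grids rather than loaded files, and the toy hash stands in for the imagededup hashers. How the real operator treats samples without images is not modelled here.

```python
import hashlib


def toy_average_hash(pixels):
    """Toy average hash of a grayscale grid (list of rows of ints).

    Stand-in for a real perceptual hasher: one bit per pixel, set when the
    pixel is brighter than the grid mean, returned as a hex string.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = "".join("1" if p > mean else "0" for p in flat)
    width = (len(bits) + 3) // 4  # number of hex digits needed
    return format(int(bits, 2), f"0{width}x")


def dedup_key(sample, consider_text=False):
    """Concatenate the hashes of all images; optionally pair with a text MD5."""
    image_hash = "".join(toy_average_hash(img) for img in sample["images"])
    if not consider_text:
        return image_hash
    text_hash = hashlib.md5(sample["text"].encode("utf-8")).hexdigest()
    return (image_hash, text_hash)


def deduplicate(samples, consider_text=False):
    """Keep only the first sample seen for each deduplication key."""
    seen = set()
    kept = []
    for sample in samples:
        key = dedup_key(sample, consider_text)
        if key in seen:
            continue  # a previous sample already produced this key
        seen.add(key)
        kept.append(sample)
    return kept


# Two slightly different grids that hash identically, plus an inverted one.
img_a = [[10, 200], [10, 200]]
img_b = [[12, 198], [11, 201]]
img_c = [[200, 10], [200, 10]]

samples = [
    {"images": [img_a], "text": "a cat"},
    {"images": [img_b], "text": "a dog"},
    {"images": [img_c], "text": "a cat"},
]

image_only = deduplicate(samples)
image_and_text = deduplicate(samples, consider_text=True)
```

In this sketch the second sample is dropped when only images are compared, because its grid hashes to the same value as the first. With consider_text enabled it is kept, because its text differs.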
Usage
Use when you need to remove duplicate images from image-centric datasets. It helps preserve training data diversity by filtering out samples whose images are identical or visually near-identical.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/deduplicator/image_deduplicator.py
Signature
@OPERATORS.register_module("image_deduplicator")
class ImageDeduplicator(Deduplicator):
    def __init__(self, method: str = "phash",
                 consider_text: bool = False,
                 *args, **kwargs):
Import
from data_juicer.ops.deduplicator.image_deduplicator import ImageDeduplicator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| method | str | No | Hash method for images: "phash", "dhash", "whash", or "ahash". Default: "phash" |
| consider_text | bool | No | Whether to consider text hash together with image hash for deduplication. Default: False |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Deduplicated dataset with visually duplicate image samples removed |
| dup_pairs | dict | Dictionary of sampled duplicate pairs (when show_num > 0) |
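To make the dup_pairs output more concrete, the following sketch shows one plausible way to collect a bounded number of duplicate pairs while scanning precomputed keys. The exact structure returned by the operator is not specified here, so the shape used below (key mapped to a two-element list) and the function name `sample_duplicate_pairs` are assumptions for illustration only.

```python
def sample_duplicate_pairs(samples, keys, show_num):
    """Collect up to show_num entries of the form {key: [first, duplicate]}.

    `keys` holds one precomputed deduplication key per sample. The output
    layout is an illustrative assumption, not the operator's confirmed format.
    """
    first_seen = {}
    dup_pairs = {}
    for sample, key in zip(samples, keys):
        if key not in first_seen:
            first_seen[key] = sample  # remember the first occurrence
        elif key not in dup_pairs and len(dup_pairs) < show_num:
            dup_pairs[key] = [first_seen[key], sample]
    return dup_pairs


pairs = sample_duplicate_pairs(["s0", "s1", "s2", "s3"], ["a", "b", "a", "b"], show_num=1)
no_pairs = sample_duplicate_pairs(["s0", "s1", "s2", "s3"], ["a", "b", "a", "b"], show_num=0)
```

With show_num set to 0 nothing is collected, which matches the table's note that pairs are only produced when show_num > 0.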
Usage Examples
process:
  - image_deduplicator:
      method: "phash"
      consider_text: true