
Implementation:Datajuicer Data juicer ImageDeduplicator

From Leeroopedia
Knowledge Sources
Domains Data_Quality, Deduplication
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool for document-level deduplication based on perceptual image hashing provided by Data-Juicer.

Description

ImageDeduplicator extends Deduplicator and uses the imagededup library to compute perceptual hashes of images. It supports the phash, dhash, whash, and ahash methods. The compute_hash method loads every image in a sample, hashes each one with the selected hasher, and concatenates the results into a single hash string stored under 'imagehash'. When consider_text is True, it also computes an MD5 hash of the sample text (via DocumentDeduplicator) and uses the image hash and text hash together as a composite deduplication key. The process method filters duplicates with a hash set, keeping only the first occurrence of each key. The operator is registered with LOADED_IMAGES to support operator fusion.
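The keying and first-occurrence filtering described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Data-Juicer's implementation: `toy_ahash` is a hypothetical stand-in for an imagededup hasher, images are modeled as small grayscale pixel grids, and the composite key is assumed to be an (image hash, text MD5) tuple.

```python
import hashlib
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[int]]  # hypothetical stand-in: rows of grayscale pixels


def toy_ahash(image: Image) -> str:
    """Toy average hash (NOT imagededup): one bit per pixel, set if pixel >= mean."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    bits = "".join("1" if p >= mean else "0" for p in pixels)
    width = (len(bits) + 3) // 4  # number of hex digits needed
    return f"{int(bits, 2):0{width}x}"


def compute_key(sample: dict, hasher: Callable[[Image], str],
                consider_text: bool = False) -> Tuple[str, ...]:
    """Concatenate per-image hashes; optionally pair them with a text MD5."""
    image_hash = "".join(hasher(img) for img in sample["images"])
    if consider_text:
        text_hash = hashlib.md5(sample["text"].encode("utf-8")).hexdigest()
        return (image_hash, text_hash)  # assumed composite-key layout
    return (image_hash,)


def deduplicate(samples: List[dict], hasher: Callable[[Image], str],
                consider_text: bool = False) -> List[dict]:
    """Keep only the first sample seen for each key (hash-set filtering)."""
    seen = set()
    kept = []
    for sample in samples:
        key = compute_key(sample, hasher, consider_text)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept


img_a = [[10, 200], [220, 15]]
img_b = [[12, 198], [225, 20]]   # slightly brighter copy of img_a
img_c = [[200, 10], [15, 220]]   # different pattern

samples = [
    {"id": 1, "images": [img_a], "text": "cat"},
    {"id": 2, "images": [img_b], "text": "cat"},
    {"id": 3, "images": [img_b], "text": "dog"},
    {"id": 4, "images": [img_c], "text": "cat"},
]

image_only = [s["id"] for s in deduplicate(samples, toy_ahash)]
with_text = [s["id"] for s in deduplicate(samples, toy_ahash, consider_text=True)]
```

With image hashes alone, samples 2 and 3 collapse into sample 1. With the text hash included, sample 3 survives because its text differs, which is the trade-off consider_text controls.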

Usage

Use when you need to remove duplicate images from image-centric datasets. The operator preserves training data diversity by filtering out samples whose images are identical or visually near-identical, that is, samples whose perceptual hashes match.

Code Reference

Source Location

Signature

@OPERATORS.register_module("image_deduplicator")
class ImageDeduplicator(Deduplicator):
    def __init__(self, method: str = "phash",
                 consider_text: bool = False,
                 *args, **kwargs):

Import

from data_juicer.ops.deduplicator.image_deduplicator import ImageDeduplicator

I/O Contract

Inputs

Name | Type | Required | Description
method | str | No | Hash method for images: "phash", "dhash", "whash", or "ahash". Default: "phash"
consider_text | bool | No | Whether to combine the text hash with the image hash for deduplication. Default: False

Outputs

Name | Type | Description
dataset | Dataset | Deduplicated dataset with visually duplicate image samples removed
dup_pairs | dict | Dictionary of sampled duplicate pairs (when show_num > 0)
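One plausible way to produce both outputs in a single pass is sketched below. The function name `dedup_with_pairs`, the choice to record two samples per tracked key, and the precomputed `keys` list are all assumptions made for illustration; the source only states that duplicate pairs are sampled when show_num > 0.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def dedup_with_pairs(samples: List[dict], keys: List[Hashable],
                     show_num: int = 0) -> Tuple[List[dict], Dict[Hashable, List[dict]]]:
    """Return (kept samples, sampled duplicate pairs) for precomputed keys."""
    # Pick at most show_num keys that occur more than once (hypothetical policy).
    tracked: Dict[Hashable, List[dict]] = {}
    if show_num > 0:
        for key, count in Counter(keys).items():
            if count > 1:
                tracked[key] = []
                if len(tracked) >= show_num:
                    break

    seen = set()
    kept = []
    for sample, key in zip(samples, keys):
        # Record the first two samples for each tracked key as an example pair.
        if key in tracked and len(tracked[key]) < 2:
            tracked[key].append(sample)
        # Hash-set filtering: keep only the first occurrence of each key.
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept, tracked


rows = [{"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}]
row_keys = ["a", "b", "a", "a"]

kept_rows, dup_pairs = dedup_with_pairs(rows, row_keys, show_num=1)
```

Here `kept_rows` holds the samples with ids 0 and 1, and `dup_pairs` maps key "a" to the first two samples that share it. With show_num left at 0, the dictionary stays empty.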

Usage Examples

process:
  - image_deduplicator:
      method: "phash"
      consider_text: true
