Implementation:Datajuicer Data juicer DocumentDeduplicator

From Leeroopedia
Knowledge Sources
Domains Data_Quality, Deduplication
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for exact document-level deduplication using MD5 hashing.

Description

DocumentDeduplicator extends Deduplicator and computes an MD5 hash of each sample's text content. Before hashing, it can optionally convert the text to lowercase and/or remove non-alphabet characters (whitespace, digits, punctuation) using a compiled regex. The compute_hash method strips and encodes the text, then computes and stores its MD5 hex digest. The process method iterates through the dataset and keeps only the first occurrence of each unique hash, tracking the hashes seen so far in a growing set for average O(1) membership tests. When the show_num parameter is greater than 0, it also collects a sample of duplicate pairs for tracing and inspection.
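The hashing and first-occurrence filtering described above can be sketched in plain Python. This is a minimal illustration, not the operator's source: the function names mirror the description but are standalone stand-ins, and the NON_CHARACTER pattern (whitespace, digits, ASCII punctuation) is an assumption about what the compiled regex matches.

```python
import hashlib
import re
import string

# Assumed pattern: whitespace, digits, and ASCII punctuation.
# The regex compiled by the real operator may differ.
NON_CHARACTER = re.compile(r"\s+|\d+|[" + re.escape(string.punctuation) + r"]")


def compute_hash(text, lowercase=False, ignore_non_character=False):
    """Return the MD5 hex digest of the optionally normalized text."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = NON_CHARACTER.sub("", text)
    return hashlib.md5(text.strip().encode("utf-8")).hexdigest()


def deduplicate(texts, **options):
    """Keep only the first text seen for each distinct hash."""
    seen = set()   # hashes encountered so far
    kept = []
    for text in texts:
        digest = compute_hash(text, **options)
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


docs = ["Hello, World!", "hello world", "Hello, World!", "Goodbye"]
exact = deduplicate(docs)
normalized = deduplicate(docs, lowercase=True, ignore_non_character=True)
print(exact)       # ['Hello, World!', 'hello world', 'Goodbye']
print(normalized)  # ['Hello, World!', 'Goodbye']
```

With both options enabled, "Hello, World!" and "hello world" normalize to the same string and therefore share a hash, so only the first of them survives.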

Usage

Use this operator to remove exact duplicate documents from text datasets as a baseline data-cleaning step. It is the simplest and fastest deduplication approach, but it only matches documents whose text (after the optional preprocessing) is identical.

Code Reference

Source Location

Signature

@OPERATORS.register_module("document_deduplicator")
class DocumentDeduplicator(Deduplicator):
    def __init__(self, lowercase: bool = False,
                 ignore_non_character: bool = False,
                 *args, **kwargs):

Import

from data_juicer.ops.deduplicator.document_deduplicator import DocumentDeduplicator

I/O Contract

Inputs

Name                  Type  Required  Description
lowercase             bool  No        Whether to convert sample text to lowercase before hashing. Default: False
ignore_non_character  bool  No        Whether to ignore non-alphabet characters (whitespace, digits, punctuation). Default: False

Outputs

Name       Type     Description
dataset    Dataset  Deduplicated dataset with only the first occurrence of each unique document retained
dup_pairs  dict     Dictionary of sampled duplicate pairs (when show_num > 0)
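To illustrate how the two outputs relate, the following sketch deduplicates a list of samples from precomputed hashes and optionally traces duplicates. It is a hypothetical stand-in, not the operator's code: the function name dedup_with_trace, the layout of dup_pairs (keyed by hash, holding up to two samples each), and the reading of show_num as a cap on the number of traced groups are all assumptions.

```python
from collections import Counter


def dedup_with_trace(samples, hashes, show_num=0):
    """Keep first occurrences; optionally trace up to show_num duplicate groups."""
    dup_pairs = {}
    if show_num > 0:
        counts = Counter(hashes)
        # Hashes that occur more than once, capped at show_num groups (assumed).
        dup_hashes = [h for h, c in counts.items() if c > 1][:show_num]
        dup_pairs = {h: [] for h in dup_hashes}

    seen = set()
    kept = []
    for sample, h in zip(samples, hashes):
        # Record at most two samples per traced hash, forming one pair.
        if h in dup_pairs and len(dup_pairs[h]) < 2:
            dup_pairs[h].append(sample)
        if h not in seen:
            seen.add(h)
            kept.append(sample)
    return kept, dup_pairs


samples = ["a", "b", "a", "c", "b", "a"]
hashes = ["h1", "h2", "h1", "h3", "h2", "h1"]
kept, dup_pairs = dedup_with_trace(samples, hashes, show_num=1)
print(kept)       # ['a', 'b', 'c']
print(dup_pairs)  # {'h1': ['a', 'a']}
```

When show_num is 0, the sketch skips the counting pass and returns an empty dictionary alongside the deduplicated samples.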

Usage Examples

process:
  - document_deduplicator:
      lowercase: true
      ignore_non_character: true
