Implementation:Datajuicer Data juicer DocumentDeduplicator

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Quality, Deduplication
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for exact document-level deduplication based on MD5 hashing.

Description

DocumentDeduplicator extends Deduplicator and computes an MD5 hash of each sample's text content. Before hashing, it can optionally convert the text to lowercase and/or remove non-alphabet characters (whitespace, digits, punctuation) using a compiled regex. The compute_hash method strips the resulting text, encodes it, and stores its MD5 hex digest on the sample. The process method then iterates through the dataset and keeps only the first occurrence of each unique hash, tracking the hashes seen so far in a growing set so that each membership test takes O(1) time on average. When the show_num parameter is greater than 0, it also collects sample duplicate pairs for tracing and inspection.
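The steps described above can be illustrated with a minimal standalone sketch. It is not the Data-Juicer source: the helper names text_hash and keep_first_occurrences are hypothetical, and the regex pattern and the UTF-8 encoding are assumptions based on the description (whitespace, digits, punctuation).

```python
import hashlib
import re
import string

# Assumed pattern: runs of whitespace, runs of digits, or any ASCII punctuation.
NON_CHARACTER_RE = re.compile(rf"\s+|\d+|[{re.escape(string.punctuation)}]")


def text_hash(text, lowercase=False, ignore_non_character=False):
    """Return the MD5 hex digest of the optionally normalized text."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = NON_CHARACTER_RE.sub("", text)
    # Strip surrounding whitespace, then hash the encoded bytes (UTF-8 assumed).
    return hashlib.md5(text.strip().encode("utf-8")).hexdigest()


def keep_first_occurrences(texts, **options):
    """Keep the first text seen for each hash, preserving input order."""
    seen = set()  # hashes encountered so far
    kept = []
    for text in texts:
        digest = text_hash(text, **options)
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


docs = ["Hello, World!", "hello world", "Hello, World!", "Goodbye"]
exact_only = keep_first_occurrences(docs)
normalized = keep_first_occurrences(
    docs, lowercase=True, ignore_non_character=True
)
print(exact_only)   # the repeated "Hello, World!" is dropped
print(normalized)   # "hello world" now also collides with "Hello, World!"
```

With both options enabled, "Hello, World!" and "hello world" normalize to the same string, so only the first of them survives. With the defaults, only byte-identical texts (after stripping) are treated as duplicates.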

Usage

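Use this operator to remove exact duplicate documents from text datasets as a baseline data-cleaning step. Because it only compares hashes of the (optionally normalized) text, it is the simplest and typically the fastest deduplication approach, but it does not detect near-duplicates that differ in wording.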

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/deduplicator/document_deduplicator.py

Signature

@OPERATORS.register_module("document_deduplicator")
class DocumentDeduplicator(Deduplicator):
    def __init__(self, lowercase: bool = False,
                 ignore_non_character: bool = False,
                 *args, **kwargs):

Import

from data_juicer.ops.deduplicator.document_deduplicator import DocumentDeduplicator

I/O Contract

Inputs

Name	Type	Required	Description
lowercase	bool	No	Whether to convert sample text to lower case before hashing. Default: False
ignore_non_character	bool	No	Whether to ignore non-alphabet characters (whitespace, digits, punctuation) when computing the hash. Default: False

Outputs

Name	Type	Description
dataset	Dataset	Deduplicated dataset with only the first occurrence of each unique document retained
dup_pairs	dict	Dictionary of sampled duplicate pairs (when show_num > 0)
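The dup_pairs output above can be pictured with a small sketch. The exact structure is an assumption made here for illustration: a dictionary keyed by hash whose values hold up to two samples sharing that hash, limited to show_num hashes and preferring the largest duplicate groups. The function name sample_duplicate_pairs is hypothetical and is not part of Data-Juicer.

```python
from collections import defaultdict


def sample_duplicate_pairs(hashes, samples, show_num):
    """Return {hash: [first, second]} for up to show_num duplicated hashes."""
    if show_num <= 0:
        return {}
    # Group sample indices by their hash value.
    ids_by_hash = defaultdict(list)
    for idx, digest in enumerate(hashes):
        ids_by_hash[digest].append(idx)
    # Keep only hashes seen more than once, largest groups first (assumed order).
    duplicated = sorted(
        (h for h, ids in ids_by_hash.items() if len(ids) > 1),
        key=lambda h: len(ids_by_hash[h]),
        reverse=True,
    )
    # Record the first two samples of each selected group as the "pair".
    return {
        h: [samples[i] for i in ids_by_hash[h][:2]]
        for h in duplicated[:show_num]
    }


hashes = ["a", "b", "a", "c", "b", "a"]
samples = ["doc0", "doc1", "doc2", "doc3", "doc4", "doc5"]
pairs = sample_duplicate_pairs(hashes, samples, show_num=1)
print(pairs)
```

Here hash "a" has three occurrences and "b" has two, so with show_num=1 only the pair for "a" is reported. A pair like this is useful for spot-checking that the normalization options are not merging documents that should stay distinct.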

Usage Examples

process:
  - document_deduplicator:
      lowercase: true
      ignore_non_character: true

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
