Implementation:Datajuicer Data juicer DocumentDeduplicator
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Deduplication |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete Data-Juicer operator for exact document-level deduplication based on MD5 hashing.
Description
DocumentDeduplicator extends Deduplicator and computes an MD5 hash of each sample's text content. Before hashing, it can optionally convert the text to lowercase and/or remove non-alphabet characters (whitespace, digits, punctuation) via a compiled regex. The compute_hash method computes and stores the MD5 hex digest of the stripped, encoded text. The process method iterates through the dataset and keeps only the first occurrence of each unique hash, tracking the hashes seen so far in a growing set for average-case O(1) membership tests. When the show_num parameter is greater than 0, it also collects sample duplicate pairs for tracing and inspection.
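The hashing and first-occurrence filtering described above can be illustrated with a minimal standalone sketch. It is not the operator's source code: the helper names `compute_md5` and `keep_first_occurrences` are hypothetical, the regex is an assumed reading of "whitespace, digits, punctuation" (ASCII punctuation only), and UTF-8 is an assumed encoding.

```python
import hashlib
import re
import string

# Assumed pattern: whitespace runs, digit runs, or any ASCII punctuation mark.
NON_CHARACTER_RE = re.compile(rf"\s+|\d+|[{re.escape(string.punctuation)}]")


def compute_md5(text: str, lowercase: bool = False,
                ignore_non_character: bool = False) -> str:
    """Return the MD5 hex digest of the optionally normalized text."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        text = NON_CHARACTER_RE.sub("", text)
    # Strip surrounding whitespace, then hash the encoded bytes.
    return hashlib.md5(text.strip().encode("utf-8")).hexdigest()


def keep_first_occurrences(texts, **hash_kwargs):
    """Keep only the first text seen for each distinct hash, preserving order."""
    seen = set()  # grows as new hashes appear; membership tests are O(1) on average
    kept = []
    for text in texts:
        digest = compute_md5(text, **hash_kwargs)
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


docs = ["Hello, World!", "hello world", "Hello, World!", "Other"]
print(keep_first_occurrences(docs))
# ['Hello, World!', 'hello world', 'Other']
print(keep_first_occurrences(docs, lowercase=True, ignore_non_character=True))
# ['Hello, World!', 'Other']
```

With both flags off, only byte-identical documents (after stripping surrounding whitespace) collapse. With both flags on, documents that differ only in case, spacing, digits, or punctuation hash to the same value, so the second document is dropped as well.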
Usage
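Use this operator to remove exact duplicate documents from text datasets as a baseline data-cleaning step. Because it only compares hashes of (optionally normalized) text, it is the simplest and fastest deduplication approach, but it does not detect near-duplicates.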
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/deduplicator/document_deduplicator.py
Signature
@OPERATORS.register_module("document_deduplicator")
class DocumentDeduplicator(Deduplicator):
    def __init__(self, lowercase: bool = False,
                 ignore_non_character: bool = False,
                 *args, **kwargs):
Import
from data_juicer.ops.deduplicator.document_deduplicator import DocumentDeduplicator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lowercase | bool | No | Whether to convert sample text to lower case before hashing. Default: False |
| ignore_non_character | bool | No | Whether to remove non-alphabet characters (whitespace, digits, punctuation) before hashing. Default: False |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Deduplicated dataset with only the first occurrence of each unique document retained |
| dup_pairs | dict | Dictionary of sampled duplicate pairs (when show_num > 0) |
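To make the two outputs concrete, here is a hedged sketch of a function that returns both a deduplicated list and a dictionary of traced duplicates. Several details are assumptions for illustration rather than confirmed behavior of the operator: the function name `dedup_with_trace`, the `"text"` key, the shape of `dup_pairs` (hash mapped to at most two samples), and the choice to trace the most frequently repeated hashes first.

```python
import hashlib
from collections import Counter


def dedup_with_trace(samples, show_num=0, text_key="text"):
    """Return (kept_samples, dup_pairs) for a list of dict samples."""
    hashes = [
        hashlib.md5(s[text_key].strip().encode("utf-8")).hexdigest()
        for s in samples
    ]

    # Assumption: trace up to show_num hashes that occur more than once,
    # choosing the most frequently repeated ones first.
    traced = set()
    if show_num > 0:
        counts = Counter(hashes)
        repeated = [h for h, n in counts.most_common() if n > 1]
        traced = set(repeated[:show_num])

    # Assumption: each traced hash maps to its first two matching samples.
    dup_pairs = {h: [] for h in traced}
    seen, kept = set(), []
    for sample, h in zip(samples, hashes):
        if h in traced and len(dup_pairs[h]) < 2:
            dup_pairs[h].append(sample)
        if h not in seen:  # first occurrence of this hash: keep it
            seen.add(h)
            kept.append(sample)
    return kept, dup_pairs


samples = [{"text": "a"}, {"text": "b"}, {"text": "a"}, {"text": "a"}, {"text": "b"}]
kept, dup_pairs = dedup_with_trace(samples, show_num=1)
print(kept)                      # [{'text': 'a'}, {'text': 'b'}]
print(list(dup_pairs.values()))  # [[{'text': 'a'}, {'text': 'a'}]]
```

When `show_num` is 0 the sketch skips tracing entirely and returns an empty dictionary, which mirrors the "when show_num > 0" condition in the table above.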
Usage Examples
process:
  - document_deduplicator:
      lowercase: true
      ignore_non_character: true