Implementation:Datajuicer Data juicer RemoveRepeatSentencesMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete Data-Juicer operator that removes duplicate sentences within individual text samples.
Description
RemoveRepeatSentencesMapper is a mapper operator that deduplicates repeated sentences within individual text samples. It splits each sample into lines and then into sentences using punctuation-based regex splitting, and tracks the sentences already seen in a hash set, optionally normalizing case and stripping special characters before comparison. For sentences at least min_repeat_sentence_length long, only the first occurrence is kept; shorter sentences are never deduplicated and are always preserved. The operator runs in batched mode.
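The flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the operator's actual source: the function name `remove_repeat_sentences`, the simplified sentence splitter `_SENT_END`, and the definition of "special characters" in `_SPECIAL` are all hypothetical stand-ins, and the real regexes may differ.

```python
import re

# Assumption: split right after common sentence-ending punctuation.
# The real operator's splitting regexes may differ.
_SENT_END = re.compile(r"(?<=[.!?。！？])")

# Assumption: "special characters" are approximated as anything that is
# neither a word character nor whitespace.
_SPECIAL = re.compile(r"[^\w\s]")


def remove_repeat_sentences(
    text: str,
    lowercase: bool = False,
    ignore_special_character: bool = True,
    min_repeat_sentence_length: int = 2,
) -> str:
    """Keep only the first occurrence of each sufficiently long sentence."""
    seen = set()  # normalized sentences already emitted for this sample
    out_lines = []
    for line in text.split("\n"):
        kept = ""
        for sentence in _SENT_END.split(line):
            # Build the comparison key; the original sentence is what gets kept.
            key = sentence.strip()
            if lowercase:
                key = key.lower()
            if ignore_special_character:
                key = _SPECIAL.sub("", key)

            if len(key) < min_repeat_sentence_length:
                kept += sentence  # too short: never deduplicated
            elif key not in seen:
                seen.add(key)
                kept += sentence  # first occurrence: keep
            # otherwise: a repeat of an earlier sentence, so drop it
        out_lines.append(kept)
    return "\n".join(out_lines)


sample = "Terms apply. Hello world. Terms apply.\nOK. OK. Terms apply!"
print(remove_repeat_sentences(sample))
```

In this sketch the set of seen sentences is shared across all lines of one sample, so a sentence repeated on a later line is dropped as well. Comparison uses the normalized key, while the output keeps the original text of the first occurrence.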
Usage
Use when cleaning web-scraped or noisy text data that contains boilerplate, repeated disclaimers, or duplicated content that degrades training data quality.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/remove_repeat_sentences_mapper.py
Signature
@OPERATORS.register_module("remove_repeat_sentences_mapper")
class RemoveRepeatSentencesMapper(Mapper):
    def __init__(
        self,
        lowercase: bool = False,
        ignore_special_character: bool = True,
        min_repeat_sentence_length: int = 2,
        *args,
        **kwargs,
    ):
Import
from data_juicer.ops.mapper.remove_repeat_sentences_mapper import RemoveRepeatSentencesMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lowercase | bool | No | Whether to convert text to lower case for comparison (default: False) |
| ignore_special_character | bool | No | Whether to ignore special characters when judging repeated sentences (default: True) |
| min_repeat_sentence_length | int | No | Minimum sentence length for deduplication to apply; shorter sentences are never removed (default: 2) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with repeated sentences removed |
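To make the batched contract concrete, the following sketch shows one plausible shape for the input and output: a column-oriented dict in which each key maps to a list with one entry per sample. The helper `apply_batched`, the `"text"` key, the `"source"` column, and the stand-in transform `drop_exact_repeats` are assumptions for illustration only; they are not taken from the Data-Juicer source.

```python
from typing import Callable, Dict, List


def apply_batched(
    samples: Dict[str, List],
    transform: Callable[[str], str],
    text_key: str = "text",  # assumption: the text column is named "text"
) -> Dict[str, List]:
    """Rewrite every text in a column-oriented batch and return the batch."""
    for idx, text in enumerate(samples[text_key]):
        samples[text_key][idx] = transform(text)
    return samples


def drop_exact_repeats(text: str) -> str:
    """Stand-in transform: naive exact-match dedup of '. '-separated parts."""
    seen, kept = set(), []
    for part in text.split(". "):
        if part not in seen:
            seen.add(part)
            kept.append(part)
    return ". ".join(kept)


batch = {
    "text": ["Buy now. Buy now. Limited offer.", "No repeats here."],
    "source": ["ad", "blog"],  # hypothetical extra column, left untouched
}
result = apply_batched(batch, drop_exact_repeats)
print(result["text"])
```

Only the text column is rewritten; other columns and the number of samples stay the same, which matches a mapper that transforms samples rather than filtering them.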
Usage Examples
process:
  - remove_repeat_sentences_mapper:
      lowercase: false
      ignore_special_character: true
      min_repeat_sentence_length: 2