Implementation:Datajuicer Data juicer RemoveRepeatSentencesMapper

From Leeroopedia
Knowledge Sources
Domains: Data_Processing, Mapping
Last Updated: 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for removing duplicate sentences within text samples.

Description

RemoveRepeatSentencesMapper is a mapper operator that removes repeated sentences within each individual text sample. It splits the text into lines, then splits each line into sentences with a punctuation-based regex. Each sentence is normalized (optionally lowercased and stripped of special characters) and checked against a hash set of the sentences already seen. Only the first occurrence of a sentence is kept, with one exception: sentences shorter than min_repeat_sentence_length are never deduplicated, so their repeats are kept too. The operator runs in batched mode.
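
The steps described above can be sketched in plain Python. This is a simplified illustration and not Data-Juicer's actual implementation: the function name dedup_sentences, the sentence-splitting regex, and the definition of "special character" are all assumptions made for the example, and the handling of a sentence whose length exactly equals the threshold follows the wording above ("shorter than") rather than verified library behavior.

```python
import re

# Assumed sentence boundary: split right after common end-of-sentence marks.
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])")
# Assumed "special characters": anything that is not a letter, digit,
# common CJK character, or whitespace.
_SPECIAL = re.compile(r"[^A-Za-z0-9\u4e00-\u9fa5\s]")


def dedup_sentences(
    text,
    lowercase=False,
    ignore_special_character=True,
    min_repeat_sentence_length=2,
):
    """Keep only the first occurrence of each sentence in one text sample."""
    seen = set()  # normalized sentences already emitted for this sample
    out_lines = []
    for line in text.split("\n"):
        kept = []
        for sentence in _SENTENCE_END.split(line):
            if not sentence:
                continue  # trailing empty piece produced by the split
            # Build the comparison key; the original sentence is what we emit.
            key = sentence.strip()
            if lowercase:
                key = key.lower()
            if ignore_special_character:
                key = _SPECIAL.sub("", key)
            if key not in seen:
                seen.add(key)
                kept.append(sentence)
            elif len(key) < min_repeat_sentence_length:
                # Too short to deduplicate: keep the repeat as well.
                kept.append(sentence)
        out_lines.append("".join(kept))
    return "\n".join(out_lines)


sample = "Terms apply. Buy now! Terms apply."
print(dedup_sentences(sample))  # Terms apply. Buy now!
```

Normalization affects only the comparison key, so the surviving sentences keep their original casing and punctuation.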

Usage

Use when cleaning web-scraped or noisy text data that contains boilerplate, repeated disclaimers, or duplicated content that degrades training data quality.

Code Reference

Source Location

Signature

@OPERATORS.register_module("remove_repeat_sentences_mapper")
class RemoveRepeatSentencesMapper(Mapper):
    def __init__(
        self,
        lowercase: bool = False,
        ignore_special_character: bool = True,
        min_repeat_sentence_length: int = 2,
        *args,
        **kwargs,
    ):

Import

from data_juicer.ops.mapper.remove_repeat_sentences_mapper import RemoveRepeatSentencesMapper

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| lowercase | bool | No | Whether to convert text to lower case for comparison (default: False) |
| ignore_special_character | bool | No | Whether to ignore special characters when judging repeated sentences (default: True) |
| min_repeat_sentence_length | int | No | Minimum sentence length for deduplication to apply (default: 2) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| samples | Dict | Transformed samples with repeated sentences removed |
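
To make the batched contract more concrete, the sketch below shows one plausible shape for the samples dict: a column-oriented mapping from field name to a list of values, with the texts under a "text" key. Both the layout and the key name are assumptions for illustration. apply_batched and drop_repeated_lines are hypothetical helpers, and the latter is only a stand-in transform, not the operator's sentence-level logic.

```python
from typing import Callable, Dict, List


def drop_repeated_lines(text: str) -> str:
    """Stand-in transform (hypothetical): drop exact repeated lines."""
    seen = set()
    kept = []
    for line in text.split("\n"):
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


def apply_batched(
    samples: Dict[str, List],
    fn: Callable[[str], str],
    text_key: str = "text",  # assumed field name for the text column
) -> Dict[str, List]:
    """Rewrite every text in the batch in place; other columns are untouched."""
    for idx, text in enumerate(samples[text_key]):
        samples[text_key][idx] = fn(text)
    return samples


batch = {"text": ["a\nb\na", "x\nx"], "meta": [1, 2]}
print(apply_batched(batch, drop_repeated_lines))
```

Each text is processed independently, so a sentence repeated across two different samples in the same batch would not count as a duplicate under this reading.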

Usage Examples

process:
  - remove_repeat_sentences_mapper:
      lowercase: false
      ignore_special_character: true
      min_repeat_sentence_length: 2
