Implementation:LLMBook zh LLMBook zh github io CleanerDedupLineByNgram Clean Single Text

From Leeroopedia


Knowledge Sources
Domains: NLP, Data_Preprocessing
Last Updated: 2026-02-08 00:00 GMT

Overview

A concrete tool, provided by the LLMBook repository, for deduplicating adjacent lines by n-gram Jaccard similarity.

Description

The CleanerDedupLineByNgram class splits text into lines, computes the set of n-grams for each line, and removes any line whose Jaccard similarity with its predecessor exceeds a threshold. For n-gram extraction, it treats English punctuation, Chinese punctuation, and spaces as token delimiters.
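The similarity measure described above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names `line_ngrams` and `jaccard`, the exact delimiter list, and the handling of lines shorter than n tokens are all assumptions made for the example.

```python
import re
import string

# Assumed delimiter set: ASCII punctuation, a few common Chinese punctuation
# marks, and the space character. The repository's exact list may differ.
CHINESE_PUNCTUATION = "，。！？：；、（）《》【】"
GRAM_DELIMITERS = string.punctuation + CHINESE_PUNCTUATION + " "
_SPLIT_RE = re.compile("[" + re.escape(GRAM_DELIMITERS) + "]+")


def line_ngrams(line: str, n: int = 5) -> set:
    """Return the set of token n-grams for one line (hypothetical helper)."""
    tokens = [t for t in _SPLIT_RE.split(line) if t]
    # Assumption: a line with fewer than n tokens yields one shorter gram.
    size = min(n, len(tokens))
    if size == 0:
        return set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


a = line_ngrams("Deep learning models need clean, deduplicated training data.")
b = line_ngrams("Deep learning models need clean deduplicated training data!")
c = line_ngrams("Deep learning models need clean, deduplicated evaluation data.")

print(jaccard(a, b))            # 1.0: the lines differ only in punctuation
print(round(jaccard(a, c), 3))  # 0.333: one changed token breaks two of four 5-grams
```

Because punctuation only delimits tokens, lines that differ solely in punctuation or spacing compare as identical under this sketch, while a single changed word lowers the similarity sharply.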

Usage

Import this class when you need to remove near-duplicate adjacent lines from text passages during data preprocessing, after quality filtering and before privacy masking.

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/4.2 去重.py ("去重" means "deduplication")
  • Lines: 5-35

Signature

class CleanerDedupLineByNgram:
    def __init__(self):
        """
        Initializes line and gram delimiters.
        line_delimiter: ["\n"]
        gram_delimiter: English punctuation + Chinese punctuation + space
        """

    def clean_single_text(self, text: str, n: int = 5, thre_sim: float = 0.95) -> str:
        """
        Removes near-duplicate adjacent lines using n-gram Jaccard similarity.

        Args:
            text: Multi-line input text.
            n: N-gram size (default 5).
            thre_sim: Jaccard similarity threshold (default 0.95).

        Returns:
            Deduplicated text with near-duplicate adjacent lines removed.
        """

Import

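# Assumes the source file has been copied or renamed to dedup.py on the import path;
# the repository filename "4.2 去重.py" cannot be loaded with a plain import statement.
from dedup import CleanerDedupLineByNgram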

I/O Contract

Inputs

Name | Type | Required | Description
text | str | Yes | Multi-line text passage to deduplicate
n | int | No | N-gram size (default 5)
thre_sim | float | No | Jaccard similarity threshold (default 0.95)

Outputs

Name | Type | Description
return | str | Deduplicated text with near-duplicate adjacent lines removed
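To give a feel for how n and thre_sim interact, the sketch below compares two lines that differ in a single word under several settings. It is an illustration only: the `ngram_set` and `similarity` helpers are hypothetical, tokens are split on whitespace for brevity, and the rule "drop when the similarity exceeds thre_sim" follows the wording of the Description rather than a reading of the source.

```python
def ngram_set(tokens, n):
    """Set of token n-grams; shorter lines fall back to one shorter gram (assumption)."""
    size = min(n, len(tokens))
    if size == 0:
        return set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def similarity(line_a, line_b, n):
    """Jaccard similarity of the two lines' n-gram sets (whitespace tokens only)."""
    a, b = ngram_set(line_a.split(), n), ngram_set(line_b.split(), n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


prev = "the quick brown fox jumps over the lazy dog today"
curr = "the quick brown fox leaps over the lazy dog today"

for n in (1, 3, 5):
    sim = similarity(prev, curr, n)
    for thre_sim in (0.5, 0.95):
        decision = "drop" if sim > thre_sim else "keep"
        print(f"n={n} thre_sim={thre_sim} similarity={sim:.3f} -> {decision}")
```

For this pair the similarity is 0.800 with n=1, 0.455 with n=3, and 0.091 with n=5, so only the combination n=1, thre_sim=0.5 drops the second line. Larger n makes the comparison stricter, and with the defaults (n=5, thre_sim=0.95) a one-word change is enough to keep a line.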

Usage Examples

from dedup import CleanerDedupLineByNgram

deduper = CleanerDedupLineByNgram()

text = """This is the first line of content.
This is the first line of content.
This is a different line entirely.
This is the first line of content."""

cleaned = deduper.clean_single_text(text, n=5, thre_sim=0.95)
print(cleaned)
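# Expected, given the behaviour in the Description: the second line (an exact
# duplicate of the first) is dropped; the fourth is kept because its
# predecessor is a different line.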
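For readers without the repository at hand, the following self-contained stand-in reproduces the adjacent-line behaviour on the same input. It is a sketch under stated assumptions, not the class itself: the function name `dedup_adjacent_lines` is hypothetical, the delimiter list is abbreviated, empty lines are skipped, and each line is compared with the most recently kept line.

```python
import re
import string

# Assumed delimiters: ASCII punctuation, some Chinese punctuation, and space.
_DELIMS = re.compile("[" + re.escape(string.punctuation + "，。！？：；、 ") + "]+")


def dedup_adjacent_lines(text: str, n: int = 5, thre_sim: float = 0.95) -> str:
    """Drop a line when its n-gram Jaccard similarity to the last kept line exceeds thre_sim."""
    kept, last_grams = [], None
    for line in text.split("\n"):
        if not line:
            continue  # Assumption: empty lines are skipped.
        tokens = [t for t in _DELIMS.split(line) if t]
        size = min(n, len(tokens))  # Assumption: short lines use a shorter gram.
        grams = (
            {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}
            if size else set()
        )
        if last_grams is not None:
            union = grams | last_grams
            sim = len(grams & last_grams) / len(union) if union else 0.0
            if sim > thre_sim:
                continue  # Near-duplicate of the previous kept line.
        kept.append(line)
        last_grams = grams
    return "\n".join(kept)


text = """This is the first line of content.
This is the first line of content.
This is a different line entirely.
This is the first line of content."""

result = dedup_adjacent_lines(text)
print(result)
```

The sketch prints three lines: the repeated second line is removed, and the final line survives because only adjacent lines are compared.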

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
