Implementation:LLMBook zh LLMBook zh github io CleanerDedupLineByNgram Clean Single Text

Knowledge Sources	LLMBook-zh
Domains	NLP, Data_Preprocessing
Last Updated	2026-02-08 00:00 GMT

Overview

A concrete tool from the LLMBook-zh repository that removes near-duplicate adjacent lines using n-gram Jaccard similarity.

Description

The CleanerDedupLineByNgram class splits text into lines on "\n", computes the set of n-grams for each line, and removes any line whose Jaccard similarity with its predecessor exceeds a threshold. To build the n-grams, each line is first split into tokens on English punctuation, Chinese punctuation, and spaces.
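
The steps described above can be sketched as a few standalone functions. This is a minimal illustration rather than the repository's code: the names line_ngrams, jaccard, and dedup_adjacent_lines, the exact delimiter set, the fallback to a shorter gram for short lines, and the choice to compare each line against the most recently kept line are all assumptions made for the example.

```python
import re
import string

# Assumption: an illustrative delimiter set, not copied from the repository.
CJK_PUNCT = "，。！？：；、（）《》【】“”‘’"
GRAM_DELIMS = list(string.punctuation) + list(CJK_PUNCT) + [" "]
_SPLIT_RE = re.compile("|".join(map(re.escape, GRAM_DELIMS)))


def line_ngrams(line: str, n: int) -> set:
    """Split a line into tokens on the delimiters and return its set of n-grams."""
    tokens = [t for t in _SPLIT_RE.split(line) if t]
    # Assumption: a line with fewer than n tokens yields one shorter gram.
    size = min(n, len(tokens))
    if size == 0:
        return set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dedup_adjacent_lines(text: str, n: int = 5, thre_sim: float = 0.95) -> str:
    """Drop a line when its similarity to the last kept line exceeds thre_sim."""
    kept = []
    prev = None  # n-gram set of the most recently kept line
    for line in text.split("\n"):
        if not line:
            continue  # empty lines carry no n-grams and are skipped here
        grams = line_ngrams(line, n)
        if prev is None or jaccard(prev, grams) <= thre_sim:
            kept.append(line)
            prev = grams
    return "\n".join(kept)


sample = "\n".join([
    "This is the first line of content.",
    "This is the first line of content.",
    "This is a different line entirely.",
    "This is the first line of content.",
])
print(dedup_adjacent_lines(sample))
# This is the first line of content.
# This is a different line entirely.
# This is the first line of content.
```

Only the second line is dropped: it is identical to its neighbor, so the similarity is 1.0. The fourth line repeats the first but survives, because the comparison is limited to adjacent lines.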

Usage

Import this class when you need to remove near-duplicate adjacent lines from text passages during data preprocessing, after quality filtering and before privacy masking.

Code Reference

Source Location

Repository: LLMBook-zh
File: code/4.2 去重.py
Lines: 5-35

Signature

class CleanerDedupLineByNgram:
    def __init__(self):
        """
        Initializes line and gram delimiters.
        line_delimiter: ["\n"]
        gram_delimiter: English punctuation + Chinese punctuation + space
        """

    def clean_single_text(self, text: str, n: int = 5, thre_sim: float = 0.95) -> str:
        """
        Removes near-duplicate adjacent lines using n-gram Jaccard similarity.

        Args:
            text: Multi-line input text.
            n: N-gram size (default 5).
            thre_sim: Jaccard similarity threshold (default 0.95).

        Returns:
            Deduplicated text with near-duplicate adjacent lines removed.
        """

Import

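# Assumes the contents of "code/4.2 去重.py" have been saved under an importable
# module name such as dedup.py; the original filename cannot be imported directly.
from dedup import CleanerDedupLineByNgram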

I/O Contract

Inputs

Name	Type	Required	Description
text	str	Yes	Multi-line text passage to deduplicate
n	int	No	N-gram size (default 5)
thre_sim	float	No	Jaccard similarity threshold (default 0.95)
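
To give a feel for how n and thre_sim interact, here is a small worked calculation. It assumes n-grams are taken over tokens, as the Description suggests, and uses two hypothetical lines of 10 tokens each that differ only in their final token.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}

% With n = 5, each 10-token line yields 10 - 5 + 1 = 6 n-grams.
% Only the last n-gram contains the changed token, so 5 n-grams are shared
% and each line contributes 1 n-gram of its own:
|A \cap B| = 5, \qquad |A \cup B| = 5 + 1 + 1 = 7

J(A, B) = \frac{5}{7} \approx 0.714 < 0.95
```

Under the default settings such a pair would both be kept, so the default threshold of 0.95 targets lines that are almost exactly identical. A lower thre_sim or a smaller n makes the filter more aggressive.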

Outputs

Name	Type	Description
return	str	Deduplicated text with near-duplicate adjacent lines removed

Usage Examples

from dedup import CleanerDedupLineByNgram

deduper = CleanerDedupLineByNgram()

text = """This is the first line of content.
This is the first line of content.
This is a different line entirely.
This is the first line of content."""

cleaned = deduper.clean_single_text(text, n=5, thre_sim=0.95)
print(cleaned)
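# Expected, given the behavior described above: the second line (identical to the
# first) is removed; the fourth line is kept because its predecessor is different.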
