Implementation:LLMBook zh LLMBook zh github io CleanerDedupLineByNgram Clean Single Text
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Preprocessing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete tool, provided by the LLMBook repository, for removing near-duplicate adjacent lines using n-gram Jaccard similarity.
Description
The CleanerDedupLineByNgram class splits text into lines, computes the set of n-grams for each line, and removes any line whose Jaccard similarity with its predecessor exceeds a threshold. English punctuation, Chinese punctuation, and spaces all serve as token delimiters for n-gram extraction.
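To make the similarity measure concrete, the following is a minimal sketch of n-gram extraction and Jaccard similarity for two lines. It is not the repository code: the helper names `line_ngrams` and `jaccard`, the exact delimiter characters, and the fallback for lines shorter than `n` tokens are all assumptions made for illustration.

```python
import re
import string

# Assumed delimiter set: ASCII punctuation, a few common Chinese punctuation
# marks, and space. The characters used by the repository class may differ.
GRAM_DELIMITERS = list(string.punctuation) + list("，。！？：；、（）《》") + [" "]
_SPLIT_RE = re.compile("|".join(map(re.escape, GRAM_DELIMITERS)))


def line_ngrams(line: str, n: int = 5) -> set:
    """Return the set of n-grams (tuples of tokens) for one line."""
    tokens = [t for t in _SPLIT_RE.split(line) if t]
    size = min(n, len(tokens))  # assumption: short lines yield one shorter gram
    if size == 0:
        return set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


a = line_ngrams("the quick brown fox jumps over the lazy dog")
b = line_ngrams("the quick brown fox jumps over the lazy cat")
similarity = jaccard(a, b)
print(similarity)  # 4 shared 5-grams out of 6 distinct ones, about 0.67
```

Under these assumptions, two lines that differ only in their last token still share most of their 5-grams, yet fall well below the default threshold of 0.95, so only very close matches would be treated as duplicates.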
Usage
Import this class when you need to remove near-duplicate adjacent lines from text passages during data preprocessing, after quality filtering and before privacy masking.
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/4.2 去重.py
- Lines: 5-35
Signature
class CleanerDedupLineByNgram:
    def __init__(self):
        """
        Initializes line and gram delimiters.
        line_delimiter: ["\n"]
        gram_delimiter: English punctuation + Chinese punctuation + space
        """

    def clean_single_text(self, text: str, n: int = 5, thre_sim: float = 0.95) -> str:
        """
        Removes near-duplicate adjacent lines using n-gram Jaccard similarity.
        Args:
            text: Multi-line input text.
            n: N-gram size (default 5).
            thre_sim: Jaccard similarity threshold (default 0.95).
        Returns:
            Deduplicated text with near-duplicate adjacent lines removed.
        """
Import
# Assumes the source file has been saved as dedup.py somewhere on the import path.
from dedup import CleanerDedupLineByNgram
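The source file name listed above, `4.2 去重.py`, contains a space and a dot, so it cannot be named in a plain `import` statement. One possible workaround, sketched here with the standard-library `importlib.util` API, is to load the file by path. The helper name `load_module_from_path` and the module name `dedup` are hypothetical, and the commented path assumes a local checkout of the repository.

```python
import importlib.util
from pathlib import Path


def load_module_from_path(path: str, module_name: str = "dedup"):
    """Load a Python source file as a module, whatever its file name is."""
    spec = importlib.util.spec_from_file_location(module_name, Path(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module


# Assumed usage, with the path relative to a local checkout of the repository:
# dedup = load_module_from_path("code/4.2 去重.py")
# CleanerDedupLineByNgram = dedup.CleanerDedupLineByNgram
```

Copying the file to `dedup.py` is the simpler option; loading by path avoids keeping a renamed copy in sync with the repository.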
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | Multi-line text passage to deduplicate |
| n | int | No | N-gram size (default 5) |
| thre_sim | float | No | Jaccard similarity threshold (default 0.95) |
Outputs
| Name | Type | Description |
|---|---|---|
| return | str | Deduplicated text with near-duplicate adjacent lines removed |
Usage Examples
from dedup import CleanerDedupLineByNgram
deduper = CleanerDedupLineByNgram()
text = """This is the first line of content.
This is the first line of content.
This is a different line entirely.
This is the first line of content."""
cleaned = deduper.clean_single_text(text, n=5, thre_sim=0.95)
print(cleaned)
# The second line is dropped as a duplicate of the first.
# The fourth line is kept: it repeats the first line, but its predecessor is different.
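For readers without the repository at hand, the adjacent-line filtering described on this page can be approximated with a standalone sketch. This is a hypothetical re-implementation, not the repository code: the function name `dedup_adjacent_lines`, the delimiter characters, the choice to compare against the most recently kept line, the dropping of empty lines, and the strict `>` comparison at the threshold are all assumptions.

```python
import re
import string

# Hypothetical standalone re-implementation for illustration only.
_DELIMS = list(string.punctuation) + list("，。！？：；、（）《》") + [" "]
_SPLIT_RE = re.compile("|".join(map(re.escape, _DELIMS)))


def _ngram_set(line: str, n: int) -> set:
    """Set of n-grams over the delimiter-separated tokens of a line."""
    tokens = [t for t in _SPLIT_RE.split(line) if t]
    size = min(n, len(tokens))
    if size == 0:
        return set()
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def dedup_adjacent_lines(text: str, n: int = 5, thre_sim: float = 0.95) -> str:
    """Drop each line whose Jaccard similarity to the previously kept line exceeds thre_sim."""
    kept = []
    prev = None  # n-gram set of the most recently kept line
    for line in text.split("\n"):
        if not line:
            continue  # assumption: empty lines are discarded
        grams = _ngram_set(line, n)
        if prev is not None:
            union = grams | prev
            sim = len(grams & prev) / len(union) if union else 0.0
            if sim > thre_sim:
                continue  # near-duplicate of the previously kept line
        kept.append(line)
        prev = grams
    return "\n".join(kept)


sample = """This is the first line of content.
This is the first line of content.
This is a different line entirely.
This is the first line of content."""
result = dedup_adjacent_lines(sample)
print(result)
```

With this sketch, the sample collapses to three lines: the repeated second line is removed, while the final line survives because only neighboring lines are compared. Removing non-adjacent repeats would need a different technique, such as hashing every line seen so far.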