Heuristic:LLMBook zh LLMBook zh github io Deduplication Ngram Threshold

Knowledge Sources	LLMBook-zh, Deduplicating Training Data Makes Language Models Better
Domains	Data_Processing, NLP
Last Updated	2026-02-08 04:30 GMT

Overview

Use 5-gram Jaccard similarity with a threshold of 0.95 for adjacent-line deduplication in text preprocessing.

Description

The deduplication algorithm compares each line with the line before it using n-gram Jaccard similarity. Each line is split into tokens using punctuation and whitespace as delimiters, then n-grams over those tokens (default n=5) are computed. If the Jaccard similarity between the two lines reaches the threshold (0.95 or higher), the later line is removed. In the excerpt under Code Evidence, `last` is updated when a line is kept, so a removed line does not become the reference for the next comparison. This conservative threshold ensures only near-identical lines are removed while preserving similar but distinct content.
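The steps above can be sketched roughly as follows. This is a minimal illustration, not the LLMBook-zh implementation: the tokenizer (`re.findall(r"\w+", ...)`), the helper names `ngram_set`, `jaccard`, and `dedup_adjacent_lines`, and the treatment of lines shorter than `n` tokens are all assumptions. Only the parameter names `n` and `thre_sim`, the zero-union guard, and the keep rule are taken from the code evidence on this page.

```python
import re


def ngram_set(line, n=5):
    # Assumed tokenizer: runs of word characters, so whitespace and
    # punctuation act as delimiters. The original splitter may differ.
    tokens = re.findall(r"\w+", line)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    # Intersection over union; defined as 0 when both sets are empty,
    # matching the guard in the excerpt under Code Evidence.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dedup_adjacent_lines(text, n=5, thre_sim=0.95):
    kept = []
    last = None  # n-gram set of the most recently kept line
    for line in text.split("\n"):
        cur = ngram_set(line, n)
        # Keep the line only if it is sufficiently different from `last`.
        if last is None or jaccard(last, cur) < thre_sim:
            kept.append(line)
            last = cur
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "a b c d e f g\na b c d e f g\nx y"
    print(dedup_adjacent_lines(sample))
```

In this sketch a line with fewer than `n` tokens has an empty n-gram set, so its similarity to any other line is 0 and it is always kept, even when it repeats the previous line exactly. Whether the original code behaves the same way is not shown in the excerpts here.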

Usage

Use this heuristic during data preprocessing to remove highly repetitive adjacent lines in training corpora. The 0.95 threshold is intentionally conservative to avoid false positives. Lower the threshold (e.g., 0.8) for more aggressive deduplication. Increase the n-gram size for stricter matching: a single changed token affects up to n n-grams, so a larger n lowers the similarity of lines that differ by small edits.
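To make the effect of these two knobs concrete, the following sketch scores one pair of lines and applies the removal rule at two thresholds. The helper `line_similarity`, its `\w+` tokenizer, and the synthetic 30-token lines are assumptions made for illustration; they are not part of the original code.

```python
import re


def line_similarity(a, b, n=5):
    # Jaccard similarity over token n-grams (assumed \w+ tokenizer).
    def grams(line):
        toks = re.findall(r"\w+", line)
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    ga, gb = grams(a), grams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


# Two synthetic 30-token lines that differ only in the final token.
base = " ".join(f"w{i}" for i in range(30))
variant = base.rsplit(" ", 1)[0] + " changed"

sim = line_similarity(base, variant)            # 25 shared / 27 total
sim_n10 = line_similarity(base, variant, n=10)  # 20 shared / 22 total

for thre_sim in (0.95, 0.8):
    decision = "remove" if sim >= thre_sim else "keep"
    print(f"thre_sim={thre_sim}: sim={sim:.3f} -> {decision}")
print(f"n=10 gives sim={sim_n10:.3f}")
```

With the default `thre_sim=0.95` the second line is kept, because one changed token already drops the similarity to about 0.926. At 0.8 it would be removed. Raising `n` to 10 lowers the score for the same pair, which is the sense in which larger n-grams match more strictly.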

The Insight (Rule of Thumb)

Action: Set `n=5` for n-gram size and `thre_sim=0.95` for Jaccard similarity threshold.
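Value: n=5 is a middle ground between short n-grams that match on common phrases and long n-grams that stop matching after any small edit; threshold=0.95 is conservative (only near-duplicates are removed).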
Trade-off: Higher threshold = fewer false positives but more duplicates remain. Lower threshold = more aggressive deduplication but risk of removing legitimate similar content.
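Scope: Only compares adjacent lines, not all pairs, so the number of comparisons grows linearly with the number of lines, O(L) for L lines, rather than quadratically, O(L^2).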

Reasoning

5-grams capture enough context to identify substantive overlap while being short enough to match across minor variations. Because Jaccard similarity is intersection over union, the 0.95 threshold means the n-grams shared by the two lines must make up at least 95% of all distinct n-grams that appear in either line. The adjacent-line comparison strategy is computationally efficient (linear in the number of lines) and targets the most common duplication pattern in web-scraped text: repeated headers, footers, navigation elements, and boilerplate that appear on consecutive lines.
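A short worked calculation shows how strict 0.95 is. It assumes, for simplicity, that two lines each contain m distinct n-grams and that an edit replaces k of them with n-grams not found in the other line; the symbols m and k are introduced here for illustration and do not appear in the original code.

```latex
% A, B: n-gram sets of the two lines, |A| = |B| = m, k n-grams differ
\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{m - k}{m + k},
\qquad
J(A, B) \ge 0.95 \iff m - k \ge 0.95\,(m + k) \iff m \ge 39k .
\]
% n = 5, last token changed:      k = 1  =>  m >= 39   (at least 43 tokens)
% n = 5, interior token changed:  k = 5  =>  m >= 195  (at least 199 tokens)
```

Under these assumptions, a line must be at least 43 tokens long before a change to its final token still counts as a duplicate, and about 199 tokens long when the changed token sits in the interior. For typical short lines, then, the default settings remove only lines that are identical after tokenization.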

Code Evidence:

Default parameters from `code/4.2 去重.py:11`:

def clean_single_text(self, text: str, n: int = 5, thre_sim: float = 0.95) -> str:

Jaccard similarity computation from `code/4.2 去重.py:28-31`:

ngrams_last, ngrams_cur = set(last["ngrams"]), set(each["ngrams"])
ngrams_intersection = len(ngrams_last.intersection(ngrams_cur))
ngrams_union = len(ngrams_last.union(ngrams_cur))
jaccard_sim = ngrams_intersection / ngrams_union if ngrams_union != 0 else 0

Filtering decision from `code/4.2 去重.py:31-32`:

if jaccard_sim < thre_sim:
    each["keep"], last = 1, each
