Principle:LLMBook zh LLMBook zh github io Deduplication
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Preprocessing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A text deduplication technique that removes near-duplicate adjacent lines within a document using n-gram Jaccard similarity.
Description
Deduplication addresses the problem of repeated or near-identical content in web-crawled text corpora. Duplicate content in training data can lead to memorization, reduced diversity, and wasted compute during pre-training. This technique compares adjacent lines within a document by computing the Jaccard similarity of their character n-gram sets. If the similarity of two adjacent lines meets or exceeds a threshold, the second line of the pair is removed as a duplicate.
This is a local (intra-document, adjacent-line) deduplication method, as opposed to global (cross-document) deduplication methods like MinHash LSH.
Usage
Use this principle after quality filtering and before privacy masking in a data preprocessing pipeline. It is most effective for removing boilerplate text, repeated headers/footers, and copy-pasted content within individual documents.
Theoretical Basis
The Jaccard similarity between two sets A and B is defined as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
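For deduplication, each line is converted to a set of character n-grams (default n=5), and adjacent lines are compared:
$$\mathrm{sim}(l_i, l_{i+1}) = \frac{|G_n(l_i) \cap G_n(l_{i+1})|}{|G_n(l_i) \cup G_n(l_{i+1})|}$$
where G_n(l) denotes the set of n-grams of line l.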
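As a small worked example (the two strings are illustrative and not taken from any corpus), consider the lines "abcdef" and "abcdeg" with n=5, writing G_5(l) for the set of character 5-grams of a line l:

$$G_5(\text{abcdef}) = \{\text{abcde}, \text{bcdef}\}, \qquad G_5(\text{abcdeg}) = \{\text{abcde}, \text{bcdeg}\}$$

$$J = \frac{|\{\text{abcde}\}|}{|\{\text{abcde}, \text{bcdef}, \text{bcdeg}\}|} = \frac{1}{3} \approx 0.33$$

A single changed character alters every n-gram that covers it, so on short lines the similarity drops quickly. At a threshold of 0.95 these two lines would both be kept.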
Pseudo-code:
# Abstract algorithm (NOT real implementation)
for each adjacent pair (line_i, line_j):
    ngrams_i = set(ngrams(tokenize(line_i), n))
    ngrams_j = set(ngrams(tokenize(line_j), n))
    similarity = len(ngrams_i & ngrams_j) / len(ngrams_i | ngrams_j)
    if similarity >= threshold:
        remove(line_j)  # Remove the duplicate
The threshold (default 0.95) controls the trade-off: higher values only catch near-exact duplicates, while lower values may remove legitimately similar content.
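The procedure can be sketched in runnable Python as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function names `char_ngrams`, `jaccard`, and `dedup_adjacent_lines` are hypothetical, n-grams are taken over characters (following the prose description), each line is compared against the most recently kept line, and two empty n-gram sets (lines shorter than n) are treated as having similarity 0.0 so that short lines are never dropped.

```python
def char_ngrams(line: str, n: int = 5) -> set:
    """Return the set of character n-grams of a line (empty if len(line) < n)."""
    return {line[i:i + n] for i in range(len(line) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|.

    Assumption: returns 0.0 when both sets are empty, which avoids a
    division by zero and keeps very short lines.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def dedup_adjacent_lines(text: str, n: int = 5, threshold: float = 0.95) -> str:
    """Drop each line whose n-gram set is near-identical to the last kept line."""
    kept = []
    prev_grams = None  # n-gram set of the most recently kept line
    for line in text.split("\n"):
        grams = char_ngrams(line, n)
        if prev_grams is not None and jaccard(prev_grams, grams) >= threshold:
            continue  # near-duplicate of the previous kept line: skip it
        kept.append(line)
        prev_grams = grams
    return "\n".join(kept)


if __name__ == "__main__":
    doc = "\n".join([
        "Subscribe to our newsletter for weekly updates!",
        "Subscribe to our newsletter for weekly updates!",
        "An unrelated line of body text.",
    ])
    # The repeated second line is removed; the other two lines are kept.
    print(dedup_adjacent_lines(doc))
```

Comparing against the last kept line means a run of three or more near-identical lines collapses to a single line. Lowering `threshold` makes the filter more aggressive, for example `dedup_adjacent_lines("abcdef\nabcdeg", threshold=0.3)` drops the second line because the similarity is 1/3.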