Principle:LLMBook zh LLMBook zh github io Deduplication

Knowledge Sources	Deduplicating Training Data Makes Language Models Better, LLMBook-zh
Domains	NLP, Data_Preprocessing
Last Updated	2026-02-08 00:00 GMT

Overview

A text deduplication technique that removes near-duplicate adjacent lines within a document using n-gram Jaccard similarity.

Description

Deduplication addresses the problem of repeated or near-identical content in web-crawled text corpora. Duplicate content in training data can lead to memorization, reduced diversity, and wasted compute during pre-training. This technique compares adjacent lines within a document by computing the Jaccard similarity of their character n-gram sets. If the similarity of two adjacent lines reaches the threshold, the later line is treated as a duplicate and removed.

This is a local (intra-document, adjacent-line) deduplication method, as opposed to global (cross-document) deduplication methods like MinHash LSH.

Usage

Use this principle after quality filtering and before privacy masking in a data preprocessing pipeline. It is most effective for removing boilerplate text, repeated headers/footers, and copy-pasted content within individual documents, provided the repeats fall on consecutive lines, since only adjacent lines are compared.
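
The stage ordering described above can be sketched as a small pipeline. This is a minimal illustration only: every stage function, the 5-word quality rule, the e-mail pattern, and the `[EMAIL]` placeholder are hypothetical and not taken from the source, and the deduplication stage is a simplified exact-match stand-in rather than the n-gram Jaccard comparison.

```python
import re
from typing import Callable, List, Optional

# Every stage below is a hypothetical placeholder, not a function from the source.
Stage = Callable[[str], Optional[str]]


def quality_filter(text: str) -> Optional[str]:
    """Hypothetical quality rule: discard documents with fewer than 5 words."""
    return text if len(text.split()) >= 5 else None


def dedup_adjacent_lines(text: str) -> str:
    """Simplified stand-in: drop a line only if it exactly repeats the previous kept line."""
    kept: List[str] = []
    for line in text.split("\n"):
        if not kept or line != kept[-1]:
            kept.append(line)
    return "\n".join(kept)


def mask_privacy(text: str) -> str:
    """Hypothetical masking rule: replace e-mail-like strings with a placeholder."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)


# Order taken from the text: quality filtering, then deduplication, then privacy masking.
STAGES: List[Stage] = [quality_filter, dedup_adjacent_lines, mask_privacy]


def run_pipeline(text: str, stages: List[Stage] = STAGES) -> Optional[str]:
    """Apply each stage in order; stop early if a stage discards the document."""
    for stage in stages:
        result = stage(text)
        if result is None:
            return None
        text = result
    return text


doc = "Contact me at a@b.com\nContact me at a@b.com\nThanks for reading"
print(run_pipeline(doc))
```

In this sketch a stage returns `None` to signal that the whole document should be discarded, so later stages never run on text that failed an earlier one.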

Theoretical Basis

The Jaccard similarity between two sets A and B is defined as:

$J (A, B) = \frac{| A \cap B |}{| A \cup B |}$
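
As a small worked example (the strings are illustrative and not taken from the source), let $A$ and $B$ be the character 5-gram sets of "abcdefg" and "abcdefh":

$A = \{\text{abcde}, \text{bcdef}, \text{cdefg}\}, \quad B = \{\text{abcde}, \text{bcdef}, \text{cdefh}\}$

$J (A, B) = \frac{| \{\text{abcde}, \text{bcdef}\} |}{| \{\text{abcde}, \text{bcdef}, \text{cdefg}, \text{cdefh}\} |} = \frac{2}{4} = 0.5$

Changing one character of such a short line already drops the similarity to 0.5, well below a threshold such as 0.95.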

For deduplication, each line is converted to a set of character n-grams (default n=5), and adjacent lines are compared:

Pseudo-code:

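# Abstract algorithm (NOT the real implementation)
def dedup_adjacent_lines(lines, n=5, threshold=0.95, tokenize=list):
    def ngram_set(line):  # tokenize=list splits a line into characters
        toks = tokenize(line)
        return {tuple(toks[k:k + n]) for k in range(max(len(toks) - n + 1, 1))}
    kept = []
    for line in lines:
        if kept:  # compare with the previous surviving line
            a, b = ngram_set(kept[-1]), ngram_set(line)
            if len(a & b) / len(a | b) >= threshold:
                continue  # near-duplicate: remove the later line
        kept.append(line)
    return kept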

The threshold (default 0.95) controls the trade-off between missed duplicates and false removals: higher values catch only near-exact duplicates, while lower values remove more repeats but may also delete legitimately similar content.
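
To make the threshold trade-off concrete, the following sketch scores a few candidate lines against a preceding line and reports whether each would be removed at two thresholds. The helper names, the example lines, and the comparison threshold 0.7 are assumptions chosen for illustration; only n=5 and 0.95 come from the defaults stated above.

```python
def char_ngrams(line: str, n: int = 5) -> set:
    """Set of character n-grams; a line shorter than n yields itself as one gram."""
    if len(line) <= n:
        return {line}
    return {line[i:i + n] for i in range(len(line) - n + 1)}


def line_similarity(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity of the two lines' character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    # Each set holds at least one element, so the union is never empty.
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Illustrative lines (not from the source).
previous = "Copyright 2024 Example Corp. All rights reserved."
candidates = {
    "exact repeat": "Copyright 2024 Example Corp. All rights reserved.",
    "one character changed": "Copyright 2025 Example Corp. All rights reserved.",
    "unrelated line": "Deduplication removes repeated lines.",
}

for label, line in candidates.items():
    sim = line_similarity(previous, line)
    # A line is removed when its similarity to the previous line reaches the threshold.
    decisions = {thr: sim >= thr for thr in (0.95, 0.7)}
    print(f"{label}: similarity={sim:.2f}, removed at {decisions}")
```

With these example lines, the exact repeat is removed at both thresholds, while the one-character variant scores roughly 0.8 and so survives at 0.95 but would be dropped at 0.7. Whether that variant counts as a duplicate or as legitimate content depends on the corpus, which is why the threshold is best tuned on sample data.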

Related Pages

Implemented By

Implementation:LLMBook_zh_LLMBook_zh_github_io_CleanerDedupLineByNgram_Clean_Single_Text

Uses Heuristic

Heuristic:LLMBook_zh_LLMBook_zh_github_io_Deduplication_Ngram_Threshold
