Principle:LLMBook zh LLMBook zh github io Deduplication

From Leeroopedia
Revision as of 17:28, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/LLMBook_zh_LLMBook_zh_github_io_Deduplication.md)

Knowledge Sources
Domains: NLP, Data_Preprocessing
Last Updated: 2026-02-08 00:00 GMT

Overview

A text deduplication technique that removes near-duplicate adjacent lines within a document using n-gram Jaccard similarity.

Description

Deduplication addresses the problem of repeated or near-identical content in web-crawled text corpora. Duplicate content in training data can lead to memorization, reduced diversity, and wasted compute during pre-training. This technique compares each pair of adjacent lines within a document by computing the Jaccard similarity of their character n-gram sets. If the similarity meets or exceeds a threshold, the second line of the pair is removed as a duplicate.

This is a local (intra-document, adjacent-line) deduplication method, as opposed to global (cross-document) deduplication methods like MinHash LSH.

Usage

Use this principle after quality filtering and before privacy masking in a data preprocessing pipeline. It is most effective for removing boilerplate text, repeated headers/footers, and copy-pasted content within individual documents.
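The ordering described above can be sketched as a chain of stages. This is only an illustration: the stage names (`quality_filter`, `deduplicate`, `mask_privacy`), the word-count rule, and the email pattern are hypothetical placeholders, not part of the source, and `deduplicate` here drops only exact adjacent repeats as a stand-in for the n-gram Jaccard comparison.

```python
import re
from typing import Callable, List, Optional

# A stage returns the (possibly modified) document, or None to drop it.
Stage = Callable[[str], Optional[str]]


def quality_filter(doc: str) -> Optional[str]:
    # Placeholder rule (assumption): drop documents with fewer than 5 words.
    return doc if len(doc.split()) >= 5 else None


def deduplicate(doc: str) -> Optional[str]:
    # Placeholder: removes exact adjacent repeats only; a real stage would
    # compare adjacent lines by n-gram Jaccard similarity instead.
    kept: List[str] = []
    for line in doc.splitlines():
        if not kept or line != kept[-1]:
            kept.append(line)
    return "\n".join(kept)


def mask_privacy(doc: str) -> Optional[str]:
    # Placeholder: mask email-like strings with a fixed token.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", doc)


# Order follows the text: quality filtering, then deduplication, then masking.
PIPELINE: List[Stage] = [quality_filter, deduplicate, mask_privacy]


def run_pipeline(doc: str, stages: List[Stage] = PIPELINE) -> Optional[str]:
    for stage in stages:
        result = stage(doc)
        if result is None:
            return None  # Document was dropped by this stage
        doc = result
    return doc
```

One practical effect of this ordering is that the masking stage only processes lines that survived deduplication.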

Theoretical Basis

The Jaccard similarity between two sets A and B is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
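As a small worked example of the definition, the sketch below computes the Jaccard similarity of two toy sets. The function name `jaccard` and the choice to return 0.0 when both sets are empty are assumptions made for illustration; the source does not specify the empty case.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    # Assumption: two empty sets are treated as having similarity 0.0,
    # which also avoids a division by zero.
    return len(a & b) / len(union) if union else 0.0


A = {"the", "quick", "brown", "fox"}
B = {"the", "quick", "red", "fox"}

# 3 shared elements out of 5 distinct elements overall
print(jaccard(A, B))  # 0.6
```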

For deduplication, each line is converted to a set of character n-grams (default n=5), and adjacent lines are compared:

Pseudo-code:

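# Abstract algorithm as a runnable Python sketch (NOT the real implementation)
def char_ngrams(line, n=5):
    return {line[i:i + n] for i in range(len(line) - n + 1)}  # empty if len(line) < n

def dedup_adjacent(lines, n=5, threshold=0.95):
    kept, prev = [], set()
    for line in lines:
        cur = char_ngrams(line, n)
        union = prev | cur
        similarity = len(prev & cur) / len(union) if union else 0.0
        if similarity < threshold:
            kept.append(line)  # otherwise drop the line as a duplicate
        prev = cur  # the next line is compared with this one, kept or not
    return kept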

The threshold (default 0.95) controls how aggressive the removal is: higher values catch only near-exact duplicates, while lower values catch more duplicates but may also remove legitimately similar content.
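To make the threshold's effect concrete, the sketch below scores three pairs of short lines and checks each score against two thresholds. The helper names (`char_ngrams`, `line_similarity`), the sample strings, and the comparison threshold 0.8 are hypothetical. Character 5-grams are used as the prose describes, and lines too short to produce any n-gram are assumed to be dissimilar.

```python
def char_ngrams(line: str, n: int = 5) -> set:
    """Set of character n-grams of `line` (empty when the line is shorter than n)."""
    return {line[i:i + n] for i in range(len(line) - n + 1)}


def line_similarity(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity of the two lines' character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    # Assumption: if neither line yields any n-gram, treat the pair as dissimilar.
    return len(grams_a & grams_b) / len(union) if union else 0.0


pairs = {
    "identical": ("abcdefghij", "abcdefghij"),
    "one extra trailing character": ("abcdefghij", "abcdefghijk"),
    "one character changed mid-line": ("abcdefghij", "abcdeXghij"),
}

for label, (first, second) in pairs.items():
    score = line_similarity(first, second)
    # A pair counts as a duplicate when its score meets or exceeds the threshold.
    verdicts = {t: score >= t for t in (0.95, 0.8)}
    print(f"{label}: similarity={score:.3f}, removed at {verdicts}")
```

On these toy strings the identical pair scores 1.0, the pair with one extra trailing character scores 6/7 (about 0.857), and the pair with one character changed mid-line scores 1/11 (about 0.091), because a single substitution alters every n-gram that overlaps it. Only the identical pair is removed at 0.95, while 0.8 also removes the second pair. On short lines, then, even small edits fall well below the default threshold.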

Related Pages

Implemented By

Uses Heuristic
