# Principle: Hugging Face DataTrove Repetition Filtering
| Sources | Domains | Last Updated |
|---|---|---|
| Gopher (Rae et al. 2021) Table A1 | Data_Quality, NLP | 2026-02-14 00:00 GMT |
## Overview
Detecting and removing documents with excessive textual repetition at line, paragraph, and n-gram levels.
## Description
Repetition filtering identifies low-quality documents by measuring duplicate content at multiple granularities. The approach checks three categories of repetition:
- Duplicate lines and paragraphs (exact match): The fraction of lines or paragraphs that appear more than once, and the fraction of characters belonging to those duplicated lines or paragraphs.
- Top n-gram character fractions (2-4 grams): The fraction of the document's characters covered by the single most frequent n-gram, for n in {2, 3, 4}. A document dominated by a single repeated short phrase will have a high top n-gram fraction.
- Duplicate n-gram character fractions (5-10 grams): The fraction of total characters that belong to any n-gram appearing more than once, for n in {5, 6, 7, 8, 9, 10}. This captures longer repeated passages and boilerplate text.
Each metric has a configurable threshold derived from Table A1 of the Gopher paper. Exceeding any single threshold causes the entire document to be removed. This multi-level approach catches both fine-grained repetition (repeated short phrases) and coarse-grained repetition (duplicated paragraphs or boilerplate blocks).
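The three metric families can be sketched in plain Python. This is a simplified, word-level illustration; the function names and the whitespace tokenization are assumptions for exposition, not DataTrove's actual implementation:

```python
from collections import Counter

def dup_fraction(items):
    """Fraction of lines/paragraphs that are duplicates (exact match),
    and fraction of characters belonging to those duplicated items."""
    if not items:
        return 0.0, 0.0
    counts = Counter(items)
    dup_items = sum(c for c in counts.values() if c > 1)
    dup_chars = sum(len(t) * c for t, c in counts.items() if c > 1)
    total_chars = sum(len(t) for t in items)
    return dup_items / len(items), dup_chars / max(total_chars, 1)

def top_ngram_char_fraction(words, n):
    """Characters covered by the single most frequent word n-gram.
    With heavily overlapping repeats this can exceed 1.0; in practice
    the thresholds are low, so that corner case still triggers removal."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    top, count = Counter(ngrams).most_common(1)[0]
    total_chars = sum(len(w) for w in words)
    return count * sum(len(w) for w in top) / max(total_chars, 1)

def dup_ngram_char_fraction(words, n):
    """Fraction of characters belonging to any word n-gram that
    appears more than once in the document."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    covered = [False] * len(words)  # word positions inside a repeated n-gram
    for i, ng in enumerate(ngrams):
        if counts[ng] > 1:
            for j in range(i, i + n):
                covered[j] = True
    total_chars = sum(len(w) for w in words)
    dup_chars = sum(len(w) for w, c in zip(words, covered) if c)
    return dup_chars / max(total_chars, 1)
```

For duplicate n-grams, each word position is marked at most once, so overlapping repeats are not double-counted; the top n-gram metric, by contrast, weights the single most frequent n-gram by its occurrence count.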
## Usage
Applied as part of quality filtering after text extraction, to remove boilerplate-heavy and auto-generated content. Typically used early in the filtering pipeline since repetitive documents are a clear signal of low quality and removing them reduces downstream processing costs.
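Placing the filter early in the pipeline can be illustrated with a plain-Python generator (a sketch, not DataTrove's pipeline API; `is_repetitive` is a hypothetical stand-in that checks only one of the metrics, the duplicate line fraction):

```python
from collections import Counter

def is_repetitive(text: str, max_dup_line_frac: float = 0.30) -> bool:
    """Illustrative check using a single metric: the fraction of
    non-empty lines that appear more than once in the document."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    counts = Counter(lines)
    dup = sum(c for c in counts.values() if c > 1)
    return dup / len(lines) > max_dup_line_frac

def quality_pipeline(docs):
    """Drop repetitive documents first, so later (more expensive)
    stages never have to process them."""
    for doc in docs:
        if is_repetitive(doc):
            continue  # removed: exceeds the duplicate-line threshold
        yield doc  # passes on to downstream quality filters
```

Because the generator short-circuits on the cheap repetition check, downstream filters (language ID, quality classifiers, deduplication) only ever see documents that survived it.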
## Theoretical Basis
The thresholds originate from Table A1 of the Gopher paper (Rae et al., 2021). The key design decisions are:
- Duplicate detection via set-based matching: Lines and paragraphs are compared by exact match using hash-based lookups. If a line or paragraph has already been seen in the same document, it is counted as a duplicate. This runs in O(n) time in the number of lines or paragraphs.
- N-gram repetition via counting and character-length weighting: Rather than counting raw n-gram occurrences, the metrics weight each repeated n-gram by its character length. Longer repeated n-grams therefore contribute proportionally more to the repetition score, which better reflects how much of the document the repetition actually occupies.
- Graduated thresholds: Thresholds decrease as n-gram size increases (e.g., top 2-gram at 0.20, top 4-gram at 0.16; duplicate 5-gram at 0.15, duplicate 10-gram at 0.10). This reflects the intuition that longer repeated sequences are a stronger signal of low quality and should be tolerated less.
| Metric | Default Threshold |
|---|---|
| Duplicate line fraction | 0.30 |
| Duplicate paragraph fraction | 0.30 |
| Duplicate line character fraction | 0.20 |
| Duplicate paragraph character fraction | 0.20 |
| Top 2-gram character fraction | 0.20 |
| Top 3-gram character fraction | 0.18 |
| Top 4-gram character fraction | 0.16 |
| Duplicate 5-gram character fraction | 0.15 |
| Duplicate 6-gram character fraction | 0.14 |
| Duplicate 7-gram character fraction | 0.13 |
| Duplicate 8-gram character fraction | 0.12 |
| Duplicate 9-gram character fraction | 0.11 |
| Duplicate 10-gram character fraction | 0.10 |
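Taken together, the table reads as a simple configuration: compute each metric for a document and remove it if any value exceeds its threshold. A minimal sketch (the metric key names are illustrative, not DataTrove's parameter names):

```python
# Thresholds from Table A1 of the Gopher paper (Rae et al., 2021).
GOPHER_REPETITION_THRESHOLDS = {
    "dup_line_frac": 0.30,
    "dup_para_frac": 0.30,
    "dup_line_char_frac": 0.20,
    "dup_para_char_frac": 0.20,
    "top_2_gram_char_frac": 0.20,
    "top_3_gram_char_frac": 0.18,
    "top_4_gram_char_frac": 0.16,
    "dup_5_gram_char_frac": 0.15,
    "dup_6_gram_char_frac": 0.14,
    "dup_7_gram_char_frac": 0.13,
    "dup_8_gram_char_frac": 0.12,
    "dup_9_gram_char_frac": 0.11,
    "dup_10_gram_char_frac": 0.10,
}

def keep_document(metrics: dict) -> bool:
    """Keep a document only if every measured metric is at or below its
    threshold; exceeding any single threshold removes the whole document.
    `metrics` maps the names above to values computed for one document."""
    return all(
        value <= GOPHER_REPETITION_THRESHOLDS[name]
        for name, value in metrics.items()
    )
```

Note the comparison is `<=`: a document sitting exactly at a threshold is kept, and only strictly exceeding it triggers removal, matching the "exceeding any single threshold" rule above.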