# Principle: Hugging Face DataTrove Repetition Filtering
| Sources | Domains | Last Updated |
|---|---|---|
| Gopher (Rae et al. 2021) Table A1 | Data_Quality, NLP | 2026-02-14 00:00 GMT |
## Overview
Detecting and removing documents with excessive textual repetition at line, paragraph, and n-gram levels.
## Description
Repetition filtering identifies low-quality documents by measuring duplicate content at multiple granularities. The approach checks three categories of repetition:
- Duplicate lines and paragraphs (exact match): The fraction of lines or paragraphs that appear more than once, and the fraction of characters belonging to those duplicated lines or paragraphs.
- Top n-gram character fractions (2-4 grams): The fraction of the document's characters covered by the single most frequent n-gram, for n in {2, 3, 4}. A document dominated by a single repeated short phrase will have a high top n-gram fraction.
- Duplicate n-gram character fractions (5-10 grams): The fraction of total characters that belong to any n-gram appearing more than once, for n in {5, 6, 7, 8, 9, 10}. This captures longer repeated passages and boilerplate text.
Each metric has a configurable threshold derived from Table A1 of the Gopher paper. Exceeding any single threshold causes the entire document to be removed. This multi-level approach catches both fine-grained repetition (repeated short phrases) and coarse-grained repetition (duplicated paragraphs or boilerplate blocks).
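The three metric families can be sketched in plain Python. This is a simplified, word-level illustration; the function names and the whitespace tokenization are assumptions for exposition, not DataTrove's actual implementation:

```python
from collections import Counter

def dup_fraction(items):
    """Fraction of lines/paragraphs that are duplicates (exact match),
    and fraction of characters belonging to those duplicated items."""
    if not items:
        return 0.0, 0.0
    counts = Counter(items)
    dup_items = sum(c for c in counts.values() if c > 1)
    dup_chars = sum(len(t) * c for t, c in counts.items() if c > 1)
    total_chars = sum(len(t) for t in items)
    return dup_items / len(items), dup_chars / max(total_chars, 1)

def top_ngram_char_fraction(words, n):
    """Characters covered by the single most frequent word n-gram.
    With heavily overlapping repeats this can exceed 1.0; in practice
    the thresholds are low, so that corner case still triggers removal."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    top, count = Counter(ngrams).most_common(1)[0]
    total_chars = sum(len(w) for w in words)
    return count * sum(len(w) for w in top) / max(total_chars, 1)

def dup_ngram_char_fraction(words, n):
    """Fraction of characters belonging to any word n-gram that
    appears more than once in the document."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    covered = [False] * len(words)  # word positions inside a repeated n-gram
    for i, ng in enumerate(ngrams):
        if counts[ng] > 1:
            for j in range(i, i + n):
                covered[j] = True
    total_chars = sum(len(w) for w in words)
    dup_chars = sum(len(w) for w, c in zip(words, covered) if c)
    return dup_chars / max(total_chars, 1)
```

For duplicate n-grams, each word position is marked at most once, so overlapping repeats are not double-counted; the top n-gram metric, by contrast, weights the single most frequent n-gram by its occurrence count.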
## Usage
Applied as part of quality filtering after text extraction, to remove boilerplate-heavy and auto-generated content. Typically used early in the filtering pipeline since repetitive documents are a clear signal of low quality and removing them reduces downstream processing costs.
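Placing the filter early in the pipeline can be illustrated with a plain-Python generator (a sketch, not DataTrove's pipeline API; `is_repetitive` is a hypothetical stand-in that checks only one of the metrics, the duplicate line fraction):

```python
from collections import Counter

def is_repetitive(text: str, max_dup_line_frac: float = 0.30) -> bool:
    """Illustrative check using a single metric: the fraction of
    non-empty lines that appear more than once in the document."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    counts = Counter(lines)
    dup = sum(c for c in counts.values() if c > 1)
    return dup / len(lines) > max_dup_line_frac

def quality_pipeline(docs):
    """Drop repetitive documents first, so later (more expensive)
    stages never have to process them."""
    for doc in docs:
        if is_repetitive(doc):
            continue  # removed: exceeds the duplicate-line threshold
        yield doc  # passes on to downstream quality filters
```

Because the generator short-circuits on the cheap repetition check, downstream filters (language ID, quality classifiers, deduplication) only ever see documents that survived it.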
## Theoretical Basis
The thresholds originate from Table A1 of the Gopher paper (Rae et al., 2021). The key design decisions are:
- Duplicate detection via set-based matching: Lines and paragraphs are compared by exact match using hash-based lookups. If a line or paragraph has already been seen in the same document, it is counted as a duplicate. This runs in O(n) time in the number of lines or paragraphs.
- N-gram repetition via counting and character-length weighting: Rather than counting raw n-gram occurrences, the metrics weight each repeated n-gram by its character length. Longer repeated n-grams therefore contribute proportionally more to the repetition score, which better reflects how much of the document the repetition actually occupies.
- Graduated thresholds: Thresholds decrease as n-gram size increases (e.g., top 2-gram at 0.20, top 4-gram at 0.16; duplicate 5-gram at 0.15, duplicate 10-gram at 0.10). This reflects the intuition that longer repeated sequences are a stronger signal of low quality and should be tolerated less.
| Metric | Default Threshold |
|---|---|
| Duplicate line fraction | 0.30 |
| Duplicate paragraph fraction | 0.30 |
| Duplicate line character fraction | 0.20 |
| Duplicate paragraph character fraction | 0.20 |
| Top 2-gram character fraction | 0.20 |
| Top 3-gram character fraction | 0.18 |
| Top 4-gram character fraction | 0.16 |
| Duplicate 5-gram character fraction | 0.15 |
| Duplicate 6-gram character fraction | 0.14 |
| Duplicate 7-gram character fraction | 0.13 |
| Duplicate 8-gram character fraction | 0.12 |
| Duplicate 9-gram character fraction | 0.11 |
| Duplicate 10-gram character fraction | 0.10 |
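Taken together, the table reads as a simple configuration: compute each metric for a document and remove it if any value exceeds its threshold. A minimal sketch (the metric key names are illustrative, not DataTrove's parameter names):

```python
# Thresholds from Table A1 of the Gopher paper (Rae et al., 2021).
GOPHER_REPETITION_THRESHOLDS = {
    "dup_line_frac": 0.30,
    "dup_para_frac": 0.30,
    "dup_line_char_frac": 0.20,
    "dup_para_char_frac": 0.20,
    "top_2_gram_char_frac": 0.20,
    "top_3_gram_char_frac": 0.18,
    "top_4_gram_char_frac": 0.16,
    "dup_5_gram_char_frac": 0.15,
    "dup_6_gram_char_frac": 0.14,
    "dup_7_gram_char_frac": 0.13,
    "dup_8_gram_char_frac": 0.12,
    "dup_9_gram_char_frac": 0.11,
    "dup_10_gram_char_frac": 0.10,
}

def keep_document(metrics: dict) -> bool:
    """Keep a document only if every measured metric is at or below its
    threshold; exceeding any single threshold removes the whole document.
    `metrics` maps the names above to values computed for one document."""
    return all(
        value <= GOPHER_REPETITION_THRESHOLDS[name]
        for name, value in metrics.items()
    )
```

Note the comparison is `<=`: a document sitting exactly at a threshold is kept, and only strictly exceeding it triggers removal, matching the "exceeding any single threshold" rule above.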