Implementation: Huggingface Datatrove GopherRepetitionFilter
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove, Gopher (Rae et al. 2021) Table A1 | Data_Quality, NLP | 2026-02-14 00:00 GMT |
Overview
Filter implementation that removes documents with excessive textual repetition, checking duplicate lines, duplicate paragraphs, top n-gram fractions, and duplicate n-gram fractions against configurable thresholds from Gopher Table A1.
Description
GopherRepetitionFilter extends BaseFilter and evaluates a document against up to 13 repetition metrics (with the defaults: four line/paragraph duplicate-fraction checks, three top-n-gram checks, and six duplicate-n-gram checks). The filter splits text into paragraphs (by double newlines), lines (by single newlines), and words (via language-aware tokenization). It then computes duplicate fractions at each granularity and compares them against configurable thresholds. If any single metric exceeds its threshold, the document is rejected.
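As an illustration of one such metric, the duplicate-line fraction can be sketched in plain Python. This is a simplified re-implementation for clarity, not the datatrove source; the actual filter also weighs duplicate lines by character count and applies the other metrics listed above.

```python
from collections import Counter

def dup_line_fraction(text: str) -> float:
    """Fraction of lines that repeat an earlier line.

    Simplified sketch of the dup_line_frac metric; the real
    datatrove implementation may differ in detail.
    """
    lines = text.split("\n")
    counts = Counter(lines)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    return duplicates / len(lines) if lines else 0.0

doc = "hello\nworld\nhello\nhello"
frac = dup_line_fraction(doc)  # 2 of 4 lines repeat an earlier line -> 0.5
```

With the default threshold of 0.3, a document like the one above (fraction 0.5) would be rejected.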
Usage
Used as part of quality filtering after text extraction, typically as one of the first filters in a pipeline. Instantiate with default Gopher Table A1 thresholds or customize per metric. Pass None for any threshold to disable that specific check.
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/filters/gopher_repetition_filter.py (L73-142)
Signature:
class GopherRepetitionFilter(BaseFilter):
    def __init__(
        self,
        dup_line_frac: float | None = 0.3,
        dup_para_frac: float | None = 0.3,
        dup_line_char_frac: float | None = 0.2,
        dup_para_char_frac: float | None = 0.2,
        top_n_grams: tuple[tuple[int, float]] = ((2, 0.2), (3, 0.18), (4, 0.16)),
        dup_n_grams: tuple[tuple[int, float]] = (
            (5, 0.15), (6, 0.14), (7, 0.13),
            (8, 0.12), (9, 0.11), (10, 0.10),
        ),
        exclusion_writer: DiskWriter = None,
        language: str = Languages.english,
    ):
Import:
from datatrove.pipeline.filters import GopherRepetitionFilter
Key Helper Functions:
def find_duplicates(x: list[str]) -> tuple[int, int]:
    """Returns (duplicate_element_count, duplicate_char_count) using set-based matching."""

def find_top_duplicate(x: list[str]) -> int:
    """Returns character length of the most frequent n-gram times its count."""

def find_all_duplicate(words: list[str], n: int) -> int:
    """Returns total characters in all duplicated n-grams using sliding window."""
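A plain-Python sketch of these helpers, written from the documented contracts above. The actual datatrove implementations may differ in details, particularly in how overlapping repeated n-grams are counted.

```python
from collections import Counter

def find_duplicates(x: list[str]) -> tuple[int, int]:
    """(duplicate_element_count, duplicate_char_count): every occurrence
    beyond the first counts as a duplicate. Sketch, not datatrove code."""
    seen: set[str] = set()
    dup_elems = 0
    dup_chars = 0
    for el in x:
        if el in seen:
            dup_elems += 1
            dup_chars += len(el)
        else:
            seen.add(el)
    return dup_elems, dup_chars

def find_top_duplicate(ngrams: list[str]) -> int:
    """Character length of the most frequent n-gram times its count."""
    top, freq = Counter(ngrams).most_common(1)[0]
    return len(top) * freq

def find_all_duplicate(words: list[str], n: int) -> int:
    """Total characters covered by n-grams that appear more than once,
    via a sliding window; tracks the last covered word index so
    overlapping repeats are not double-counted."""
    seen: set[tuple[str, ...]] = set()
    repeated_chars = 0
    last_covered = 0  # word index up to which characters were already counted
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i : i + n])
        if ngram in seen:
            start = max(i, last_covered)
            repeated_chars += sum(len(w) for w in words[start : i + n])
            last_covered = i + n
        else:
            seen.add(ngram)
    return repeated_chars
```

The character counts feed the *_char_frac and n-gram fraction metrics: each is divided by the document's total character length before being compared to its threshold.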
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| dup_line_frac | float or None | No | Max fraction of duplicate lines (default 0.3) |
| dup_para_frac | float or None | No | Max fraction of duplicate paragraphs (default 0.3) |
| dup_line_char_frac | float or None | No | Max fraction of characters in duplicate lines (default 0.2) |
| dup_para_char_frac | float or None | No | Max fraction of characters in duplicate paragraphs (default 0.2) |
| top_n_grams | tuple[tuple[int, float]] | No | N-gram sizes and max top-frequency thresholds (default ((2, 0.2), (3, 0.18), (4, 0.16))) |
| dup_n_grams | tuple[tuple[int, float]] | No | N-gram sizes and max duplicate thresholds (default ((5, 0.15), ..., (10, 0.10))) |
| exclusion_writer | DiskWriter or None | No | Writer to save excluded documents (default None) |
| language | str | No | Language for word tokenization (default Languages.english) |
Filter Input: Document with plain text in doc.text
Filter Output: bool or tuple[bool, str] -- returns True if the document passes all repetition checks, or (False, reason_string) if any threshold is exceeded. The reason string identifies which specific metric failed (e.g., "dup_para_frac", "top_3_gram", "duplicated_7_n_grams").
Usage Examples
Example 1 -- Default Gopher Table A1 thresholds:
from datatrove.pipeline.filters import GopherRepetitionFilter
# Use default thresholds from Gopher paper Table A1
repetition_filter = GopherRepetitionFilter()
Example 2 -- Custom thresholds with stricter duplicate line check:
from datatrove.pipeline.filters import GopherRepetitionFilter
repetition_filter = GopherRepetitionFilter(
dup_line_frac=0.2, # stricter than default 0.3
dup_para_frac=0.3,
dup_line_char_frac=0.15, # stricter than default 0.2
dup_para_char_frac=0.2,
top_n_grams=((2, 0.2), (3, 0.18), (4, 0.16)),
dup_n_grams=((5, 0.15), (6, 0.14), (7, 0.13), (8, 0.12), (9, 0.11), (10, 0.10)),
)
Example 3 -- Disable specific checks by passing None:
from datatrove.pipeline.filters import GopherRepetitionFilter
# Only check n-gram duplication, skip line/paragraph checks
repetition_filter = GopherRepetitionFilter(
dup_line_frac=None,
dup_para_frac=None,
dup_line_char_frac=None,
dup_para_char_frac=None,
)
Example 4 -- In a pipeline:
from datatrove.pipeline.filters import GopherRepetitionFilter
from datatrove.pipeline.readers import JsonlReader
pipeline = [
JsonlReader("s3://my-bucket/input/"),
GopherRepetitionFilter(),
]