
Implementation:Huggingface Datatrove GopherRepetitionFilter

From Leeroopedia
Sources: Huggingface Datatrove; Gopher (Rae et al. 2021), Table A1
Domains: Data_Quality, NLP
Last Updated: 2026-02-14 00:00 GMT

Overview

Filter implementation that removes documents with excessive textual repetition, checking duplicate lines, duplicate paragraphs, top n-gram fractions, and duplicate n-gram fractions against configurable thresholds from Gopher Table A1.

Description

GopherRepetitionFilter extends BaseFilter and evaluates a document against up to 13 repetition metrics: four duplicate-fraction checks over lines and paragraphs, three top-n-gram checks, and six duplicate-n-gram checks. The filter splits text into paragraphs (on double newlines), lines (on single newlines), and words (via language-aware tokenization), computes duplicate fractions at each granularity, and compares each against its configurable threshold. If any single metric exceeds its threshold, the document is rejected.
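The simplest of these checks, the duplicate-line fraction, can be sketched as follows. This is a minimal illustration of the idea, not the library code; the real filter also tracks character fractions and the n-gram statistics described above:

```python
def dup_line_fraction(text: str) -> float:
    """Fraction of lines that are exact repeats of an earlier line (sketch)."""
    lines = text.splitlines()
    if not lines:
        return 0.0
    seen = set()
    duplicates = 0
    for line in lines:
        if line in seen:
            duplicates += 1  # line already appeared earlier in the document
        else:
            seen.add(line)
    return duplicates / len(lines)
```

Under the default dup_line_frac threshold of 0.3, a document where this fraction exceeds 0.3 would be rejected.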

Usage

Used as part of quality filtering after text extraction, typically as one of the first filters in a pipeline. Instantiate with default Gopher Table A1 thresholds or customize per metric. Pass None for any threshold to disable that specific check.

Code Reference

Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/filters/gopher_repetition_filter.py (L73-142)

Signature:

class GopherRepetitionFilter(BaseFilter):
    def __init__(
        self,
        dup_line_frac: float | None = 0.3,
        dup_para_frac: float | None = 0.3,
        dup_line_char_frac: float | None = 0.2,
        dup_para_char_frac: float | None = 0.2,
        top_n_grams: tuple[tuple[int, float]] = ((2, 0.2), (3, 0.18), (4, 0.16)),
        dup_n_grams: tuple[tuple[int, float]] = (
            (5, 0.15), (6, 0.14), (7, 0.13),
            (8, 0.12), (9, 0.11), (10, 0.10),
        ),
        exclusion_writer: DiskWriter = None,
        language: str = Languages.english,
    ):

Import:

from datatrove.pipeline.filters import GopherRepetitionFilter

Key Helper Functions:

def find_duplicates(x: list[str]) -> tuple[int, int]:
    """Returns (duplicate_element_count, duplicate_char_count) using set-based matching."""

def find_top_duplicate(x: list[str]) -> int:
    """Returns character length of the most frequent n-gram times its count."""

def find_all_duplicate(words: list[str], n: int) -> int:
    """Returns total characters in all duplicated n-grams using sliding window."""

I/O Contract

Inputs:

Parameter Type Required Description
dup_line_frac float or None No Max fraction of duplicate lines (default 0.3)
dup_para_frac float or None No Max fraction of duplicate paragraphs (default 0.3)
dup_line_char_frac float or None No Max fraction of characters in duplicate lines (default 0.2)
dup_para_char_frac float or None No Max fraction of characters in duplicate paragraphs (default 0.2)
top_n_grams tuple[tuple[int, float]] No N-gram sizes and max top-frequency thresholds (default ((2, 0.2), (3, 0.18), (4, 0.16)))
dup_n_grams tuple[tuple[int, float]] No N-gram sizes and max duplicate thresholds (default ((5, 0.15), ..., (10, 0.10)))
exclusion_writer DiskWriter or None No Writer to save excluded documents (default None)
language str No Language for word tokenization (default Languages.english)

Filter Input: Document with plain text in doc.text

Filter Output: bool or tuple[bool, str] -- returns True if the document passes all repetition checks, or (False, reason_string) if any threshold is exceeded. The reason string identifies which specific metric failed (e.g., "dup_para_frac", "top_3_gram", "duplicated_7_n_grams").
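To illustrate how a reason string such as "top_3_gram" might arise, here is a rough sketch of the kind of computation behind the top_n_grams check: the fraction of characters covered by the single most frequent n-gram. This is illustrative only; the library's exact tokenization and counting may differ:

```python
from collections import Counter

def top_ngram_char_fraction(words: list[str], n: int) -> float:
    """Fraction of word characters covered by the most frequent n-gram (sketch)."""
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    top, count = Counter(ngrams).most_common(1)[0]
    top_chars = sum(len(w) for w in top) * count  # chars in all occurrences of the top n-gram
    total_chars = sum(len(w) for w in words)
    return top_chars / total_chars
```

If this fraction exceeds the threshold configured for that n (e.g. 0.18 for trigrams), the document is rejected with the corresponding reason string.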

Usage Examples

Example 1 -- Default Gopher Table A1 thresholds:

from datatrove.pipeline.filters import GopherRepetitionFilter

# Use default thresholds from Gopher paper Table A1
repetition_filter = GopherRepetitionFilter()

Example 2 -- Custom thresholds with stricter duplicate line check:

from datatrove.pipeline.filters import GopherRepetitionFilter

repetition_filter = GopherRepetitionFilter(
    dup_line_frac=0.2,           # stricter than default 0.3
    dup_para_frac=0.3,
    dup_line_char_frac=0.15,     # stricter than default 0.2
    dup_para_char_frac=0.2,
    top_n_grams=((2, 0.2), (3, 0.18), (4, 0.16)),
    dup_n_grams=((5, 0.15), (6, 0.14), (7, 0.13), (8, 0.12), (9, 0.11), (10, 0.10)),
)

Example 3 -- Disable specific checks by passing None:

from datatrove.pipeline.filters import GopherRepetitionFilter

# Only check n-gram duplication, skip line/paragraph checks
repetition_filter = GopherRepetitionFilter(
    dup_line_frac=None,
    dup_para_frac=None,
    dup_line_char_frac=None,
    dup_para_char_frac=None,
)

Example 4 -- In a pipeline:

from datatrove.pipeline.filters import GopherRepetitionFilter
from datatrove.pipeline.readers import JsonlReader

pipeline = [
    JsonlReader("s3://my-bucket/input/"),
    GopherRepetitionFilter(),
]
