Implementation:Huggingface Datatrove GopherQualityFilter

Sources: Huggingface Datatrove, Gopher (Rae et al. 2021)
Domains: Data_Quality, NLP
Last Updated: 2026-02-14 00:00 GMT

Overview

A filter that applies the quality heuristics from the Gopher paper (Rae et al. 2021) to remove low-quality web text, based on word count, average word length, symbol ratios, structural indicators (bullet and ellipsis lines), and stop-word presence.

Description

GopherQualityFilter extends BaseFilter and evaluates each document against a battery of heuristic rules. The text is split into words with a language-aware tokenizer; word count and average word length are computed over non-symbol words only (words containing at least one non-punctuation character). Rules are checked sequentially, and the filter returns False with a descriptive reason string on the first failure. Every threshold accepts None, which disables the corresponding check.
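
The non-symbol word rule can be illustrated with a minimal sketch. This is not the library's code: whitespace splitting stands in for the language-aware tokenizer, string.punctuation stands in for the library's punctuation set, and the names non_symbol_words and word_stats are hypothetical.

```python
import string

# Assumption: string.punctuation approximates the punctuation set used by the filter.
PUNCTUATION = set(string.punctuation)


def non_symbol_words(words):
    """Keep words that contain at least one non-punctuation character."""
    return [w for w in words if any(ch not in PUNCTUATION for ch in w)]


def word_stats(text):
    """Return (non-symbol word count, average non-symbol word length)."""
    # Assumption: plain whitespace splitting replaces the real tokenizer.
    words = text.split()
    kept = non_symbol_words(words)
    n = len(kept)
    avg_len = sum(len(w) for w in kept) / n if n else 0.0
    return n, avg_len


# Tokens made only of punctuation (",", "...", "!") are ignored.
n_words, avg_word_length = word_stats("Hello , world ... this is fine !")
```

With the default thresholds, a document would need at least 50 such words and an average length between 3 and 10 characters to pass the first four checks.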

Usage

Used as the core quality filtering step for web-crawled text. Typically placed early in the pipeline after language identification and before repetition or content-specific filtering.

Code Reference

Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/filters/gopher_quality_filter.py (L13-125)

Signature:

class GopherQualityFilter(BaseFilter):
    def __init__(
        self,
        min_doc_words: int | None = 50,
        max_doc_words: int | None = 100000,
        min_avg_word_length: int | None = 3,
        max_avg_word_length: int | None = 10,
        max_symbol_word_ratio: float | None = 0.1,
        max_bullet_lines_ratio: float | None = 0.9,
        max_ellipsis_lines_ratio: float | None = 0.3,
        max_non_alpha_words_ratio: float | None = 0.8,
        min_stop_words: int | None = 2,
        stop_words: list[str] | None = None,
        exclusion_writer: DiskWriter = None,
        language: str = Languages.english,
    ):

Import:

from datatrove.pipeline.filters import GopherQualityFilter

Default Stop Words:

STOP_WORDS = ["the", "be", "to", "of", "and", "that", "have", "with"]
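
As a rough illustration of the stop-word rule, the sketch below counts how many distinct stop words occur in a document, following the "unique stop words" wording used in the I/O contract on this page. The function name and the exact matching (case-sensitive, on already-tokenized words) are assumptions, not the library's implementation.

```python
STOP_WORDS = ["the", "be", "to", "of", "and", "that", "have", "with"]


def has_enough_stop_words(words, stop_words=STOP_WORDS, min_stop_words=2):
    """Return True if at least min_stop_words distinct stop words appear in words."""
    # Assumption: words are compared exactly as given (no lowercasing here).
    present = set(stop_words).intersection(words)
    return len(present) >= min_stop_words


passes = has_enough_stop_words("the cat sat with the dog".split())
fails = has_enough_stop_words("the the the".split())
```

In this sketch, repeating a single stop word does not help a document pass; two different stop words are needed under the default setting.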

I/O Contract

Inputs:

Parameter | Type | Required | Description
min_doc_words | int or None | No | Minimum non-symbol word count (default 50)
max_doc_words | int or None | No | Maximum non-symbol word count (default 100000)
min_avg_word_length | int or None | No | Minimum average word length in characters (default 3)
max_avg_word_length | int or None | No | Maximum average word length in characters (default 10)
max_symbol_word_ratio | float or None | No | Max ratio of # symbols, and separately of ellipses, to total words (default 0.1)
max_bullet_lines_ratio | float or None | No | Max fraction of lines starting with bullet points (default 0.9)
max_ellipsis_lines_ratio | float or None | No | Max fraction of lines ending with ellipsis (default 0.3)
max_non_alpha_words_ratio | float or None | No | Despite the name, the minimum fraction of words that must contain at least one alphabetic character (default 0.8)
min_stop_words | int or None | No | Min number of unique stop words present (default 2)
stop_words | list[str] or None | No | Custom stop word list (default: Gopher English stop words)
exclusion_writer | DiskWriter or None | No | Writer to save excluded documents (default None)
language | str | No | Language for word tokenization (default Languages.english)
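
The ratio-based parameters above can be illustrated with a minimal sketch. It is not the library's code: the function name, the bullet markers ("•" and "-"), and the use of splitlines are assumptions made for illustration.

```python
def symbol_and_line_ratios(text, n_words):
    """Compute the four ratios that the symbol and line-structure checks compare to thresholds."""
    lines = text.splitlines()
    n_lines = max(len(lines), 1)   # avoid division by zero on empty text
    n_words = max(n_words, 1)
    return {
        # Compared against max_symbol_word_ratio (hashes and ellipses separately).
        "hash": text.count("#") / n_words,
        "ellipsis": (text.count("...") + text.count("…")) / n_words,
        # Compared against max_bullet_lines_ratio.
        # Assumption: "•" and "-" are the bullet markers.
        "bullet_lines": sum(l.lstrip().startswith(("•", "-")) for l in lines) / n_lines,
        # Compared against max_ellipsis_lines_ratio.
        "ellipsis_lines": sum(l.rstrip().endswith(("...", "…")) for l in lines) / n_lines,
    }


sample = "- item one\n- item two\nplain line...\n# heading"
ratios = symbol_and_line_ratios(sample, n_words=10)
```

Under the defaults, this sample would pass the bullet-line check (0.5 is not above 0.9) and the ellipsis-line check (0.25 is not above 0.3).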

Filter Input: Document with plain text in doc.text

Filter Output: bool or tuple[bool, str] -- returns True if the document passes all quality checks, or (False, reason_string) if any rule fails. Reason strings include:

Reason String | Rule
"gopher_short_doc" | Word count below min_doc_words
"gopher_long_doc" | Word count above max_doc_words
"gopher_below_avg_threshold" | Average word length below minimum
"gopher_above_avg_threshold" | Average word length above maximum
"gopher_too_many_hashes" | Hash symbol ratio too high
"gopher_too_many_ellipsis" | Ellipsis symbol ratio too high
"gopher_too_many_bullets" | Too many bullet-point lines
"gopher_too_many_end_ellipsis" | Too many lines ending with ellipsis
"gopher_below_alpha_threshold" | Too few words with alphabetic characters
"gopher_enough_stop_words" | Too few stop words present

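Because the filter output is either a bare True or a (False, reason) tuple, code that inspects results directly may want to normalize both shapes. The sketch below is one hypothetical way to do that and to tally rejection reasons; split_result is not part of datatrove, and the results list is made-up sample data.

```python
from collections import Counter


def split_result(result):
    """Normalize a filter result into a (passed, reason) pair."""
    if isinstance(result, tuple):
        passed, reason = result
        return passed, reason
    return bool(result), None


# Hypothetical results, shaped like the output described above.
results = [
    True,
    (False, "gopher_short_doc"),
    (False, "gopher_short_doc"),
    (False, "gopher_too_many_hashes"),
]

# Count how often each rule rejected a document.
reasons = Counter(
    reason for passed, reason in map(split_result, results) if not passed
)
```

A tally like this can help decide which threshold to relax when a filter rejects more documents than expected.
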
Usage Examples

Example 1 -- Default Gopher heuristics:

from datatrove.pipeline.filters import GopherQualityFilter

quality_filter = GopherQualityFilter()

Example 2 -- Custom thresholds for shorter documents:

from datatrove.pipeline.filters import GopherQualityFilter

quality_filter = GopherQualityFilter(
    min_doc_words=20,
    max_doc_words=50000,
    min_avg_word_length=2,
    max_avg_word_length=12,
)

Example 3 -- Custom stop words for a specific domain:

from datatrove.pipeline.filters import GopherQualityFilter

quality_filter = GopherQualityFilter(
    stop_words=["the", "is", "a", "for", "in", "on", "to"],
    min_stop_words=3,
)

Example 4 -- In a pipeline with exclusion logging:

from datatrove.pipeline.filters import GopherQualityFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

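pipeline = [
    # Read input documents from JSONL files.
    JsonlReader("s3://my-bucket/input/"),
    GopherQualityFilter(
        # Documents that fail a check are saved here for later inspection.
        exclusion_writer=JsonlWriter("s3://my-bucket/excluded/gopher_quality/"),
    ),
]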
