Implementation:Huggingface Datatrove GopherQualityFilter
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove, Gopher (Rae et al. 2021) | Data_Quality, NLP | 2026-02-14 00:00 GMT |
Overview
A filter that applies the quality heuristics from the Gopher paper (Rae et al. 2021) to remove low-quality web text. It checks word count, average word length, symbol ratios, structural indicators (bullet-point and ellipsis lines), and stop word presence.
Description
GopherQualityFilter extends BaseFilter and evaluates a document against a battery of heuristic rules. Words are split using a language-aware tokenizer, and non-symbol words (words containing at least one non-punctuation character) are used for word count and average length calculations. The filter checks each rule sequentially and returns False with a descriptive reason string on the first failure.
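The non-symbol word rule and the first-failure behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: whitespace splitting stands in for the language-aware tokenizer, ASCII punctuation stands in for the library's punctuation set, and the helper names `non_symbol_words` and `check_length_rules` are hypothetical.

```python
import string

# Assumption: ASCII punctuation approximates the library's punctuation set.
PUNCTUATION = set(string.punctuation)


def non_symbol_words(words: list[str]) -> list[str]:
    """Keep words that contain at least one non-punctuation character."""
    return [w for w in words if any(ch not in PUNCTUATION for ch in w)]


def check_length_rules(
    text: str,
    min_doc_words: int = 50,
    max_doc_words: int = 100_000,
    min_avg_word_length: int = 3,
    max_avg_word_length: int = 10,
):
    """Return True, or (False, reason) for the first rule that fails."""
    # Assumption: whitespace splitting replaces the language-aware tokenizer.
    words = non_symbol_words(text.split())
    n_words = len(words)
    if n_words < min_doc_words:
        return False, "gopher_short_doc"
    if n_words > max_doc_words:
        return False, "gopher_long_doc"
    # Guard against division by zero when min_doc_words is set to 0.
    avg_length = sum(len(w) for w in words) / n_words if n_words else 0.0
    if avg_length < min_avg_word_length:
        return False, "gopher_below_avg_threshold"
    if avg_length > max_avg_word_length:
        return False, "gopher_above_avg_threshold"
    return True
```

Because the rules run in order, a document that is both too short and has unusually short words is reported only as `gopher_short_doc`.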
Usage
Used as the core quality filtering step for web-crawled text. Typically placed early in the pipeline after language identification and before repetition or content-specific filtering.
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/filters/gopher_quality_filter.py (L13-125)
Signature:
class GopherQualityFilter(BaseFilter):
    def __init__(
        self,
        min_doc_words: int | None = 50,
        max_doc_words: int | None = 100000,
        min_avg_word_length: int | None = 3,
        max_avg_word_length: int | None = 10,
        max_symbol_word_ratio: float | None = 0.1,
        max_bullet_lines_ratio: float | None = 0.9,
        max_ellipsis_lines_ratio: float | None = 0.3,
        max_non_alpha_words_ratio: float | None = 0.8,
        min_stop_words: int | None = 2,
        stop_words: list[str] | None = None,
        exclusion_writer: DiskWriter = None,
        language: str = Languages.english,
    ):
Import:
from datatrove.pipeline.filters import GopherQualityFilter
Default Stop Words:
STOP_WORDS = ["the", "be", "to", "of", "and", "that", "have", "with"]
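As a rough illustration of how a stop word rule over this list might behave, the sketch below counts distinct stop words in a document. It assumes whitespace tokenization, case-sensitive matching, and a distinct-word count; the library's exact counting may differ, and `has_enough_stop_words` is a hypothetical helper, not part of datatrove.

```python
STOP_WORDS = ["the", "be", "to", "of", "and", "that", "have", "with"]


def has_enough_stop_words(
    text: str,
    stop_words: list[str] = STOP_WORDS,
    min_stop_words: int = 2,
) -> bool:
    """Return True when the text contains enough distinct stop words."""
    # Assumption: whitespace split, no lowercasing, distinct words only.
    present = set(text.split()) & set(stop_words)
    return len(present) >= min_stop_words
```

Under these assumptions, keyword-stuffed text with no function words fails the check, while ordinary English sentences usually pass it quickly.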
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| min_doc_words | int or None | No | Minimum non-symbol word count (default 50) |
| max_doc_words | int or None | No | Maximum non-symbol word count (default 100000) |
| min_avg_word_length | int or None | No | Minimum average word length in characters (default 3) |
| max_avg_word_length | int or None | No | Maximum average word length in characters (default 10) |
| max_symbol_word_ratio | float or None | No | Max ratio of # or ellipsis symbols to total words (default 0.1) |
| max_bullet_lines_ratio | float or None | No | Max fraction of lines starting with bullet points (default 0.9) |
| max_ellipsis_lines_ratio | float or None | No | Max fraction of lines ending with ellipsis (default 0.3) |
| max_non_alpha_words_ratio | float or None | No | Despite the name, the minimum fraction of words that must contain at least one alphabetic character (default 0.8) |
| min_stop_words | int or None | No | Min number of unique stop words present (default 2) |
| stop_words | list[str] or None | No | Custom stop word list (default: Gopher English stop words) |
| exclusion_writer | DiskWriter or None | No | Writer to save excluded documents (default None) |
| language | str | No | Language for word tokenization (default Languages.english) |
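The ratio parameters in the table can be illustrated with a small standalone sketch. It is an approximation under stated assumptions rather than the library's implementation: words come from a whitespace split, bullet lines are assumed to start with "•" or "-", ellipses are assumed to be either "..." or "…", and `symbol_and_line_checks` is a hypothetical name.

```python
def symbol_and_line_checks(
    text: str,
    max_symbol_word_ratio: float = 0.1,
    max_bullet_lines_ratio: float = 0.9,
    max_ellipsis_lines_ratio: float = 0.3,
    max_non_alpha_words_ratio: float = 0.8,
):
    """Return True, or (False, reason) for the first ratio rule that fails."""
    words = text.split()  # assumption: whitespace tokenization
    if not words:
        raise ValueError("empty document")
    lines = text.splitlines()
    n_words, n_lines = len(words), len(lines)

    # Symbol-to-word ratios for hashes and ellipses.
    if text.count("#") / n_words > max_symbol_word_ratio:
        return False, "gopher_too_many_hashes"
    if (text.count("...") + text.count("…")) / n_words > max_symbol_word_ratio:
        return False, "gopher_too_many_ellipsis"

    # Fraction of lines that look like bullets (assumed markers: "•", "-").
    bullets = sum(line.lstrip().startswith(("•", "-")) for line in lines)
    if bullets / n_lines > max_bullet_lines_ratio:
        return False, "gopher_too_many_bullets"

    # Fraction of lines that end with an ellipsis.
    trailing = sum(line.rstrip().endswith(("...", "…")) for line in lines)
    if trailing / n_lines > max_ellipsis_lines_ratio:
        return False, "gopher_too_many_end_ellipsis"

    # Fraction of words with at least one alphabetic character; the
    # threshold acts as a minimum even though the parameter is named "max".
    alpha = sum(any(ch.isalpha() for ch in word) for word in words)
    if alpha / n_words < max_non_alpha_words_ratio:
        return False, "gopher_below_alpha_threshold"
    return True
```

Note the direction of each comparison: the first four rules reject documents that exceed a ceiling, while the alphabetic rule rejects documents that fall below a floor.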
Filter Input: Document with plain text in doc.text
Filter Output: bool or tuple[bool, str] -- returns True if the document passes all quality checks, or (False, reason_string) if any rule fails. Reason strings include:
| Reason String | Rule |
|---|---|
| "gopher_short_doc" | Word count below min_doc_words |
| "gopher_long_doc" | Word count above max_doc_words |
| "gopher_below_avg_threshold" | Average word length below minimum |
| "gopher_above_avg_threshold" | Average word length above maximum |
| "gopher_too_many_hashes" | Hash symbol ratio too high |
| "gopher_too_many_ellipsis" | Ellipsis symbol ratio too high |
| "gopher_too_many_bullets" | Too many bullet-point lines |
| "gopher_too_many_end_ellipsis" | Too many lines ending with ellipsis |
| "gopher_below_alpha_threshold" | Too few words with alphabetic characters |
| "gopher_enough_stop_words" | Too few stop words present |
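Since the filter result is either a bare bool or a `(False, reason_string)` tuple, code that inspects results has to handle both shapes. The following sketch shows one way to normalize them and tally rejection reasons; `unpack_filter_result` and `tally_rejections` are hypothetical helpers written for illustration, not datatrove functions.

```python
from collections import Counter


def unpack_filter_result(result):
    """Normalize True / False / (False, reason) into a (keep, reason) pair."""
    if isinstance(result, tuple):
        keep, reason = result
        return keep, reason
    return bool(result), None


def tally_rejections(results) -> Counter:
    """Count how many documents were rejected for each reason string."""
    counts = Counter()
    for result in results:
        keep, reason = unpack_filter_result(result)
        if not keep:
            # A bare False carries no reason, so it is grouped as "unknown".
            counts[reason or "unknown"] += 1
    return counts
```

A tally like this can help when tuning thresholds, since it shows which rule removes the most documents.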
Usage Examples
Example 1 -- Default Gopher heuristics:
from datatrove.pipeline.filters import GopherQualityFilter
quality_filter = GopherQualityFilter()
Example 2 -- Custom thresholds for shorter documents:
from datatrove.pipeline.filters import GopherQualityFilter
quality_filter = GopherQualityFilter(
    min_doc_words=20,
    max_doc_words=50000,
    min_avg_word_length=2,
    max_avg_word_length=12,
)
Example 3 -- Custom stop words for a specific domain:
from datatrove.pipeline.filters import GopherQualityFilter
quality_filter = GopherQualityFilter(
    stop_words=["the", "is", "a", "for", "in", "on", "to"],
    min_stop_words=3,
)
Example 4 -- In a pipeline with exclusion logging:
from datatrove.pipeline.filters import GopherQualityFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter
pipeline = [
    JsonlReader("s3://my-bucket/input/"),
    GopherQualityFilter(
        exclusion_writer=JsonlWriter("s3://my-bucket/excluded/gopher_quality/"),
    ),
]