Heuristic: Hugging Face Datatrove Gopher Quality Thresholds
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Quality |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Empirically validated quality filtering thresholds from the DeepMind Gopher paper for removing low-quality web-crawled documents based on word counts, word lengths, symbol ratios, and structural indicators.
Description
The Gopher quality heuristics are a set of document-level filters derived from DeepMind's research on training the Gopher language model. They define numerical thresholds for properties like minimum/maximum word count, average word length, symbol-to-word ratio, bullet point density, ellipsis frequency, alphabetic word ratio, and stop word presence. Documents failing any threshold are removed from the training corpus. These thresholds were empirically tuned on web-crawled data and have become a de facto standard in large-scale dataset curation.
Usage
Use these thresholds when building a quality filtering pipeline for web-crawled text data. They are specifically designed for English text (stop words are English) but can be adapted for other languages by changing the stop word list. Apply after URL filtering and text extraction, but before deduplication, as quality filtering is cheaper than dedup and reduces the corpus size early.
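As a sketch of this ordering (assuming datatrove is installed; the reader/writer paths and task count are placeholders, and the input is assumed to be already URL-filtered and text-extracted), a pipeline might look like:

```python
# Sketch of filter ordering in a datatrove pipeline. Paths are placeholders.
# URL filtering and text extraction happen upstream; deduplication is a
# separate, more expensive stage that consumes the filtered output.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import GopherQualityFilter, GopherRepetitionFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("extracted_text/"),   # already URL-filtered and extracted
        GopherRepetitionFilter(),         # Table A1 repetition thresholds
        GopherQualityFilter(),            # Gopher quality thresholds (defaults)
        JsonlWriter("filtered_output/"),  # feed this into the dedup stage
    ],
    tasks=4,
)
# executor.run()
```

Both filters are instantiated with their defaults here, which correspond to the Table A1 values.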
The Insight (Rule of Thumb)
Quality Thresholds (GopherQualityFilter defaults):
- min_doc_words: 50 — Documents with fewer than 50 words are too short to be useful training data.
- max_doc_words: 100,000 — Documents exceeding 100K words are likely concatenated or malformed.
- min_avg_word_length: 3 — Average word length below 3 characters signals gibberish, code, or symbol-heavy text.
- max_avg_word_length: 10 — Average word length above 10 characters signals non-natural-language content (URLs, encoded data).
- max_symbol_word_ratio: 0.1 — A symbol-to-word ratio above 10% (the symbols counted are "#" and the ellipsis "...") indicates non-prose content.
- max_bullet_lines_ratio: 0.9 — More than 90% bullet-pointed lines indicates a list page, not natural text.
- max_ellipsis_lines_ratio: 0.3 — More than 30% of lines ending in ellipsis indicates truncated or clickbait content.
- max_non_alpha_words_ratio: 0.8 — At least 80% of words must contain at least one alphabetic character; despite the "max" in the parameter name, 0.8 acts as the minimum alpha-word fraction (this catches code, markup, corrupted text).
- min_stop_words: 2 — Documents must contain at least 2 stop words ("the", "be", "to", "of", "and", "that", "have", "with") to be considered natural language.
- Trade-off: These thresholds are conservative — they remove clearly bad documents but may miss subtle quality issues. Stack with additional filters (C4, FineWeb) for higher quality.
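To make the quality thresholds concrete, here is a minimal reimplementation of a subset of the checks (word count, average word length, alpha-word ratio, stop words). This is an illustrative sketch, not the datatrove implementation: datatrove uses a proper word tokenizer and also applies the symbol, bullet, and ellipsis checks, while this version splits on whitespace and omits those checks.

```python
import re

# The stop word list from the Gopher heuristic
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def passes_gopher_quality(text: str,
                          min_doc_words: int = 50,
                          max_doc_words: int = 100_000,
                          min_avg_word_length: float = 3,
                          max_avg_word_length: float = 10,
                          min_alpha_words_ratio: float = 0.8,
                          min_stop_words: int = 2) -> bool:
    """Illustrative subset of the Gopher quality checks (whitespace-tokenized)."""
    words = text.split()
    n = len(words)
    # Word-count bounds: too short is useless, too long is likely malformed
    if not (min_doc_words <= n <= max_doc_words):
        return False
    # Average word length must fall in the natural-language band [3, 10]
    avg_len = sum(len(w) for w in words) / n
    if not (min_avg_word_length <= avg_len <= max_avg_word_length):
        return False
    # At least 80% of words must contain an alphabetic character
    alpha_words = sum(1 for w in words if re.search(r"[a-zA-Z]", w))
    if alpha_words / n < min_alpha_words_ratio:
        return False
    # At least 2 stop words must appear for the text to count as natural language
    stops = sum(1 for w in words if w.lower().strip(".,;:!?") in STOP_WORDS)
    return stops >= min_stop_words
```

A 60-word page of ordinary English prose passes all four checks, while a 60-word block of "#### $$$$ 1234" fails the alpha-word check even though its length and average word length are in range.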
Repetition Thresholds (GopherRepetitionFilter defaults, Table A1):
- dup_line_frac: 0.30 — Max 30% duplicate lines
- dup_para_frac: 0.30 — Max 30% duplicate paragraphs
- dup_line_char_frac: 0.20 — Max 20% duplicate line characters
- dup_para_char_frac: 0.20 — Max 20% duplicate paragraph characters
- Top n-gram fractions: (2, 0.20), (3, 0.18), (4, 0.16) — Decreasing thresholds for larger n-grams
- Duplicate n-gram fractions: (5, 0.15), (6, 0.14), (7, 0.13), (8, 0.12), (9, 0.11), (10, 0.10) — Exact phrase repetition is more damaging than small fragment repetition
- Trade-off: Lower thresholds catch more repetitive content but may remove legitimate documents with repeated structures (e.g., catalogs).
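Two of the repetition metrics can be sketched in a few lines. This is one reasonable reading of the Table A1 definitions, not datatrove's implementation: here a "duplicate line" is any occurrence beyond a line's first, and the top n-gram character fraction is approximated as (characters in the most frequent word n-gram) x (its occurrence count) over total characters.

```python
from collections import Counter

def dup_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    return duplicates / len(lines)

def top_ngram_char_fraction(text: str, n: int) -> float:
    """Approximate fraction of characters covered by the most frequent word n-gram."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    top, count = Counter(ngrams).most_common(1)[0]
    return (len(top) * count) / len(text)
```

A document would be dropped when `dup_line_fraction(text) > 0.30` or `top_ngram_char_fraction(text, 2) > 0.20` (and so on down the threshold table).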
Reasoning
These thresholds come from Table A1 of the Gopher paper (arXiv:2112.11446), which describes the quality filters applied to the MassiveText dataset used to train the 280B parameter Gopher model. The key insight is that simple statistical properties at the document level (word counts, character ratios, repetition metrics) are surprisingly effective at distinguishing high-quality natural language from web noise, boilerplate, and spam.
The repetition thresholds decrease as n-gram size increases because longer exact matches are stronger signals of machine-generated or template-based content. A repeated 2-gram might be coincidental, but a repeated 10-gram is almost certainly copy-pasted content.
The 80% alpha word ratio is particularly effective because it catches documents that are primarily code, HTML/XML markup, URL lists, or binary data rendered as text — all common artifacts in web crawls.
Code evidence from `src/datatrove/pipeline/filters/gopher_repetition_filter.py:11-27`:
"""
Table A1 from https://arxiv.org/pdf/2112.11446.pdf
duplicate line fraction 0.30
duplicate paragraph fraction 0.30
duplicate line character fraction 0.20
duplicate paragraph character fraction 0.20
top 2-gram character fraction 0.20
top 3-gram character fraction 0.18
top 4-gram character fraction 0.16
duplicate 5-gram character fraction 0.15
duplicate 6-gram character fraction 0.14
duplicate 7-gram character fraction 0.13
duplicate 8-gram character fraction 0.12
duplicate 9-gram character fraction 0.11
duplicate 10-gram character fraction 0.10
"""
Code evidence from `src/datatrove/pipeline/filters/gopher_quality_filter.py:16-29`:
def __init__(
    self,
    min_doc_words: int | None = 50,
    max_doc_words: int | None = 100000,
    min_avg_word_length: int | None = 3,
    max_avg_word_length: int | None = 10,
    max_symbol_word_ratio: float | None = 0.1,
    max_bullet_lines_ratio: float | None = 0.9,
    max_ellipsis_lines_ratio: float | None = 0.3,
    max_non_alpha_words_ratio: float | None = 0.8,
    min_stop_words: int | None = 2,
    ...