
Principle:Huggingface Datatrove Gopher Quality Heuristics

From Leeroopedia
Sources: Gopher (Rae et al. 2021)
Domains: Data_Quality, NLP
Last Updated: 2026-02-14 00:00 GMT

Overview

This principle applies the heuristic quality rules from DeepMind's Gopher paper to filter low-quality web text, using word count, word length, symbol ratios, structural indicators, and stop word presence.

Description

Quality heuristics from the Gopher paper filter documents based on a set of configurable rules that target different aspects of text quality:

  • Word count bounds (50-100K): Documents with too few words lack sufficient content, while documents with extremely high word counts are often data dumps or concatenated pages.
  • Average word length (3-10 chars): Average word lengths outside this range indicate non-natural-language content such as encoded data, URL lists, or single-character spam.
  • Symbol-to-word ratio (hash, ellipsis): A high ratio of hash symbols (#) or ellipsis sequences (the three-dot `...` or the Unicode `…`) relative to total word count indicates markup-heavy or truncated content.
  • Structural indicators (bullet lines, ellipsis endings): Documents where more than 90% of lines begin with bullet points or more than 30% end with ellipsis are likely navigation menus, table-of-contents pages, or truncated listings.
  • Alphabetic character ratio: At least 80% of words must contain at least one alphabetic character, filtering out documents dominated by numbers, symbols, or encoded data.
  • Stop word presence: Documents must contain at least 2 stop words (from a configurable list defaulting to common English function words), ensuring the text is natural language rather than keyword lists or code.

Each rule has a configurable threshold. A document is removed if it fails any single rule.
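The rules above can be sketched as a single pass over a document. The following is a minimal, illustrative re-implementation using the thresholds listed on this page; the function name, parameter names, and bullet-character set are assumptions for illustration, not DataTrove's actual API.

```python
# Default English stop word list from the Description section above.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def gopher_quality_ok(
    text: str,
    min_doc_words: int = 50,
    max_doc_words: int = 100_000,
    min_avg_word_length: float = 3.0,
    max_avg_word_length: float = 10.0,
    max_symbol_word_ratio: float = 0.1,   # assumed default for this sketch
    max_bullet_lines_ratio: float = 0.9,
    max_ellipsis_lines_ratio: float = 0.3,
    max_non_alpha_words_ratio: float = 0.2,
    min_stop_words: int = 2,
) -> bool:
    """Return True only if the document passes every heuristic rule."""
    words = text.split()
    n_words = len(words)

    # Word count bounds (50-100K by default).
    if not (min_doc_words <= n_words <= max_doc_words):
        return False

    # Average word length (3-10 characters).
    avg_len = sum(len(w) for w in words) / n_words
    if not (min_avg_word_length <= avg_len <= max_avg_word_length):
        return False

    # Symbol-to-word ratios: hash symbols and ellipsis sequences.
    if text.count("#") / n_words > max_symbol_word_ratio:
        return False
    if (text.count("...") + text.count("\u2026")) / n_words > max_symbol_word_ratio:
        return False

    # Structural indicators, computed per non-empty line.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if lines:
        bullet_starts = sum(ln.lstrip().startswith(("\u2022", "-", "*")) for ln in lines)
        if bullet_starts / len(lines) > max_bullet_lines_ratio:
            return False
        ellipsis_ends = sum(ln.rstrip().endswith(("...", "\u2026")) for ln in lines)
        if ellipsis_ends / len(lines) > max_ellipsis_lines_ratio:
            return False

    # Alphabetic character ratio: at least 80% of words contain a letter.
    non_alpha = sum(not any(c.isalpha() for c in w) for w in words)
    if non_alpha / n_words > max_non_alpha_words_ratio:
        return False

    # Stop word presence: at least min_stop_words occurrences.
    if sum(w.lower() in STOP_WORDS for w in words) < min_stop_words:
        return False

    return True
```

Because the function returns on the first failed rule, a document is removed as soon as any single check fails, matching the remove-on-any-failure semantics described above.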

Usage

Used as the core quality filtering step for web-crawled text data. Typically applied after language identification and before more targeted filters such as repetition or content-specific filtering. This filter is often the first line of defense in production pipelines like FineWeb.
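The ordering described above (language identification first, Gopher quality second, repetition filtering after) can be sketched as a simple filter chain. The stub predicates below are placeholders for illustration only; a production pipeline such as FineWeb uses real components (e.g. a fastText language identifier) rather than these toy checks.

```python
def looks_english(text: str) -> bool:
    # Stand-in for a real language identifier: crude ASCII-letter ratio.
    letters = sum(c.isalpha() for c in text)
    ascii_letters = sum(c.isascii() and c.isalpha() for c in text)
    return letters > 0 and ascii_letters / letters > 0.9


def gopher_quality_ok(text: str) -> bool:
    # Stand-in for the full Gopher heuristics: word count bounds only.
    return 50 <= len(text.split()) <= 100_000


def low_repetition(text: str) -> bool:
    # Stand-in for a dedicated repetition filter: unique-word ratio.
    words = text.lower().split()
    return bool(words) and len(set(words)) / len(words) > 0.2


# Filters run in the order described in the Usage section.
PIPELINE = [looks_english, gopher_quality_ok, low_repetition]


def keep(text: str) -> bool:
    """A document survives only if every stage in the chain accepts it."""
    return all(rule(text) for rule in PIPELINE)
```

Running cheap heuristics before expensive, targeted filters keeps pipeline cost proportional to the (much smaller) volume of documents that survive early stages.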

Theoretical Basis

The heuristics originate from the Gopher paper (Rae et al., 2021) and are designed to identify low-quality web content through multiple complementary signals:

  • Boilerplate detection: The bullet-line and ellipsis-ending rules target navigation bars, menus, and table-of-contents structures that are common in web crawls but lack informational content.
  • Non-natural-language detection: The symbol ratio, alphabetic character ratio, and average word length rules catch encoded data, markup, code snippets, and other non-prose content.
  • Degenerate text detection: The word count bounds and stop word requirements filter documents that are either too short to be useful or structurally abnormal for natural language (e.g., keyword-stuffed pages lacking function words).

The default stop word list for English is: the, be, to, of, and, that, have, with. These are among the most frequent function words in English and their absence is a strong signal that the text is not natural prose.
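The stop-word rule is cheap to compute. The sketch below applies the default list to a prose sentence and to a keyword-stuffed line; the helper name and punctuation stripping are illustrative assumptions.

```python
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def count_stop_words(text: str) -> int:
    """Count occurrences of default stop words, ignoring case and punctuation."""
    return sum(w.lower().strip(".,;:!?") in STOP_WORDS for w in text.split())


prose = "The model was trained to predict the next token, and that worked."
keywords = "cheap flights hotels deals booking discount travel insurance"
# Natural prose easily clears the 2-occurrence threshold; the
# keyword-stuffed line contains no function words at all.
```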
