
Principle:Huggingface Datatrove URL Filtering

From Leeroopedia
Principle Name: URL_Filtering
Overview: Removing documents based on URL-level signals, including domain blocklists and banned word patterns
Domains: Data_Filtering, NLP, Web_Crawling
Related Implementation: Huggingface_Datatrove_URLFilter
Knowledge Sources: Huggingface_Datatrove
Last Updated: 2026-02-14 00:00 GMT

Overview

URL filtering is a preprocessing step that removes documents from a web crawl dataset based on URL-level signals, without examining the document content itself. Documents are rejected if their domain appears on a blocklist, their URL contains banned words or subwords, or their URL accumulates too many "soft-banned" word matches. This provides a fast, content-independent first pass that eliminates clearly unwanted sources before more expensive content-level analysis.

Description

URL filtering operates on the URL metadata of each document and applies multiple layers of rejection criteria:

  • Domain blocklist -- The registered domain (e.g., example.com) is checked against a curated set of blocked domains. Both the registered domain and the fully-qualified domain name (FQDN, including subdomains) are tested.
  • URL blocklist -- The full URL string is checked against a set of known blocked URLs.
  • Banned words -- The URL is tokenized by splitting on non-alphanumeric characters, and each token is checked against a set of banned words. A single match causes rejection.
  • Banned subwords -- Using an Aho-Corasick automaton for efficient multi-pattern substring matching, the normalized URL is scanned for banned substrings. This catches patterns that span URL token boundaries.
  • Soft-banned words -- Similar to banned words, but a configurable threshold (default: 2) of matching soft-banned words is required before rejection. This handles words that are individually ambiguous but collectively indicative of unwanted content.
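The layered rejection logic above can be sketched in plain Python. This is a simplified illustration, not datatrove's actual code: the blocklists and word sets below are invented placeholders (the real filter ships curated lists as package assets), and domain extraction is assumed to have happened already.

```python
import re

# Invented placeholder lists for illustration only.
BLOCKED_DOMAINS = {"spam-site.com"}
BLOCKED_URLS = {"https://example.com/banned-page"}
BANNED_WORDS = {"casino"}
SOFT_BANNED_WORDS = {"win", "prize", "free"}
SOFT_WORD_THRESHOLD = 2  # soft-ban matches required before rejection

def url_tokens(url: str) -> list[str]:
    """Tokenize the URL by splitting on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def should_reject(url: str, domain: str) -> bool:
    """Apply the rejection layers in order; any hit rejects the document."""
    if domain in BLOCKED_DOMAINS:  # domain blocklist
        return True
    if url in BLOCKED_URLS:        # URL blocklist
        return True
    tokens = url_tokens(url)
    if any(t in BANNED_WORDS for t in tokens):  # hard-banned words: one match rejects
        return True
    soft_hits = sum(t in SOFT_BANNED_WORDS for t in tokens)
    return soft_hits >= SOFT_WORD_THRESHOLD    # soft-ban count threshold
```

For example, `win-a-free-prize` in a URL path contributes three soft-ban hits and is rejected, while a single occurrence of `free` alone is not.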

The filter relies on curated blocklists that are shipped as bundled assets within the datatrove package and can be extended with user-supplied lists.

Usage

URL filtering is applied as the first filter step after reading web crawl data, before any content-level filtering. Because it operates only on URL metadata, it is computationally inexpensive and can quickly reduce the volume of data that must undergo more costly text-based analysis.

Typical pipeline position:

  1. Read raw documents from WARC or other source
  2. URL filter (this principle)
  3. HTML text extraction
  4. Language filtering
  5. Quality/content filtering
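The ordering above can be viewed schematically as a chain of generators over document records, with the cheap URL check discarding documents before any text processing runs. The stage bodies here are trivial stand-ins for illustration, not datatrove's API:

```python
# Illustrative stand-ins for pipeline stages; real stages do far more work.
def url_filter(docs):
    for doc in docs:
        if "spam" not in doc["url"]:       # stand-in for the real URL checks
            yield doc

def extract_text(docs):
    for doc in docs:
        doc["text"] = doc["html"].strip()  # stand-in for HTML extraction
        yield doc

def language_filter(docs, lang="en"):
    for doc in docs:
        if doc.get("lang") == lang:
            yield doc

docs = [
    {"url": "https://spam.example/x", "html": " junk ", "lang": "en"},
    {"url": "https://news.example/a", "html": " Article body ", "lang": "en"},
    {"url": "https://site.example/fr", "html": " Texte ", "lang": "fr"},
]

# Composing the stages preserves the pipeline order: URL filtering runs
# first, so rejected documents never reach the more expensive stages.
kept = list(language_filter(extract_text(url_filter(docs))))
```

Because the stages are lazy generators, a document rejected by the URL filter is never touched by extraction or language identification, which is the point of placing the cheapest filter first.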

Theoretical Basis

The theoretical underpinning of URL filtering draws on several concepts:

  • Blocklist-based filtering -- A well-established technique in web content filtering. Curated domain and URL blocklists encode human editorial judgment about which sources are inappropriate for a given corpus.
  • Aho-Corasick automaton -- A finite-state machine that efficiently matches multiple patterns against a single input string in linear time. Given a set of k patterns with total length m, the automaton is built in O(m) time and matches against a string of length n in O(n + number_of_matches) time. This is critical for performance when scanning millions of URLs against hundreds of banned subword patterns.
  • Soft thresholding -- Rather than treating each signal as a hard reject, soft-banned words use a count threshold. This allows words that are individually benign but collectively suspicious to trigger rejection, reducing false positives from individually ambiguous words while still catching URLs where several of them co-occur.
  • URL normalization -- URLs are normalized by lowercasing and stripping non-alphanumeric characters before matching. This prevents circumvention through case variation or separator insertion (e.g., free-bet versus freebet).
  • TLD extraction -- The tldextract library parses URLs into their registered domain, subdomain, and suffix components, enabling accurate domain-level matching even for complex public suffix patterns (e.g., co.uk).
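The normalization and multi-pattern scan described above can be sketched together. This is a didactic re-implementation of the Aho-Corasick construction and search (trie plus BFS failure links), not datatrove's actual code, and the patterns below are invented examples:

```python
import re
from collections import deque

def normalize(url: str) -> str:
    """Lowercase and strip non-alphanumerics before matching."""
    return re.sub(r"[^a-z0-9]", "", url.lower())

def build_automaton(patterns):
    """Build an Aho-Corasick automaton: a trie plus BFS failure links.

    Construction is O(total pattern length).
    """
    trie, fail, out = [{}], [0], [set()]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]:
                trie.append({}); fail.append(0); out.append(set())
                trie[node][ch] = len(trie) - 1
            node = trie[node][ch]
        out[node].add(pat)
    queue = deque(trie[0].values())  # root's children keep failure link 0
    while queue:
        node = queue.popleft()
        for ch, child in trie[node].items():
            f = fail[node]
            while f and ch not in trie[f]:
                f = fail[f]
            fail[child] = trie[f].get(ch, 0)
            out[child] |= out[fail[child]]  # inherit matches via failure link
            queue.append(child)
    return trie, fail, out

def search(text, trie, fail, out):
    """Scan text once, in O(n + matches), yielding (start, pattern) pairs."""
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in trie[node]:
            node = fail[node]
        node = trie[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

After normalization, a URL like `https://Example.com/Free-Bet/promo` becomes `httpsexamplecomfreebetpromo`, so the subword pattern `freebet` matches even though it spans a token boundary in the original URL; this is exactly the case the banned-subwords layer is designed to catch.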

Related Pages

Implementation:Huggingface_Datatrove_URLFilter
