
Implementation:Huggingface Datatrove URLFilter

From Leeroopedia
Knowledge Sources
Domains Data_Filtering, NLP, Web_Crawling
Type Filter Module
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete filter class that removes documents from a datatrove pipeline based on URL-level signals alone. It applies domain blocklists, full-URL blocklists, banned word/subword pattern matching, and soft-banned word counting to reject documents without examining their content.

Description

The URLFilter class extends BaseFilter and implements multi-layered URL-based rejection logic. On initialization, it loads curated blocklists from bundled assets (if use_integrated_lists=True) and builds an Aho-Corasick automaton for efficient substring matching of banned subwords.

The filter method processes each document through the following rejection cascade:

  1. Domain check -- Reject if the registered domain is in block_listed_domains
  2. Subdomain check -- Reject if the FQDN (including subdomains) is in block_listed_domains
  3. Full URL check -- Reject if the exact URL is in block_listed_url
  4. Banned word check -- Reject if any token from the URL (split on non-alphanumeric characters) matches banned_words
  5. Soft-banned word count -- Reject if the count of matching soft_banned_words tokens meets or exceeds soft_word_threshold
  6. Banned subword scan -- Reject if any entry in banned_subwords appears as a substring of the normalized URL (detected via the Aho-Corasick automaton)

Each rejection returns a reason string (e.g., "domain", "subdomain", "hard_blacklisted") for diagnostic tracking.
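The cascade above can be sketched in plain Python. This is a simplified, hypothetical re-creation for illustration only: the real filter uses tldextract to separate the registered domain from the full FQDN and an Aho-Corasick automaton for step 6, and all blocklist contents below are invented.

```python
import re
from urllib.parse import urlparse

# Illustrative blocklists (not datatrove's bundled assets).
BLOCKED_DOMAINS = {"spam-domain.com"}
BLOCKED_URLS = {"https://example.com/known-bad-page"}
BANNED_WORDS = {"casino"}
SOFT_BANNED_WORDS = {"bet", "poker"}
BANNED_SUBWORDS = ["xxx"]
SOFT_WORD_THRESHOLD = 2

def check_url(url: str) -> tuple[bool, str]:
    host = urlparse(url).netloc.lower()
    # 1-2. Domain and subdomain checks (the real code uses tldextract here).
    if host in BLOCKED_DOMAINS:
        return False, "domain"
    if any(host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return False, "subdomain"
    # 3. Exact URL check.
    if url in BLOCKED_URLS:
        return False, "url"
    # 4-5. Tokenize the URL on non-alphanumeric characters.
    words = [w for w in re.split(r"[^a-z0-9]+", url.lower()) if w]
    if any(w in BANNED_WORDS for w in words):
        return False, "hard_blacklisted"
    if sum(w in SOFT_BANNED_WORDS for w in words) >= SOFT_WORD_THRESHOLD:
        return False, "soft_blacklisted"
    # 6. Substring scan over the normalized URL (Aho-Corasick in the real code).
    if any(sub in url.lower() for sub in BANNED_SUBWORDS):
        return False, "blacklisted_subword"
    return True, ""
```

Note the ordering: the cheapest set-membership checks run first, and the substring scan runs last, so most rejected documents exit early.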

Usage

Use URLFilter as an early-stage filter in a datatrove pipeline, typically the first filter after reading documents from a web crawl source. It is computationally inexpensive and significantly reduces the volume of data flowing to more expensive downstream steps.
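A minimal pipeline wiring might look like the following sketch. The reader/writer choices, paths, and task count are placeholder assumptions, not prescribed values.

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import URLFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

# Placeholder paths; any reader that emits Documents with
# metadata["url"] set would work here.
executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/crawl/"),
        URLFilter(),  # cheap URL-level rejection before costly steps
        # ... more expensive content-based filters would follow here ...
        JsonlWriter("data/filtered/"),
    ],
    tasks=4,
)
executor.run()
```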

Code Reference

Source Location

Signature

class URLFilter(BaseFilter):
    name = "Url-filter"
    _requires_dependencies = ["tldextract", "fasteners", ("ahocorasick", "pyahocorasick")]

    def __init__(
        self,
        soft_word_threshold: int = 2,
        extra_domains: Iterable = None,
        extra_urls: Iterable = None,
        banned_words: Iterable = None,
        banned_subwords: Iterable = None,
        soft_banned_words: Iterable = None,
        use_integrated_lists: bool = True,
        exclusion_writer: DiskWriter = None,
    ):
        ...

    def filter(self, document: Document) -> bool | tuple[bool, str]:
        ...

Import

from datatrove.pipeline.filters import URLFilter

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| soft_word_threshold | int | No (default: 2) | Minimum count of soft-banned words in a URL to trigger rejection |
| extra_domains | Iterable | No (default: None) | Additional domains to add to the blocklist |
| extra_urls | Iterable | No (default: None) | Additional full URLs to add to the blocklist |
| banned_words | Iterable | No (default: None) | Additional banned words (exact token match in URL) |
| banned_subwords | Iterable | No (default: None) | Additional banned subwords (substring match via Aho-Corasick) |
| soft_banned_words | Iterable | No (default: None) | Additional soft-banned words (counted, threshold-based rejection) |
| use_integrated_lists | bool | No (default: True) | Whether to load bundled blocklists from datatrove assets |
| exclusion_writer | DiskWriter | No (default: None) | Optional writer to save rejected documents for analysis |

Pipeline Input: A Document object with a url field in its .metadata dictionary. The document must have metadata["url"] set; otherwise an AssertionError is raised.

Outputs

| Name | Type | Description |
|---|---|---|
| bool | bool | True if the document passes all URL checks and should be kept |
| (False, reason) | tuple[bool, str] | False with a reason string if the document is rejected |

Rejection reasons: "domain", "subdomain", "url", "hard_blacklisted", "soft_blacklisted", "blacklisted_subword".

Usage Examples

Default Usage with Integrated Lists

from datatrove.pipeline.filters import URLFilter

# Uses bundled blocklists, default soft_word_threshold=2
url_filter = URLFilter()

Custom Blocklists

from datatrove.pipeline.filters import URLFilter

url_filter = URLFilter(
    extra_domains=["spam-domain.com", "unwanted-site.org"],
    extra_urls=["https://example.com/known-bad-page"],
    banned_words=["casino", "phishing"],
    soft_word_threshold=3,
)

With Exclusion Writer for Rejected Documents

from datatrove.pipeline.filters import URLFilter
from datatrove.pipeline.writers import JsonlWriter

url_filter = URLFilter(
    exclusion_writer=JsonlWriter("s3://my-bucket/rejected-urls/"),
    use_integrated_lists=True,
)

Related Pages

Principle:Huggingface_Datatrove_URL_Filtering
