Implementation: Huggingface Datatrove URLFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Filtering, NLP, Web_Crawling |
| Type | Filter Module |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete filter class that removes documents from a datatrove pipeline based on URL-level signals. Uses domain blocklists, URL blocklists, banned word/subword pattern matching, and soft-banned word counting to reject documents without examining their content.
Description
The URLFilter class extends BaseFilter and implements multi-layered URL-based rejection logic. On initialization, it loads curated blocklists from bundled assets (if use_integrated_lists=True) and builds an Aho-Corasick automaton for efficient substring matching of banned subwords.
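To see why an automaton helps here, the single-pass multi-pattern scan can be sketched in plain Python. This is an illustrative mini implementation, not datatrove's code (datatrove uses the pyahocorasick C extension); the `build_automaton` and `scan` helpers below are our own names.

```python
from collections import deque

def build_automaton(patterns):
    """Build a minimal Aho-Corasick automaton: a character trie plus
    failure links, so every pattern is found in one pass over the text."""
    trie, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]:
                trie.append({}); fail.append(0); out.append([])
                trie[node][ch] = len(trie) - 1
            node = trie[node][ch]
        out[node].append(pat)
    queue = deque(trie[0].values())  # depth-1 nodes keep failure link 0 (root)
    while queue:
        node = queue.popleft()
        for ch, child in trie[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in trie[f]:
                f = fail[f]                  # walk up failure links
            fail[child] = trie[f].get(ch, 0)
            out[child] += out[fail[child]]   # inherit matches ending here
    return trie, fail, out

def scan(text, automaton):
    """Return every pattern occurrence in text, in order of match end."""
    trie, fail, out = automaton
    node, hits = 0, []
    for ch in text:
        while node and ch not in trie[node]:
            node = fail[node]
        node = trie[node].get(ch, 0)
        hits += out[node]
    return hits
```

The payoff is that scanning a URL costs one pass regardless of how many banned subwords are loaded, which is why the filter builds the automaton once at initialization.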
The filter method processes each document through the following rejection cascade:
- Domain check -- Reject if the registered domain is in block_listed_domains
- Subdomain check -- Reject if the FQDN (including subdomains) is in block_listed_domains
- Full URL check -- Reject if the exact URL is in block_listed_url
- Banned word check -- Reject if any token from the URL (split on non-alphanumeric characters) matches banned_words
- Soft-banned word count -- Reject if the count of tokens matching soft_banned_words meets or exceeds soft_word_threshold
- Banned subword scan -- Reject if any entry in banned_subwords appears as a substring of the normalized URL (detected via the Aho-Corasick automaton)
Each rejection returns a reason string (e.g., "domain", "subdomain", "hard_blacklisted") for diagnostic tracking.
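The cascade above can be sketched in plain Python. This is a simplified stand-in, not datatrove's implementation: the real filter uses tldextract for registered-domain parsing and an Aho-Corasick automaton for the subword scan, while the `check_url` helper and the two-label domain approximation here are purely illustrative. The reason strings mirror those reported by the filter.

```python
import re
from urllib.parse import urlsplit

def check_url(url, block_listed_domains, block_listed_url, banned_words,
              banned_subwords, soft_banned_words, soft_word_threshold=2):
    """Simplified sketch of the rejection cascade; blocklist arguments
    are sets. The real code uses tldextract for the registered domain;
    the last two host labels serve as a rough approximation here."""
    host = urlsplit(url).netloc.lower()
    registered_domain = ".".join(host.split(".")[-2:])
    if registered_domain in block_listed_domains:
        return False, "domain"
    if host in block_listed_domains:
        return False, "subdomain"
    if url in block_listed_url:
        return False, "url"
    tokens = set(re.split(r"[^a-z0-9]+", url.lower()))
    if tokens & banned_words:
        return False, "hard_blacklisted"
    if len(tokens & soft_banned_words) >= soft_word_threshold:
        return False, "soft_blacklisted"
    if any(sub in url.lower() for sub in banned_subwords):
        return False, "blacklisted_subword"  # real code: single-pass automaton scan
    return True
```

Note that the cheap exact-lookup checks run first and the substring scan last, so most rejected URLs never reach the most expensive step.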
Usage
Use URLFilter as an early-stage filter in a datatrove pipeline, typically the first filter after reading documents from a web crawl source. It is computationally inexpensive and significantly reduces the volume of data flowing to more expensive downstream steps.
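The ordering advice can be seen with a toy pipeline (all names below are illustrative, not datatrove API): placing a cheap URL predicate in front means the expensive stage only ever runs on surviving documents.

```python
# Toy pipeline: a cheap URL check runs before an expensive content step,
# so the expensive step is only invoked for documents that survive.
expensive_calls = 0

def cheap_url_filter(doc):
    return "spam" not in doc["url"]  # stand-in for URLFilter

def expensive_content_step(doc):
    global expensive_calls
    expensive_calls += 1             # stand-in for extraction / language ID
    return doc

docs = [
    {"url": "https://spam.example/offer"},
    {"url": "https://news.example/story"},
    {"url": "https://spam.example/win"},
]
kept = [expensive_content_step(d) for d in docs if cheap_url_filter(d)]
```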
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/filters/url_filter.py
- Lines: 33-132
Signature
class URLFilter(BaseFilter):
    name = "Url-filter"
    _requires_dependencies = ["tldextract", "fasteners", ("ahocorasick", "pyahocorasick")]

    def __init__(
        self,
        soft_word_threshold: int = 2,
        extra_domains: Iterable = None,
        extra_urls: Iterable = None,
        banned_words: Iterable = None,
        banned_subwords: Iterable = None,
        soft_banned_words: Iterable = None,
        use_integrated_lists: bool = True,
        exclusion_writer: DiskWriter = None,
    ):
        ...

    def filter(self, document: Document) -> bool | tuple[bool, str]:
        ...
Import
from datatrove.pipeline.filters import URLFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| soft_word_threshold | int | No (default: 2) | Minimum count of soft-banned words in a URL to trigger rejection |
| extra_domains | Iterable | No (default: None) | Additional domains to add to the blocklist |
| extra_urls | Iterable | No (default: None) | Additional full URLs to add to the blocklist |
| banned_words | Iterable | No (default: None) | Additional banned words (exact token match in URL) |
| banned_subwords | Iterable | No (default: None) | Additional banned subwords (substring match via Aho-Corasick) |
| soft_banned_words | Iterable | No (default: None) | Additional soft-banned words (counted, threshold-based rejection) |
| use_integrated_lists | bool | No (default: True) | Whether to load bundled blocklists from datatrove assets |
| exclusion_writer | DiskWriter | No (default: None) | Optional writer to save rejected documents for analysis |
Pipeline Input: A Document object whose .metadata dictionary contains a url field. If metadata["url"] is missing, an assertion error is raised.
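This metadata contract can be illustrated with a minimal sketch. The Document stand-in and the `url_of` helper below are ours, written only to show the assertion behavior; the real class is datatrove.data.Document.

```python
from dataclasses import dataclass, field

@dataclass
class Document:  # minimal stand-in for datatrove.data.Document
    text: str
    id: str
    metadata: dict = field(default_factory=dict)

def url_of(doc: Document) -> str:
    # The filter asserts the url metadata field exists before any check runs.
    url = doc.metadata.get("url")
    assert url is not None, "Document must have metadata['url'] set"
    return url
```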
Outputs
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the document passes all URL checks and should be kept |
| (False, reason) | tuple[bool, str] | False paired with a reason string if the document is rejected |
Rejection reasons: "domain", "subdomain", "url", "hard_blacklisted", "soft_blacklisted", "blacklisted_subword".
Usage Examples
Default Usage with Integrated Lists
from datatrove.pipeline.filters import URLFilter
# Uses bundled blocklists, default soft_word_threshold=2
url_filter = URLFilter()
Custom Blocklists
from datatrove.pipeline.filters import URLFilter
url_filter = URLFilter(
extra_domains=["spam-domain.com", "unwanted-site.org"],
extra_urls=["https://example.com/known-bad-page"],
banned_words=["casino", "phishing"],
soft_word_threshold=3,
)
With Exclusion Writer for Rejected Documents
from datatrove.pipeline.filters import URLFilter
from datatrove.pipeline.writers import JsonlWriter
url_filter = URLFilter(
exclusion_writer=JsonlWriter("s3://my-bucket/rejected-urls/"),
use_integrated_lists=True,
)
Related Pages
- Huggingface_Datatrove_URL_Filtering (principle) -- The principle this implementation realizes
- Huggingface_Datatrove_Trafilatura (downstream step) -- HTML text extraction that typically follows URL filtering
- Huggingface_Datatrove_LanguageFilter (downstream filter) -- Language-based filtering applied after URL filtering