Implementation: Huggingface Datatrove URLFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Filtering, NLP, Web_Crawling |
| Type | Filter Module |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete filter class that removes documents from a datatrove pipeline based on URL-level signals. Uses domain blocklists, URL blocklists, banned word/subword pattern matching, and soft-banned word counting to reject documents without examining their content.
Description
The URLFilter class extends BaseFilter and implements multi-layered URL-based rejection logic. On initialization, it loads curated blocklists from bundled assets (if use_integrated_lists=True) and builds an Aho-Corasick automaton for efficient substring matching of banned subwords.
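To see why an automaton helps here, the single-pass multi-pattern scan can be sketched in plain Python. This is an illustrative mini implementation, not datatrove's code (datatrove uses the pyahocorasick C extension); the `build_automaton` and `scan` helpers below are our own names.

```python
from collections import deque

def build_automaton(patterns):
    """Build a minimal Aho-Corasick automaton: a character trie plus
    failure links, so every pattern is found in one pass over the text."""
    trie, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]:
                trie.append({}); fail.append(0); out.append([])
                trie[node][ch] = len(trie) - 1
            node = trie[node][ch]
        out[node].append(pat)
    queue = deque(trie[0].values())  # depth-1 nodes keep failure link 0 (root)
    while queue:
        node = queue.popleft()
        for ch, child in trie[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in trie[f]:
                f = fail[f]                  # walk up failure links
            fail[child] = trie[f].get(ch, 0)
            out[child] += out[fail[child]]   # inherit matches ending here
    return trie, fail, out

def scan(text, automaton):
    """Return every pattern occurrence in text, in order of match end."""
    trie, fail, out = automaton
    node, hits = 0, []
    for ch in text:
        while node and ch not in trie[node]:
            node = fail[node]
        node = trie[node].get(ch, 0)
        hits += out[node]
    return hits
```

The payoff is that scanning a URL costs one pass regardless of how many banned subwords are loaded, which is why the filter builds the automaton once at initialization.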
The filter method processes each document through the following rejection cascade:
- Domain check -- Reject if the registered domain is in block_listed_domains
- Subdomain check -- Reject if the FQDN (including subdomains) is in block_listed_domains
- Full URL check -- Reject if the exact URL is in block_listed_url
- Banned word check -- Reject if any token from the URL (split on non-alphanumeric characters) matches banned_words
- Soft-banned word count -- Reject if the count of tokens matching soft_banned_words meets or exceeds soft_word_threshold
- Banned subword scan -- Reject if any entry in banned_subwords appears as a substring of the normalized URL (detected via the Aho-Corasick automaton)
Each rejection returns a reason string (e.g., "domain", "subdomain", "hard_blacklisted") for diagnostic tracking.
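The cascade above can be sketched in plain Python. This is a simplified stand-in, not datatrove's implementation: the real filter uses tldextract for registered-domain parsing and an Aho-Corasick automaton for the subword scan, while the `check_url` helper and the two-label domain approximation here are purely illustrative. The reason strings mirror those reported by the filter.

```python
import re
from urllib.parse import urlsplit

def check_url(url, block_listed_domains, block_listed_url, banned_words,
              banned_subwords, soft_banned_words, soft_word_threshold=2):
    """Simplified sketch of the rejection cascade; blocklist arguments
    are sets. The real code uses tldextract for the registered domain;
    the last two host labels serve as a rough approximation here."""
    host = urlsplit(url).netloc.lower()
    registered_domain = ".".join(host.split(".")[-2:])
    if registered_domain in block_listed_domains:
        return False, "domain"
    if host in block_listed_domains:
        return False, "subdomain"
    if url in block_listed_url:
        return False, "url"
    tokens = set(re.split(r"[^a-z0-9]+", url.lower()))
    if tokens & banned_words:
        return False, "hard_blacklisted"
    if len(tokens & soft_banned_words) >= soft_word_threshold:
        return False, "soft_blacklisted"
    if any(sub in url.lower() for sub in banned_subwords):
        return False, "blacklisted_subword"  # real code: single-pass automaton scan
    return True
```

Note that the cheap exact-lookup checks run first and the substring scan last, so most rejected URLs never reach the most expensive step.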
Usage
Use URLFilter as an early-stage filter in a datatrove pipeline, typically the first filter after reading documents from a web crawl source. It is computationally inexpensive and significantly reduces the volume of data flowing to more expensive downstream steps.
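The ordering advice can be seen with a toy pipeline (all names below are illustrative, not datatrove API): placing a cheap URL predicate in front means the expensive stage only ever runs on surviving documents.

```python
# Toy pipeline: a cheap URL check runs before an expensive content step,
# so the expensive step is only invoked for documents that survive.
expensive_calls = 0

def cheap_url_filter(doc):
    return "spam" not in doc["url"]  # stand-in for URLFilter

def expensive_content_step(doc):
    global expensive_calls
    expensive_calls += 1             # stand-in for extraction / language ID
    return doc

docs = [
    {"url": "https://spam.example/offer"},
    {"url": "https://news.example/story"},
    {"url": "https://spam.example/win"},
]
kept = [expensive_content_step(d) for d in docs if cheap_url_filter(d)]
```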
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/filters/url_filter.py
- Lines: 33-132
Signature
class URLFilter(BaseFilter):
    name = "Url-filter"
    _requires_dependencies = ["tldextract", "fasteners", ("ahocorasick", "pyahocorasick")]

    def __init__(
        self,
        soft_word_threshold: int = 2,
        extra_domains: Iterable = None,
        extra_urls: Iterable = None,
        banned_words: Iterable = None,
        banned_subwords: Iterable = None,
        soft_banned_words: Iterable = None,
        use_integrated_lists: bool = True,
        exclusion_writer: DiskWriter = None,
    ):
        ...

    def filter(self, document: Document) -> bool | tuple[bool, str]:
        ...
Import
from datatrove.pipeline.filters import URLFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| soft_word_threshold | int | No (default: 2) | Minimum count of soft-banned words in a URL to trigger rejection |
| extra_domains | Iterable | No (default: None) | Additional domains to add to the blocklist |
| extra_urls | Iterable | No (default: None) | Additional full URLs to add to the blocklist |
| banned_words | Iterable | No (default: None) | Additional banned words (exact token match in URL) |
| banned_subwords | Iterable | No (default: None) | Additional banned subwords (substring match via Aho-Corasick) |
| soft_banned_words | Iterable | No (default: None) | Additional soft-banned words (counted, threshold-based rejection) |
| use_integrated_lists | bool | No (default: True) | Whether to load bundled blocklists from datatrove assets |
| exclusion_writer | DiskWriter | No (default: None) | Optional writer to save rejected documents for analysis |
Pipeline Input: A Document object whose .metadata dictionary contains a url field. If metadata["url"] is missing, an assertion error is raised.
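This metadata contract can be illustrated with a minimal sketch. The Document stand-in and the `url_of` helper below are ours, written only to show the assertion behavior; the real class is datatrove.data.Document.

```python
from dataclasses import dataclass, field

@dataclass
class Document:  # minimal stand-in for datatrove.data.Document
    text: str
    id: str
    metadata: dict = field(default_factory=dict)

def url_of(doc: Document) -> str:
    # The filter asserts the url metadata field exists before any check runs.
    url = doc.metadata.get("url")
    assert url is not None, "Document must have metadata['url'] set"
    return url
```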
Outputs
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the document passes all URL checks and should be kept |
| (False, reason) | tuple[bool, str] | False paired with a reason string if the document is rejected |
Rejection reasons: "domain", "subdomain", "url", "hard_blacklisted", "soft_blacklisted", "blacklisted_subword".
Usage Examples
Default Usage with Integrated Lists
from datatrove.pipeline.filters import URLFilter
# Uses bundled blocklists, default soft_word_threshold=2
url_filter = URLFilter()
Custom Blocklists
from datatrove.pipeline.filters import URLFilter
url_filter = URLFilter(
extra_domains=["spam-domain.com", "unwanted-site.org"],
extra_urls=["https://example.com/known-bad-page"],
banned_words=["casino", "phishing"],
soft_word_threshold=3,
)
With Exclusion Writer for Rejected Documents
from datatrove.pipeline.filters import URLFilter
from datatrove.pipeline.writers import JsonlWriter
url_filter = URLFilter(
exclusion_writer=JsonlWriter("s3://my-bucket/rejected-urls/"),
use_integrated_lists=True,
)
Related Pages
- Huggingface_Datatrove_URL_Filtering (principle) -- The principle this implementation realizes
- Huggingface_Datatrove_Trafilatura (downstream step) -- HTML text extraction that typically follows URL filtering
- Huggingface_Datatrove_LanguageFilter (downstream filter) -- Language-based filtering applied after URL filtering