Implementation:NVIDIA NeMo Curator UrlRemover
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, Text_Cleaning, URL_Processing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
UrlRemover is a document modifier that removes all URLs (matching http://, https://, or www. patterns) from document text using a case-insensitive regular expression.
Description
UrlRemover extends DocumentModifier and uses a precompiled case-insensitive regex pattern to detect and strip URLs from text. The regex https?://\S+|www\.\S+ matches two patterns:
- Protocol-based URLs: Strings starting with http:// or https:// followed by one or more non-whitespace characters.
- www-based URLs: Strings starting with www. followed by one or more non-whitespace characters.
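The regex is compiled once at module load time with the re.IGNORECASE flag, so it is not recompiled on every call. In modify_document(), every match is replaced with an empty string, which removes the URL entirely. Because \S+ runs to the next whitespace character, punctuation attached to a URL (such as a trailing comma or closing parenthesis) is removed along with it. Whitespace around the URL is left intact, so removing a URL from between two words leaves two consecutive spaces.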
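The behavior described above can be sketched with the standard library alone. This is a minimal stand-in, not the library's source: UrlRemoverSketch is a hypothetical name, and it does not inherit from DocumentModifier so that it runs without NeMo Curator installed.

```python
import re

# Same pattern as described above, compiled once at import time.
URL_REGEX = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)


class UrlRemoverSketch:
    """Hypothetical stand-in; the real UrlRemover extends DocumentModifier."""

    def modify_document(self, text: str) -> str:
        # Replace every match with an empty string; nothing else is touched.
        return URL_REGEX.sub("", text)


cleaned = UrlRemoverSketch().modify_document("Visit https://example.com for more info")
print(repr(cleaned))  # 'Visit  for more info' (the spaces on both sides of the URL remain)
```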
Usage
Use UrlRemover when cleaning web-scraped text that contains URLs which are not useful for downstream NLP tasks. This is a common preprocessing step in text normalization pipelines for training data preparation.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/modifiers/url_remover.py
- Lines: 1-32
Signature
class UrlRemover(DocumentModifier):
    def __init__(self):
        ...

    def modify_document(self, text: str) -> str:
        ...
Module-Level Constants
URL_REGEX = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)
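To see what this pattern does and does not match, the following sketch runs it against a made-up sample string using only Python's re module. The sample text is an assumption chosen to exercise edge cases: trailing punctuation, an uppercase www. prefix, and a bare domain with neither prefix.

```python
import re

URL_REGEX = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)

# Illustrative sample text (not from the library's tests).
sample = "Docs at https://example.com/a?b=1, mirror (WWW.Example.org). Plain example.com stays."

matches = URL_REGEX.findall(sample)
print(matches)
# ['https://example.com/a?b=1,', 'WWW.Example.org).']

stripped = URL_REGEX.sub("", sample)
print(repr(stripped))
# 'Docs at  mirror ( Plain example.com stays.'
```

Each match extends to the next whitespace character, so the comma and the closing parenthesis with its period are removed together with the URL, while the opening parenthesis before WWW. is kept. A bare domain such as example.com is not matched because it has neither a protocol nor a www. prefix.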
Import
from nemo_curator.stages.text.modifiers.url_remover import UrlRemover
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | The document text passed to modify_document(). |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | str | The document text with all matching URLs removed (replaced with empty strings). |
Usage Examples
Basic Usage
from nemo_curator.stages.text.modifiers.url_remover import UrlRemover
modifier = UrlRemover()
text = "Visit https://example.com for more info or check www.docs.example.com"
result = modifier.modify_document(text)
# Returns "Visit  for more info or check "
# (two spaces after "Visit" and one trailing space: whitespace around each removed URL is kept)
Mixed Content
from nemo_curator.stages.text.modifiers.url_remover import UrlRemover
modifier = UrlRemover()
text = """Article about Python.
See http://docs.python.org/3/ for documentation.
Also check HTTP://EXAMPLE.COM for case-insensitive matching.
No URL on this line."""
result = modifier.modify_document(text)
# All URLs (including case variations) are removed
# Lines without URLs remain unchanged
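To make the outcome of the Mixed Content example concrete, the sketch below applies the documented pattern directly with Python's re module as a stand-in for modifier.modify_document(), so it runs without NeMo Curator installed. It assumes the modifier performs nothing beyond the substitution described on this page.

```python
import re

# Stand-in for UrlRemover: the pattern documented as URL_REGEX on this page.
URL_REGEX = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)

text = """Article about Python.
See http://docs.python.org/3/ for documentation.
Also check HTTP://EXAMPLE.COM for case-insensitive matching.
No URL on this line."""

result_lines = URL_REGEX.sub("", text).splitlines()
for line in result_lines:
    print(repr(line))
# 'Article about Python.'
# 'See  for documentation.'
# 'Also check  for case-insensitive matching.'
# 'No URL on this line.'
```

The double spaces on the second and third lines are where the URLs used to be. Collapsing them is left to a later normalization step.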
Related Pages
- NVIDIA_NeMo_Curator_DocumentModifier — Base class that this modifier extends
- NVIDIA_NeMo_Curator_MarkdownRemover — Complementary modifier that strips Markdown link syntax (preserving the URL)
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base