Implementation:NVIDIA NeMo Curator CommonCrawl URLGenerator
| Knowledge Sources | |
|---|---|
| Domains | URL Generation, Web Crawl, Common Crawl |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
BaseCommonCrawlUrlGenerator, MainCommonCrawlUrlGenerator, and NewsCommonCrawlUrlGenerator generate lists of WARC file URLs for Common Crawl data, supporting both the Main crawl (by ISO-week) and the News crawl (by month).
Description
This module defines an abstract base class and two concrete implementations for generating URLs pointing to Common Crawl WARC files across specified date ranges.
BaseCommonCrawlUrlGenerator is an abstract dataclass extending URLGenerator that provides:
- Date range parsing and validation from snapshot strings
- The generate_data_urls method, which fetches warc.paths.gz index files, decompresses them with zlib, and extracts individual WARC file URLs
- Optional URL limiting via the limit parameter
- Future date detection with automatic adjustment to today's date
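The decompress-and-extract step can be sketched as follows. This is a minimal illustration rather than the module's actual code: the function name extract_warc_urls is hypothetical, the index bytes are built locally instead of being downloaded, and the sample WARC paths are made up. It assumes each line of the decompressed index is a path relative to data_prefix.

```python
import gzip
import zlib
from typing import List, Optional


def extract_warc_urls(
    compressed_index: bytes,
    data_prefix: str = "https://data.commoncrawl.org",
    limit: Optional[int] = None,
) -> List[str]:
    """Turn the raw bytes of one warc.paths.gz file into full WARC URLs.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    text = zlib.decompress(compressed_index, zlib.MAX_WBITS | 16).decode("utf-8")
    # Assumption: one relative WARC path per non-empty line.
    urls = [f"{data_prefix}/{line.strip()}" for line in text.splitlines() if line.strip()]
    # limit=None means "return everything".
    return urls if limit is None else urls[:limit]


# Stand-in for a downloaded index file; both paths are invented.
sample_index = gzip.compress(
    b"crawl-data/CC-MAIN-2024-42/segments/0001/warc/a.warc.gz\n"
    b"crawl-data/CC-MAIN-2024-42/segments/0001/warc/b.warc.gz\n"
)
print(extract_warc_urls(sample_index, limit=1))
```

Passing the bytes in, rather than fetching inside the function, keeps the sketch runnable offline; the real method performs the HTTP request itself.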
MainCommonCrawlUrlGenerator handles the main Common Crawl corpus:
- Uses ISO-week format for snapshot strings (YYYY-WW, e.g., "2020-50")
- Fetches the Common Crawl index from index.commoncrawl.org/collinfo.json (cached via @cached_property)
- Filters snapshots by date range, skipping unsupported pre-2013 archives
- Generates paths in the format crawl-data/CC-MAIN-YYYY-WW/warc.paths.gz
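The filtering described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the class's real implementation: main_path_urls and week_to_date are hypothetical names, the index entries are a hand-written sample standing in for the parsed collinfo.json response, each entry is assumed to carry an "id" such as CC-MAIN-2024-42, and a snapshot is compared by the Monday of its ISO week.

```python
import re
from datetime import date
from typing import Dict, List

# Matches only week-based IDs; older IDs such as "CC-MAIN-2012" do not match.
SNAPSHOT_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def week_to_date(snapshot: str) -> date:
    """Map a 'YYYY-WW' string to the Monday of that ISO week."""
    year, week = (int(part) for part in snapshot.split("-"))
    return date.fromisocalendar(year, week, 1)


def main_path_urls(index_entries: List[Dict[str, str]], start: str, end: str) -> List[str]:
    """Return warc.paths.gz paths for snapshots inside [start, end]."""
    start_date, end_date = week_to_date(start), week_to_date(end)
    paths = []
    for entry in index_entries:
        match = SNAPSHOT_ID.match(entry["id"])
        # Skip IDs in the older naming scheme and anything before 2013.
        if match is None or int(match.group(1)) < 2013:
            continue
        snapshot_date = date.fromisocalendar(int(match.group(1)), int(match.group(2)), 1)
        if start_date <= snapshot_date <= end_date:
            paths.append(f"crawl-data/{entry['id']}/warc.paths.gz")
    return paths


# Hand-written sample in place of the fetched index (illustrative only).
sample_entries = [
    {"id": "CC-MAIN-2024-51"},
    {"id": "CC-MAIN-2024-46"},
    {"id": "CC-MAIN-2024-42"},
    {"id": "CC-MAIN-2024-38"},
    {"id": "CC-MAIN-2012"},
    {"id": "CC-MAIN-2009-2010"},
]
print(main_path_urls(sample_entries, "2024-40", "2024-50"))
```

The output order simply follows the order of the input entries, so a newest-first index yields newest-first paths.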
NewsCommonCrawlUrlGenerator handles the CC-NEWS corpus:
- Uses monthly format for snapshot strings (YYYY-MM, e.g., "2020-08")
- Generates monthly paths starting from the earliest available news data (2016-08)
- Generates paths in the format crawl-data/CC-NEWS/YYYY/MM/warc.paths.gz
- Returns URLs in reverse chronological order to match Main crawl behavior
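A rough sketch of the monthly enumeration follows. The function name news_path_urls and the explicit today argument are hypothetical (the argument is there so the example is deterministic), and clamping a future end date to the current month is an assumption about how the "adjustment to today's date" behaves.

```python
from datetime import date
from typing import List

# Earliest month with CC-NEWS data, as stated above.
EARLIEST_NEWS = date(2016, 8, 1)


def news_path_urls(start: str, end: str, today: date) -> List[str]:
    """List monthly CC-NEWS index paths for [start, end], newest first."""
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))

    # Clamp the range: nothing before 2016-08, nothing after the current month.
    first = max(date(start_year, start_month, 1), EARLIEST_NEWS)
    last = min(date(end_year, end_month, 1), today.replace(day=1))

    paths = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        paths.append(f"crawl-data/CC-NEWS/{year}/{month:02d}/warc.paths.gz")
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

    # Reverse so the most recent month comes first.
    return paths[::-1]


print(news_path_urls("2016-06", "2016-09", today=date(2024, 1, 15)))
```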
Usage
Use these classes to discover WARC file URLs for a given date range before downloading. They are typically used as the URL generation step in the Common Crawl download pipeline.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/common_crawl/url_generation.py
- Lines: 1-255
Signature
@dataclass
class BaseCommonCrawlUrlGenerator(URLGenerator, ABC):
    start_snapshot_str: str
    end_snapshot_str: str
    data_prefix: str = "https://data.commoncrawl.org"
    limit: int | None = None
    def generate_urls(self) -> list[str]: ...
    def generate_data_urls(self, path_urls: str | list[str] | None = None) -> list[str]: ...
    def generate_path_urls(self) -> list[str]: ...  # abstract
@dataclass
class MainCommonCrawlUrlGenerator(BaseCommonCrawlUrlGenerator):
    index_prefix: str = "https://index.commoncrawl.org"
    def generate_path_urls(self) -> list[str]: ...
@dataclass
class NewsCommonCrawlUrlGenerator(BaseCommonCrawlUrlGenerator):
    def generate_path_urls(self) -> list[str]: ...
Import
from nemo_curator.stages.text.download.common_crawl.url_generation import (
    MainCommonCrawlUrlGenerator,
    NewsCommonCrawlUrlGenerator,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| start_snapshot_str | str | Yes | Start of the date range. Format is YYYY-WW for Main crawl or YYYY-MM for News crawl |
| end_snapshot_str | str | Yes | End of the date range. Format is YYYY-WW for Main crawl or YYYY-MM for News crawl |
| data_prefix | str | No | Base URL prefix for data downloads. Defaults to "https://data.commoncrawl.org" |
| limit | int or None | No | Maximum number of WARC URLs to return. If None (default), returns all URLs |
| index_prefix | str | No | (MainCommonCrawlUrlGenerator only) Base URL for the CC index API. Defaults to "https://index.commoncrawl.org" |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | list[str] | From generate_urls(): a list of fully-qualified URLs pointing to individual WARC files |
Usage Examples
Main Common Crawl URLs
from nemo_curator.stages.text.download.common_crawl.url_generation import MainCommonCrawlUrlGenerator
# Generate WARC URLs for Main crawl snapshots from week 40 to week 50 of 2024
generator = MainCommonCrawlUrlGenerator(
    start_snapshot_str="2024-40",
    end_snapshot_str="2024-50",
)
urls = generator.generate_urls()
print(f"Found {len(urls)} WARC files")
News Common Crawl URLs with Limit
from nemo_curator.stages.text.download.common_crawl.url_generation import NewsCommonCrawlUrlGenerator
# Generate WARC URLs for News crawl from August to December 2024, limit to 100
generator = NewsCommonCrawlUrlGenerator(
    start_snapshot_str="2024-08",
    end_snapshot_str="2024-12",
    limit=100,
)
urls = generator.generate_urls()
print(f"Found {len(urls)} WARC files (limited to 100)")
Using generate_data_urls with Pre-fetched Path URLs
from nemo_curator.stages.text.download.common_crawl.url_generation import MainCommonCrawlUrlGenerator
generator = MainCommonCrawlUrlGenerator(
    start_snapshot_str="2024-40",
    end_snapshot_str="2024-50",
)
# First get the path URLs (warc.paths.gz files)
path_urls = generator.generate_path_urls()
# Then resolve them to individual WARC file URLs
warc_urls = generator.generate_data_urls(path_urls)
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_CommonCrawl_Downloader - Downloads the WARC files discovered by these generators
- NVIDIA_NeMo_Curator_WARC_Iterator - Iterates over downloaded WARC files
- NVIDIA_NeMo_Curator_CommonCrawlDownloadExtractStage - Orchestrates the full Common Crawl pipeline