Implementation:NVIDIA NeMo Curator CommonCrawl URLGenerator

Knowledge Sources	NVIDIA NeMo Curator
Domains	URL Generation, Web Crawl, Common Crawl
Last Updated	2026-02-14 00:00 GMT

Overview

BaseCommonCrawlUrlGenerator, MainCommonCrawlUrlGenerator, and NewsCommonCrawlUrlGenerator generate lists of WARC file URLs for Common Crawl data. They cover both the Main crawl (snapshots identified by ISO week) and the News crawl (snapshots identified by month).

Description

This module defines an abstract base class and two concrete implementations for generating URLs pointing to Common Crawl WARC files across specified date ranges.

BaseCommonCrawlUrlGenerator is an abstract dataclass extending URLGenerator that provides:

Date range parsing and validation from snapshot strings
The generate_data_urls method that fetches warc.paths.gz index files, decompresses them with zlib, and extracts individual WARC file URLs
Optional URL limiting via the limit parameter
Future date detection with automatic adjustment to today's date
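
The decompression and limiting steps listed above can be sketched offline as follows. This is a minimal sketch under stated assumptions, not the library's implementation: the helper name paths_gz_to_urls is hypothetical, the index is assumed to be a gzip-compressed, newline-separated list of relative paths, and the sample paths are made up for illustration.

```python
import gzip
import zlib
from typing import List, Optional

DATA_PREFIX = "https://data.commoncrawl.org"


def paths_gz_to_urls(
    payload: bytes,
    data_prefix: str = DATA_PREFIX,
    limit: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: turn the bytes of a warc.paths.gz file into full URLs."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    text = zlib.decompress(payload, 16 + zlib.MAX_WBITS).decode("utf-8")
    # One relative path per line; blank lines are ignored.
    urls = [f"{data_prefix}/{line.strip()}" for line in text.splitlines() if line.strip()]
    # limit=None keeps everything; otherwise keep only the first `limit` URLs.
    return urls if limit is None else urls[:limit]


# Stand-in for a downloaded index file (illustrative paths, not real archive names).
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2020-50/segments/a/warc/x.warc.gz\n"
    b"crawl-data/CC-MAIN-2020-50/segments/a/warc/y.warc.gz\n"
)
print(paths_gz_to_urls(sample, limit=1))
```

Passing 16 + zlib.MAX_WBITS is what lets zlib read gzip-framed data directly, so no separate gzip file object is needed.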

MainCommonCrawlUrlGenerator handles the main Common Crawl corpus:

Uses ISO-week format for snapshot strings (YYYY-WW, e.g., "2020-50")
Fetches the Common Crawl index from index.commoncrawl.org/collinfo.json (cached via @cached_property)
Filters snapshots by date range, skipping unsupported pre-2013 archives
Generates paths in the format crawl-data/CC-MAIN-YYYY-WW/warc.paths.gz
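
The filtering step described above might look roughly like the sketch below. It is illustrative only: the function names are hypothetical, the input is assumed to be a list of entries carrying an "id" such as "CC-MAIN-2020-50" (the shape collinfo.json is assumed to have), and the sample list is hand-written rather than fetched from the index.

```python
from datetime import date
from typing import Dict, List


def iso_week_start(snapshot: str) -> date:
    """Parse a 'YYYY-WW' string into the Monday of that ISO week."""
    year, week = snapshot.split("-")
    return date.fromisocalendar(int(year), int(week), 1)


def main_path_urls(
    collections: List[Dict[str, str]],
    start: str,
    end: str,
    data_prefix: str = "https://data.commoncrawl.org",
) -> List[str]:
    """Hypothetical helper: keep snapshots inside [start, end] and build index URLs."""
    lo, hi = iso_week_start(start), iso_week_start(end)
    urls = []
    for entry in collections:
        parts = entry["id"].split("-")  # e.g. ["CC", "MAIN", "2020", "50"]
        # Skip ids that are not CC-MAIN-YYYY-WW or that predate 2013.
        if len(parts) != 4 or int(parts[2]) < 2013:
            continue
        snapshot_date = date.fromisocalendar(int(parts[2]), int(parts[3]), 1)
        if lo <= snapshot_date <= hi:
            urls.append(f"{data_prefix}/crawl-data/{entry['id']}/warc.paths.gz")
    return urls


# Hand-written stand-in for the index listing, newest first.
sample_index = [
    {"id": "CC-MAIN-2020-50"},
    {"id": "CC-MAIN-2020-45"},
    {"id": "CC-MAIN-2020-40"},
    {"id": "CC-MAIN-2012"},
]
print(main_path_urls(sample_index, "2020-45", "2020-50"))
```

Comparing dates rather than raw strings keeps ranges that cross a year boundary (for example "2019-51" to "2020-05") ordered correctly.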

NewsCommonCrawlUrlGenerator handles the CC-NEWS corpus:

Uses monthly format for snapshot strings (YYYY-MM, e.g., "2020-08")
Generates monthly paths starting from the earliest available news data (2016-08)
Generates paths in the format crawl-data/CC-NEWS/YYYY/MM/warc.paths.gz
Returns URLs in reverse chronological order to match Main crawl behavior
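
The monthly enumeration above can be sketched as follows. This is an assumption-laden illustration rather than the library's code: the helper name news_path_urls and its today parameter are hypothetical, and silently clamping the range to 2016-08 and to the current month is one plausible reading of the behavior described, not a confirmed one.

```python
from datetime import date
from typing import List, Optional

EARLIEST_NEWS = (2016, 8)  # first month with CC-NEWS data, per the description above


def news_path_urls(
    start: str,
    end: str,
    data_prefix: str = "https://data.commoncrawl.org",
    today: Optional[date] = None,
) -> List[str]:
    """Hypothetical helper: list monthly CC-NEWS index URLs, newest first."""
    today = today or date.today()
    start_year, start_month = (int(p) for p in start.split("-"))
    end_year, end_month = (int(p) for p in end.split("-"))
    # Assumed behavior: clamp to the earliest available month and to the current month.
    first = max((start_year, start_month), EARLIEST_NEWS)
    last = min((end_year, end_month), (today.year, today.month))

    urls = []
    year, month = first
    while (year, month) <= last:
        urls.append(f"{data_prefix}/crawl-data/CC-NEWS/{year}/{month:02d}/warc.paths.gz")
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    # Reverse so the newest month comes first.
    return urls[::-1]


print(news_path_urls("2020-11", "2021-01", today=date(2024, 1, 1)))
```

Injecting today as a parameter keeps the sketch deterministic in tests; the real classes take no such argument.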

Usage

Use these classes to discover WARC file URLs for a given date range before downloading. They are typically used as the URL generation step in the Common Crawl download pipeline.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/common_crawl/url_generation.py
Lines: 1-255

Signature

@dataclass
class BaseCommonCrawlUrlGenerator(URLGenerator, ABC):
    start_snapshot_str: str
    end_snapshot_str: str
    data_prefix: str = "https://data.commoncrawl.org"
    limit: int | None = None

    def generate_urls(self) -> list[str]: ...
    def generate_data_urls(self, path_urls: str | list[str] | None = None) -> list[str]: ...
    def generate_path_urls(self) -> list[str]: ...  # abstract

@dataclass
class MainCommonCrawlUrlGenerator(BaseCommonCrawlUrlGenerator):
    index_prefix: str = "https://index.commoncrawl.org"
    def generate_path_urls(self) -> list[str]: ...

@dataclass
class NewsCommonCrawlUrlGenerator(BaseCommonCrawlUrlGenerator):
    def generate_path_urls(self) -> list[str]: ...

Import

from nemo_curator.stages.text.download.common_crawl.url_generation import (
    MainCommonCrawlUrlGenerator,
    NewsCommonCrawlUrlGenerator,
)

I/O Contract

Inputs

Name	Type	Required	Description
start_snapshot_str	str	Yes	Start of the date range. Format is YYYY-WW for Main crawl or YYYY-MM for News crawl
end_snapshot_str	str	Yes	End of the date range. Format is YYYY-WW for Main crawl or YYYY-MM for News crawl
data_prefix	str	No	Base URL prefix for data downloads. Defaults to "https://data.commoncrawl.org"
limit	int or None	No	Maximum number of WARC URLs to return. If None (default), returns all URLs
index_prefix	str	No	(MainCommonCrawlUrlGenerator only) Base URL for the CC index API. Defaults to "https://index.commoncrawl.org"

Outputs

Name	Type	Description
return value	list[str]	Returned by generate_urls(): a list of fully qualified URLs, each pointing to an individual WARC file

Usage Examples

Main Common Crawl URLs

from nemo_curator.stages.text.download.common_crawl.url_generation import MainCommonCrawlUrlGenerator

# Generate WARC URLs for Main crawl snapshots from week 40 to week 50 of 2024
generator = MainCommonCrawlUrlGenerator(
    start_snapshot_str="2024-40",
    end_snapshot_str="2024-50",
)
urls = generator.generate_urls()
print(f"Found {len(urls)} WARC files")

News Common Crawl URLs with Limit

from nemo_curator.stages.text.download.common_crawl.url_generation import NewsCommonCrawlUrlGenerator

# Generate WARC URLs for News crawl from August to December 2024, limit to 100
generator = NewsCommonCrawlUrlGenerator(
    start_snapshot_str="2024-08",
    end_snapshot_str="2024-12",
    limit=100,
)
urls = generator.generate_urls()
print(f"Found {len(urls)} WARC files (limited to 100)")

Using generate_data_urls with Pre-fetched Path URLs

from nemo_curator.stages.text.download.common_crawl.url_generation import MainCommonCrawlUrlGenerator

generator = MainCommonCrawlUrlGenerator(
    start_snapshot_str="2024-40",
    end_snapshot_str="2024-50",
)

# First get the path URLs (warc.paths.gz files)
path_urls = generator.generate_path_urls()

# Then resolve them to individual WARC file URLs
warc_urls = generator.generate_data_urls(path_urls)

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_CommonCrawl_Downloader - Downloads the WARC files discovered by these generators
NVIDIA_NeMo_Curator_WARC_Iterator - Iterates over downloaded WARC files
NVIDIA_NeMo_Curator_CommonCrawlDownloadExtractStage - Orchestrates the full Common Crawl pipeline
