Implementation:NVIDIA NeMo Curator CommonCrawl URLGenerator

From Leeroopedia
Knowledge Sources
Domains URL Generation, Web Crawl, Common Crawl
Last Updated 2026-02-14 00:00 GMT

Overview

BaseCommonCrawlUrlGenerator, MainCommonCrawlUrlGenerator, and NewsCommonCrawlUrlGenerator generate lists of WARC file URLs for Common Crawl data. They support both the Main crawl (snapshots identified by ISO week) and the News crawl (snapshots identified by month).

Description

This module defines an abstract base class and two concrete implementations for generating URLs pointing to Common Crawl WARC files across specified date ranges.

BaseCommonCrawlUrlGenerator is an abstract dataclass extending URLGenerator that provides:

  • Date range parsing and validation from snapshot strings
  • The generate_data_urls method that fetches warc.paths.gz index files, decompresses them with zlib, and extracts individual WARC file URLs
  • Optional URL limiting via the limit parameter
  • Future date detection with automatic adjustment to today's date
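The decompression and limiting steps listed above can be sketched without any network access. This is a minimal illustration, not the library's code: the helper name extract_warc_urls, the synthetic payload, and the two relative paths are all hypothetical, and the real generate_data_urls may handle errors and prefixes differently.

```python
import gzip
import zlib
from typing import Optional


def extract_warc_urls(compressed: bytes, data_prefix: str, limit: Optional[int] = None) -> list[str]:
    """Decompress a gzip-encoded paths listing and build fully-qualified URLs."""
    # MAX_WBITS | 16 makes zlib expect a gzip header and trailer.
    text = zlib.decompress(compressed, zlib.MAX_WBITS | 16).decode("utf-8")
    prefix = data_prefix.rstrip("/")
    # Each non-empty line is a path relative to the data prefix.
    urls = [f"{prefix}/{line.strip()}" for line in text.splitlines() if line.strip()]
    # A limit of None means "return everything".
    return urls if limit is None else urls[:limit]


# Synthetic payload standing in for a downloaded warc.paths.gz file.
# The relative paths below are made up for illustration.
payload = gzip.compress(b"crawl-data/example/a.warc.gz\ncrawl-data/example/b.warc.gz\n")
print(extract_warc_urls(payload, "https://data.commoncrawl.org", limit=1))
```

Applying the limit after building the full list keeps the sketch simple; stopping early while iterating over lines would avoid building URLs that are then discarded.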

MainCommonCrawlUrlGenerator handles the main Common Crawl corpus:

  • Uses ISO-week format for snapshot strings (YYYY-WW, e.g., "2020-50")
  • Fetches the list of available crawls from index.commoncrawl.org/collinfo.json and caches the result via @cached_property
  • Filters snapshots by date range, skipping unsupported pre-2013 archives
  • Generates paths in the format crawl-data/CC-MAIN-YYYY-WW/warc.paths.gz
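The snapshot filtering described above might look roughly like the following sketch. It assumes snapshot ids of the form CC-MAIN-YYYY-WW and treats anything else (such as the older yearly archives) as unsupported. The function names parse_iso_week, select_snapshots, and paths_url are hypothetical and are not taken from the module.

```python
import re
from datetime import date

# Weekly snapshot ids look like "CC-MAIN-2020-50"; older archives do not match.
_WEEKLY_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def parse_iso_week(snapshot: str) -> date:
    """Convert a 'YYYY-WW' string to the Monday of that ISO week."""
    year, week = (int(part) for part in snapshot.split("-"))
    return date.fromisocalendar(year, week, 1)


def select_snapshots(snapshot_ids: list[str], start: str, end: str) -> list[str]:
    """Keep weekly snapshot ids whose ISO week falls within [start, end]."""
    first, last = parse_iso_week(start), parse_iso_week(end)
    selected = []
    for snapshot_id in snapshot_ids:
        match = _WEEKLY_ID.match(snapshot_id)
        if match is None:
            continue  # skip ids without a YYYY-WW suffix, e.g. "CC-MAIN-2012"
        monday = date.fromisocalendar(int(match.group(1)), int(match.group(2)), 1)
        if first <= monday <= last:
            selected.append(snapshot_id)
    return selected


def paths_url(data_prefix: str, snapshot_id: str) -> str:
    """Build the warc.paths.gz URL for one snapshot id."""
    return f"{data_prefix}/crawl-data/{snapshot_id}/warc.paths.gz"


# Sample ids for illustration, listed newest first.
ids = ["CC-MAIN-2020-50", "CC-MAIN-2020-45", "CC-MAIN-2020-40", "CC-MAIN-2012"]
for snapshot_id in select_snapshots(ids, "2020-45", "2020-50"):
    print(paths_url("https://data.commoncrawl.org", snapshot_id))
```

Comparing dates rather than raw strings keeps the range check correct across year boundaries, and the input order is preserved in the result.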

NewsCommonCrawlUrlGenerator handles the CC-NEWS corpus:

  • Uses monthly format for snapshot strings (YYYY-MM, e.g., "2020-08")
  • Generates monthly paths starting from the earliest available news data (2016-08)
  • Generates paths in the format crawl-data/CC-NEWS/YYYY/MM/warc.paths.gz
  • Returns URLs in reverse chronological order to match Main crawl behavior
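As a rough sketch of the monthly path generation above, the function below walks from the start month to the end month and then reverses the result. The name news_path_urls is hypothetical, and clamping an early start to 2016-08 is an assumption made here; the actual class may instead warn or raise.

```python
EARLIEST_NEWS = (2016, 8)  # earliest month with CC-NEWS data, per the description above


def news_path_urls(data_prefix: str, start: str, end: str) -> list[str]:
    """Build monthly CC-NEWS warc.paths.gz URLs, newest month first."""
    start_ym = tuple(int(part) for part in start.split("-"))
    end_ym = tuple(int(part) for part in end.split("-"))
    # Assumption: an earlier start is clamped to the first available month.
    year, month = max(start_ym, EARLIEST_NEWS)

    urls = []
    while (year, month) <= end_ym:
        urls.append(f"{data_prefix}/crawl-data/CC-NEWS/{year}/{month:02d}/warc.paths.gz")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1

    urls.reverse()  # reverse chronological order
    return urls


for url in news_path_urls("https://data.commoncrawl.org", "2020-11", "2021-01"):
    print(url)
```

Tuple comparison on (year, month) pairs avoids any day-of-month arithmetic, which is enough for a month-granular range.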

Usage

Use these classes to discover WARC file URLs for a given date range before downloading. They are typically used as the URL generation step in the Common Crawl download pipeline.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/common_crawl/url_generation.py
  • Lines: 1-255

Signature

@dataclass
class BaseCommonCrawlUrlGenerator(URLGenerator, ABC):
    start_snapshot_str: str
    end_snapshot_str: str
    data_prefix: str = "https://data.commoncrawl.org"
    limit: int | None = None

    def generate_urls(self) -> list[str]: ...
    def generate_data_urls(self, path_urls: str | list[str] | None = None) -> list[str]: ...
    def generate_path_urls(self) -> list[str]: ...  # abstract

@dataclass
class MainCommonCrawlUrlGenerator(BaseCommonCrawlUrlGenerator):
    index_prefix: str = "https://index.commoncrawl.org"
    def generate_path_urls(self) -> list[str]: ...

@dataclass
class NewsCommonCrawlUrlGenerator(BaseCommonCrawlUrlGenerator):
    def generate_path_urls(self) -> list[str]: ...

Import

from nemo_curator.stages.text.download.common_crawl.url_generation import (
    MainCommonCrawlUrlGenerator,
    NewsCommonCrawlUrlGenerator,
)

I/O Contract

Inputs

Name | Type | Required | Description
start_snapshot_str | str | Yes | Start of the date range. Format is YYYY-WW for Main crawl or YYYY-MM for News crawl
end_snapshot_str | str | Yes | End of the date range. Format is YYYY-WW for Main crawl or YYYY-MM for News crawl
data_prefix | str | No | Base URL prefix for data downloads. Defaults to "https://data.commoncrawl.org"
limit | int or None | No | Maximum number of WARC URLs to return. If None (default), returns all URLs
index_prefix | str | No | (MainCommonCrawlUrlGenerator only) Base URL for the CC index API. Defaults to "https://index.commoncrawl.org"

Outputs

Name | Type | Description
return value | list[str] | From generate_urls(): a list of fully-qualified URLs pointing to individual WARC files

Usage Examples

Main Common Crawl URLs

from nemo_curator.stages.text.download.common_crawl.url_generation import MainCommonCrawlUrlGenerator

# Generate WARC URLs for Main crawl snapshots from week 40 to week 50 of 2024
generator = MainCommonCrawlUrlGenerator(
    start_snapshot_str="2024-40",
    end_snapshot_str="2024-50",
)
urls = generator.generate_urls()
print(f"Found {len(urls)} WARC files")

News Common Crawl URLs with Limit

from nemo_curator.stages.text.download.common_crawl.url_generation import NewsCommonCrawlUrlGenerator

# Generate WARC URLs for News crawl from August to December 2024, limit to 100
generator = NewsCommonCrawlUrlGenerator(
    start_snapshot_str="2024-08",
    end_snapshot_str="2024-12",
    limit=100,
)
urls = generator.generate_urls()
print(f"Found {len(urls)} WARC files (limited to 100)")

Using generate_data_urls with Pre-fetched Path URLs

from nemo_curator.stages.text.download.common_crawl.url_generation import MainCommonCrawlUrlGenerator

generator = MainCommonCrawlUrlGenerator(
    start_snapshot_str="2024-40",
    end_snapshot_str="2024-50",
)

# First get the path URLs (warc.paths.gz files)
path_urls = generator.generate_path_urls()

# Then resolve them to individual WARC file URLs
warc_urls = generator.generate_data_urls(path_urls)
