Implementation:NVIDIA NeMo Curator CommonCrawlDownloadExtractStage

From Leeroopedia
Knowledge Sources
Domains Data_Curation, NLP, Web_Crawling
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete NeMo Curator tool for downloading Common Crawl web archives and extracting text from them.

Description

CommonCrawlDownloadExtractStage is a composite processing stage that covers the full path from downloading Common Crawl WARC files to extracting plain text from them. It combines three steps: URL discovery from the Common Crawl index, content download with compression handling (zstandard), and HTML-to-text extraction with a configurable extractor (jusText, resiliparse, or trafilatura). Two further download stages, ArxivDownloadExtractStage and WikipediaDownloadExtractStage, follow the same pattern for their respective data sources.
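
The three steps described above can be illustrated with a minimal, standard-library-only sketch. It is not NeMo Curator code: the names FAKE_ARCHIVE, discover_urls, download, extract_text, and run_composite are hypothetical, the "archive" is an in-memory dictionary rather than real WARC files, and Python's built-in HTMLParser stands in for extractors such as jusText.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "archive": URL -> raw HTML. A real stage would
# fetch and decompress WARC files over the network instead.
FAKE_ARCHIVE = {
    "https://example.com/a": "<html><body><p>Hello <b>world</b></p><script>x=1</script></body></html>",
    "https://example.com/b": "<html><body><h1>Title</h1><p>Body text.</p></body></html>",
}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def discover_urls():
    """Step 1: URL discovery (stand-in for querying the Common Crawl index)."""
    return sorted(FAKE_ARCHIVE)


def download(url):
    """Step 2: download (stand-in for fetching and decompressing a WARC file)."""
    return FAKE_ARCHIVE[url]


def extract_text(html):
    """Step 3: HTML-to-text extraction (stand-in for a configurable extractor)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def run_composite():
    """Chains the three steps and returns one record per URL."""
    return [{"url": url, "text": extract_text(download(url))} for url in discover_urls()]
```

Keeping each step as a separate function mirrors the composite idea: any one step (for example, the extractor) can be swapped without touching the others.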

Usage

Import this stage when you need to build a text curation pipeline that sources training data from Common Crawl web archives. Use ArxivDownloadExtractStage for academic papers or WikipediaDownloadExtractStage for encyclopedia content.

Code Reference

Source Location

  • Repository: NeMo Curator
  • File: nemo_curator/stages/text/download/common_crawl/stage.py
  • Lines: L30-91

Signature

class CommonCrawlDownloadExtractStage(CompositeStage):
    def __init__(
        self,
        output_dir: str,
        crawl_urls: list[str] = None,
        html_extractor: HTMLExtractor = None,
        start_snapshot: str = None,
        end_snapshot: str = None,
        url_limit: int = None,
        seed: int = None,
        force_download: bool = False,
        text_field: str = "text",
    ):
        """
        Args:
            output_dir: Base directory for downloaded/extracted text.
            crawl_urls: Explicit list of WARC URLs to process.
            html_extractor: HTML-to-text extractor (jusText/resiliparse/trafilatura).
            start_snapshot: Start of Common Crawl snapshot range.
            end_snapshot: End of Common Crawl snapshot range.
            url_limit: Maximum number of URLs to process.
            seed: Random seed for URL sampling.
            force_download: Re-download even if files exist.
            text_field: Column name for extracted text.
        """

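One plausible reading of how url_limit and seed interact is sketched below. This is an assumption for illustration only: the helper select_urls is hypothetical and is not part of NeMo Curator, and the actual stage may cap or sample URLs differently.

```python
import random


def select_urls(urls, url_limit=None, seed=None):
    """Hypothetical helper: cap a URL list, sampling reproducibly when a seed is given."""
    if url_limit is None or url_limit >= len(urls):
        # No cap, or the cap is not binding: keep everything.
        return list(urls)
    if seed is None:
        # No seed: keep the first url_limit entries in their original order.
        return list(urls[:url_limit])
    # Seeded: draw a reproducible random subset, then restore the original order.
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(urls)), url_limit))
    return [url for index, url in enumerate(urls) if index in chosen]


# Illustrative placeholder names, not real Common Crawl paths.
example_urls = [f"warc-{i}" for i in range(100)]
subset = select_urls(example_urls, url_limit=10, seed=42)
```

Using a dedicated random.Random instance keeps the sampling reproducible for a given seed without touching the global random state.
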
Import

from nemo_curator.stages.text.download.common_crawl.stage import CommonCrawlDownloadExtractStage

I/O Contract

Inputs

Name | Type | Required | Description
(sentinel) | _EmptyTask | Yes | Stage auto-discovers URLs from the Common Crawl index

Outputs

Name | Type | Description
documents | DocumentBatch | DataFrame with text, url, language, and source_id columns
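
A downstream consumer may want to check that a batch carries the columns listed above. The sketch below is a hedged illustration using plain dictionaries in place of a DataFrame; expected_columns and missing_columns are hypothetical helpers, and the assumption that text_field only renames the text column is inferred from the parameter description, not confirmed against the source.

```python
def expected_columns(text_field="text"):
    """Column names listed in the output contract; text_field renames the text column (assumed)."""
    return (text_field, "url", "language", "source_id")


def missing_columns(records, text_field="text"):
    """Return, sorted, the expected column names absent from at least one record."""
    expected = expected_columns(text_field)
    missing = set()
    for record in records:
        missing.update(name for name in expected if name not in record)
    return sorted(missing)


# Illustrative rows with made-up values.
complete_batch = [
    {"text": "Hello world", "url": "https://example.com/a", "language": "EN", "source_id": "warc-0"},
]
partial_batch = [{"text": "Hello world", "url": "https://example.com/a"}]
```

An empty result from missing_columns means every record has all four columns.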

Usage Examples

Basic Common Crawl Download

from nemo_curator.stages.text.download.common_crawl.stage import CommonCrawlDownloadExtractStage
from nemo_curator.pipeline import Pipeline

# Create download stage for Common Crawl
download_stage = CommonCrawlDownloadExtractStage(
    output_dir="./data/common_crawl",
    start_snapshot="2024-01",
    end_snapshot="2024-06",
    url_limit=1000,
)

# Add to pipeline
pipeline = Pipeline()
pipeline.add_stage(download_stage)
pipeline.run()
