Implementation:NVIDIA NeMo Curator CommonCrawlDownloadExtractStage

Knowledge Sources	NeMo Curator, NeMo Curator Docs
Domains	Data_Curation, NLP, Web_Crawling
Last Updated	2026-02-14 17:00 GMT

Overview

CommonCrawlDownloadExtractStage is the concrete NeMo Curator stage for downloading Common Crawl web archives and extracting text from them.

Description

CommonCrawlDownloadExtractStage is a composite processing stage that covers the full path from Common Crawl WARC files to plain text. It combines three steps: URL discovery from the Common Crawl index, content download with compression handling (zstandard), and HTML-to-text extraction with a configurable extractor (jusText, resiliparse, or trafilatura). Two sibling stages, ArxivDownloadExtractStage and WikipediaDownloadExtractStage, follow the same pattern for their respective data sources.
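
To make the composite idea concrete, here is a minimal sketch of chaining a fetch step and an extraction step over a list of URLs. It is not NeMo Curator code: run_composite, extract_text, and the in-memory "fetch" are hypothetical names, and the standard-library HTMLParser stands in for the real extractors (jusText, resiliparse, trafilatura), which do considerably more than strip tags.

```python
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Hypothetical stand-in for an HTML-to-text extractor."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def run_composite(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    extract: Callable[[str], str],
) -> Iterator[dict]:
    """Chain the steps: for each discovered URL, fetch it, then extract text."""
    for url in urls:
        yield {"url": url, "text": extract(fetch(url))}


# Usage with an in-memory "archive" instead of a real download.
pages = {
    "https://example.com/a": (
        "<html><head><style>p{}</style></head>"
        "<body><p>Hello</p><script>x()</script><p>World</p></body></html>"
    )
}
records = list(run_composite(pages, pages.__getitem__, extract_text))
```

Passing the fetch and extract steps in as callables mirrors the way the real stage accepts a configurable extractor: the chaining logic stays the same while the individual steps can be swapped.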

Usage

Import this stage when you need to build a text curation pipeline that sources training data from Common Crawl web archives. Use ArxivDownloadExtractStage for academic papers or WikipediaDownloadExtractStage for encyclopedia content.

Code Reference

Source Location

Repository: NeMo Curator
File: nemo_curator/stages/text/download/common_crawl/stage.py
Lines: L30-91

Signature

class CommonCrawlDownloadExtractStage(CompositeStage):
    def __init__(
        self,
        output_dir: str,
        crawl_urls: list[str] = None,
        html_extractor: HTMLExtractor = None,
        start_snapshot: str = None,
        end_snapshot: str = None,
        url_limit: int = None,
        seed: int = None,
        force_download: bool = False,
        text_field: str = "text",
    ):
        """
        Args:
            output_dir: Base directory for downloaded/extracted text.
            crawl_urls: Explicit list of WARC URLs to process.
            html_extractor: HTML-to-text extractor (jusText/resiliparse/trafilatura).
            start_snapshot: Start of Common Crawl snapshot range.
            end_snapshot: End of Common Crawl snapshot range.
            url_limit: Maximum number of URLs to process.
            seed: Random seed for URL sampling.
            force_download: Re-download even if files exist.
            text_field: Column name for extracted text.
        """
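
The docstring does not say how url_limit and seed interact. One plausible reading, shown here purely as an assumption, is that a seed triggers a deterministic shuffle of the discovered URLs and url_limit then truncates the list. The helper sample_urls is hypothetical and is not part of NeMo Curator.

```python
import random
from typing import Optional, Sequence


def sample_urls(
    urls: Sequence[str],
    url_limit: Optional[int] = None,
    seed: Optional[int] = None,
) -> list[str]:
    """Return at most url_limit URLs.

    Assumption: when a seed is given, the URLs are shuffled deterministically
    before the limit is applied; without a seed the original order is kept.
    """
    selected = list(urls)
    if seed is not None:
        # A private Random instance avoids touching global random state.
        random.Random(seed).shuffle(selected)
    if url_limit is not None:
        selected = selected[:url_limit]
    return selected


# Usage: the same seed always yields the same subset.
urls = [f"https://example.com/{i}.warc.gz" for i in range(10)]
first = sample_urls(urls, url_limit=3, seed=7)
second = sample_urls(urls, url_limit=3, seed=7)
```

Under this reading, fixing the seed makes a limited run reproducible, which is useful when comparing extractors on the same subset of archives.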

Import

from nemo_curator.stages.text.download.common_crawl.stage import CommonCrawlDownloadExtractStage

I/O Contract

Inputs

Name	Type	Required	Description
(sentinel)	_EmptyTask	Yes	Placeholder input; the stage discovers WARC URLs from the Common Crawl index itself

Outputs

Name	Type	Description
documents	DocumentBatch	DataFrame with text, url, language, source_id columns
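
A downstream stage may want to confirm that a batch carries the columns listed above before using it. The sketch below checks a list of column names against that contract. The helpers expected_columns and missing_columns are hypothetical, and the assumption that text_field renames the text column is inferred from the docstring ("Column name for extracted text"), not confirmed against the source.

```python
from typing import Iterable


def expected_columns(text_field: str = "text") -> tuple[str, ...]:
    """Columns named in the I/O contract, with the text column configurable."""
    return (text_field, "url", "language", "source_id")


def missing_columns(columns: Iterable[str], text_field: str = "text") -> list[str]:
    """Return the contract columns that are absent, in contract order."""
    present = set(columns)
    return [name for name in expected_columns(text_field) if name not in present]


# Usage: pass the column names of the output DataFrame, e.g. list(df.columns).
problems = missing_columns(["text", "url"])
```

An empty result means every column in the contract is present; a non-empty list names what is missing, which is easier to act on than a bare KeyError later in the pipeline.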

Usage Examples

Basic Common Crawl Download

from nemo_curator.stages.text.download.common_crawl.stage import CommonCrawlDownloadExtractStage
from nemo_curator.pipeline import Pipeline

# Create download stage for Common Crawl
download_stage = CommonCrawlDownloadExtractStage(
    output_dir="./data/common_crawl",
    start_snapshot="2024-01",
    end_snapshot="2024-06",
    url_limit=1000,
)

# Add to pipeline
pipeline = Pipeline()
pipeline.add_stage(download_stage)
pipeline.run()

Related Pages

Implements Principle
