Implementation:NVIDIA NeMo Curator CommonCrawlDownloadExtractStage
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, NLP, Web_Crawling |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A concrete tool, provided by NeMo Curator, for downloading Common Crawl web archives and extracting text from them.
Description
The CommonCrawlDownloadExtractStage is a composite processing stage that covers the full path from Common Crawl WARC files to plain text. It combines three steps: URL discovery from the Common Crawl index, content download with compression handling (zstandard), and HTML-to-text extraction using a configurable extractor (jusText, resiliparse, or trafilatura). Two further download stages, ArxivDownloadExtractStage and WikipediaDownloadExtractStage, follow the same pattern for their respective data sources.
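The three responsibilities described above can be illustrated with a minimal standard-library sketch. It does not use NeMo Curator and is not the stage's implementation: `discover_urls`, `fetch`, `extract_text`, and `run` are hypothetical names, the download is stubbed with an in-memory payload instead of a network request, gzip stands in for the compression layer because it ships with Python, and `html.parser` stands in for jusText, resiliparse, or trafilatura.

```python
import gzip
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def discover_urls(snapshot):
    # Hypothetical stand-in for an index lookup: returns fake archive URLs.
    return [f"https://example.invalid/{snapshot}/part-{i}.gz" for i in range(2)]


def fetch(url):
    # Stubbed download: builds a compressed HTML payload in memory, no network.
    html = (
        "<html><body><h1>Page</h1><script>x=1</script>"
        f"<p>from {url}</p></body></html>"
    )
    return gzip.compress(html.encode("utf-8"))


def extract_text(payload):
    # Decompress the payload, then strip the markup.
    parser = _TextCollector()
    parser.feed(gzip.decompress(payload).decode("utf-8"))
    parser.close()
    return " ".join(parser.chunks)


def run(snapshot):
    # Chain the three steps: discover, download, extract.
    return [
        {"url": url, "text": extract_text(fetch(url))}
        for url in discover_urls(snapshot)
    ]
```

Keeping each step a separate function mirrors the composite idea: any one step, for example the extractor, could be swapped without touching the other two.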
Usage
Import this stage when you need to build a text curation pipeline that sources training data from Common Crawl web archives. Use ArxivDownloadExtractStage for academic papers or WikipediaDownloadExtractStage for encyclopedia content.
Code Reference
Source Location
- Repository: NeMo Curator
- File: nemo_curator/stages/text/download/common_crawl/stage.py
- Lines: L30-91
Signature
class CommonCrawlDownloadExtractStage(CompositeStage):
    def __init__(
        self,
        output_dir: str,
        crawl_urls: list[str] = None,
        html_extractor: HTMLExtractor = None,
        start_snapshot: str = None,
        end_snapshot: str = None,
        url_limit: int = None,
        seed: int = None,
        force_download: bool = False,
        text_field: str = "text",
    ):
        """
        Args:
            output_dir: Base directory for downloaded/extracted text.
            crawl_urls: Explicit list of WARC URLs to process.
            html_extractor: HTML-to-text extractor (jusText/resiliparse/trafilatura).
            start_snapshot: Start of Common Crawl snapshot range.
            end_snapshot: End of Common Crawl snapshot range.
            url_limit: Maximum number of URLs to process.
            seed: Random seed for URL sampling.
            force_download: Re-download even if files exist.
            text_field: Column name for extracted text.
        """
Import
from nemo_curator.stages.text.download.common_crawl.stage import CommonCrawlDownloadExtractStage
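The signature documents `url_limit` as a cap on the number of URLs and `seed` as the random seed for URL sampling. One plausible reading of how the two interact is sketched below. This is an assumption drawn from the argument descriptions, not the stage's actual code, and `limit_urls` is a hypothetical helper.

```python
import random


def limit_urls(urls, url_limit=None, seed=None):
    """Return at most url_limit URLs, sampled reproducibly when seed is set."""
    if url_limit is None or url_limit >= len(urls):
        # No cap, or the cap does not bite: keep everything.
        return list(urls)
    if seed is None:
        # Assumed behaviour without a seed: keep the first url_limit entries.
        return list(urls[:url_limit])
    # Assumed behaviour with a seed: draw a reproducible random subset.
    rng = random.Random(seed)
    return rng.sample(list(urls), url_limit)
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the sample reproducible without disturbing any other random state in the process.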
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (sentinel) | _EmptyTask | Yes | Stage auto-discovers URLs from Common Crawl index |
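As an illustration of what discovering URLs for a snapshot range can involve, the sketch below filters a list of snapshot identifiers and builds the WARC path-listing URL for each one. It assumes snapshots are written as zero-padded `YYYY-WW` strings and that they map to Common Crawl's `CC-MAIN-YYYY-WW` crawl names; both points are assumptions about this stage, and `paths_urls` is a hypothetical helper. The list of known snapshots is passed in rather than fetched, so the sketch needs no network access.

```python
def paths_urls(known_snapshots, start_snapshot, end_snapshot):
    """Build warc.paths.gz URLs for snapshots in [start, end], inclusive."""
    # Zero-padded "YYYY-WW" strings sort chronologically, so plain string
    # comparison is enough to select the range.
    selected = sorted(
        s for s in known_snapshots if start_snapshot <= s <= end_snapshot
    )
    return [
        f"https://data.commoncrawl.org/crawl-data/CC-MAIN-{s}/warc.paths.gz"
        for s in selected
    ]
```

Each `warc.paths.gz` file lists the relative paths of the WARC files in that crawl, which is where a cap such as `url_limit` would then apply.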
Outputs
| Name | Type | Description |
|---|---|---|
| documents | DocumentBatch | DataFrame with text, url, language, source_id columns |
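A downstream step may want to confirm that every record carries the four columns listed above before using them. The sketch below shows one way to do that on plain dictionaries; it uses a list of dicts as a stand-in for a DocumentBatch, and `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, not part of NeMo Curator.

```python
REQUIRED_COLUMNS = ("text", "url", "language", "source_id")


def missing_columns(records):
    """Map record index to the required columns that record lacks."""
    problems = {}
    for i, record in enumerate(records):
        absent = [c for c in REQUIRED_COLUMNS if c not in record]
        if absent:
            problems[i] = absent
    return problems
```

An empty result means every record has all four columns; otherwise the dictionary shows which records to inspect and which fields are missing from each.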
Usage Examples
Basic Common Crawl Download
from nemo_curator.stages.text.download.common_crawl.stage import CommonCrawlDownloadExtractStage
from nemo_curator.pipeline import Pipeline

# Create download stage for Common Crawl
download_stage = CommonCrawlDownloadExtractStage(
    output_dir="./data/common_crawl",
    start_snapshot="2024-01",
    end_snapshot="2024-06",
    url_limit=1000,
)

# Add to pipeline
pipeline = Pipeline()
pipeline.add_stage(download_stage)
pipeline.run()