Principle:NVIDIA NeMo Curator Web Data Acquisition

Knowledge Sources	NeMo Curator Docs NeMo Curator CCNet
Domains	Data_Curation, NLP, Web_Crawling
Last Updated	2026-02-14 17:00 GMT

Overview

Technique for programmatically downloading and extracting text content from large-scale web archives and structured data sources for use in language model training.

Description

Web Data Acquisition is the process of systematically downloading raw data from public web archives (such as Common Crawl WARC files), academic repositories (arXiv), and encyclopedias (Wikipedia), then extracting clean text content from HTML or other markup formats. This addresses the fundamental challenge of sourcing diverse, large-scale text corpora for training language models. The process typically involves URL discovery, content downloading (handling compression, rate limiting, and retries), and HTML-to-text extraction using specialized parsers.

In NeMo Curator, this is implemented as composite download-extract stages that handle the full lifecycle: discovering source URLs, downloading compressed archives, and extracting plain text using configurable HTML extractors (jusText, resiliparse, or trafilatura).
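
The download step described above (retries plus decompression) can be illustrated with a minimal standard-library sketch. It is not NeMo Curator's API: the names download_with_retry and maybe_gunzip, the injected fetch callable, and the backoff schedule are all assumptions made for illustration.

```python
import gzip
import time
from typing import Callable


def download_with_retry(
    url: str,
    fetch: Callable[[str], bytes],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying on OSError with exponential backoff.

    Hypothetical helper: one initial attempt plus up to max_retries retries.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise AssertionError("unreachable for max_retries >= 0")


def maybe_gunzip(payload: bytes) -> bytes:
    """Decompress gzip data (detected by its magic bytes); pass other data through."""
    if payload[:2] == b"\x1f\x8b":
        return gzip.decompress(payload)
    return payload
```

Passing fetch as a parameter keeps the retry logic testable without network access. In practice it could be something like lambda u: urllib.request.urlopen(u, timeout=30).read(), since urllib's URLError is a subclass of OSError. Zstandard-compressed sources would need a third-party decompressor and are not covered here.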

Usage

Use this principle when building a text curation pipeline that needs to source training data from public web archives. Acquisition is typically the first step in a text data curation workflow and should be followed by content cleaning, quality filtering, and deduplication stages.

Theoretical Basis

Web data acquisition follows a three-phase pattern:

1. URL Discovery: Enumerate available data sources (Common Crawl snapshots, arXiv dump URLs, Wikipedia database dumps).
2. Content Download: Fetch raw content with retry logic, decompression (zstandard, gzip), and WARC record parsing.
3. Text Extraction: Convert HTML to plain text using boilerplate removal algorithms.

Pseudo-code:

# Abstract acquisition algorithm (helper functions are placeholders, not a real API)
def acquire(source_type, start_date, end_date):
    urls = discover_source_urls(source_type, start_date, end_date)
    for url in urls:
        raw_content = download_with_retry(url, max_retries=3)
        decompressed = decompress(raw_content)  # zstd or gzip, depending on the source
        text = extract_text(decompressed, extractor="justext")
        yield Document(text=text, url=url, source=source_type)
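
The download, decompress, and extract loop in the pseudo-code above can be made concrete with the Python standard library alone. The following is a sketch under stated assumptions, not the NeMo Curator implementation: acquire_documents, the fetch callable, and the Document fields are hypothetical, and the html.parser-based extractor only strips tags plus script and style contents. It is a stand-in for boilerplate-removal extractors such as jusText, resiliparse, or trafilatura, and does not remove navigation or other boilerplate text.

```python
import gzip
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Iterable, Iterator


@dataclass
class Document:
    text: str
    url: str
    source: str


class _TextCollector(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0  # > 0 while inside a skipped element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Naive HTML-to-text conversion (no boilerplate removal)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def acquire_documents(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    source: str,
) -> Iterator[Document]:
    """Yield one Document per URL: fetch, gunzip if needed, extract text."""
    for url in urls:
        raw = fetch(url)
        if raw[:2] == b"\x1f\x8b":  # gzip magic bytes
            raw = gzip.decompress(raw)
        html = raw.decode("utf-8", errors="replace")
        yield Document(text=extract_text(html), url=url, source=source)
```

Because acquire_documents is a generator, documents are produced one at a time and the caller decides how many to hold in memory. This sketch treats each fetched payload as a single HTML page; real WARC archives bundle many records per file and need a WARC record parser between the decompression and extraction steps.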

Related Pages

Implemented By

Implementation:NVIDIA_NeMo_Curator_CommonCrawlDownloadExtractStage
