Principle:NVIDIA NeMo Curator Web Data Acquisition
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, NLP, Web_Crawling |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Technique for programmatically downloading and extracting text content from large-scale web archives and structured data sources for use in language model training.
Description
Web Data Acquisition is the process of systematically downloading raw data from public web archives (such as Common Crawl WARC files), academic repositories (arXiv), and encyclopedias (Wikipedia), then extracting clean text content from HTML or other markup formats. This addresses the fundamental challenge of sourcing diverse, large-scale text corpora for training language models. The process typically involves URL discovery, content downloading (handling compression, rate limiting, and retries), and HTML-to-text extraction using specialized parsers.
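The retry handling mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library. The names `fetch_bytes` and `download_with_retry`, their parameters, and the exponential backoff schedule are assumptions made for this example, not NeMo Curator's API.

```python
import time
import urllib.request
from typing import Callable


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Fetch a URL and return the raw (possibly compressed) body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_retry(
    url: str,
    max_retries: int = 3,
    base_delay: float = 1.0,
    fetch: Callable[[str], bytes] = fetch_bytes,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call `fetch(url)` up to `max_retries` times with exponential backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except OSError:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay * 1, 2, 4, ... seconds between attempts.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `fetch` and `sleep` as parameters keeps the sketch testable without network access; a production downloader would likely also add rate limiting and distinguish retryable from permanent errors.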
In NeMo Curator, this is implemented as composite download-extract stages that handle the full lifecycle: discovering source URLs, downloading compressed archives, and extracting plain text using configurable HTML extractors (jusText, resiliparse, or trafilatura).
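To illustrate the idea of a composite download-extract stage, here is a rough sketch in plain Python. The class `DownloadExtractStage`, its fields, and the `EXTRACTORS` registry are hypothetical names invented for this example; they do not reproduce NeMo Curator's actual classes, and the toy extractors only decode bytes rather than performing real HTML boilerplate removal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator


@dataclass
class DownloadExtractStage:
    """Hypothetical composite stage: URL discovery -> download -> extraction."""

    generate_urls: Callable[[], Iterable[str]]
    download: Callable[[str], bytes]
    extract: Callable[[bytes], str]

    def run(self) -> Iterator[dict]:
        for url in self.generate_urls():
            raw = self.download(url)
            text = self.extract(raw)
            if text:  # skip records that yield no text
                yield {"url": url, "text": text}


def _plain(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace").strip()


def _collapse_whitespace(raw: bytes) -> str:
    return " ".join(raw.decode("utf-8", errors="replace").split())


# Extractors are looked up by name so one can be swapped in via configuration.
# These are stand-ins, not jusText, resiliparse, or trafilatura.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "plain": _plain,
    "collapse_whitespace": _collapse_whitespace,
}
```

A stage built this way can be exercised with an in-memory dictionary standing in for the network, for example `DownloadExtractStage(lambda: list(pages), pages.__getitem__, EXTRACTORS["plain"])`, where `pages` maps URLs to bytes.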
Usage
Use this principle when building a text curation pipeline that needs to source training data from public web archives. It is typically the first step in a text data curation workflow and should be followed by content cleaning, quality filtering, and deduplication stages.
Theoretical Basis
Web data acquisition follows a three-phase pattern:
- URL Discovery: Enumerate available data sources (Common Crawl snapshots, arXiv dump URLs, Wikipedia database dumps)
- Content Download: Fetch raw content with retry logic, decompression (zstandard, gzip), and WARC record parsing
- Text Extraction: Convert HTML to plain text using boilerplate removal algorithms
Pseudo-code:
# Abstract acquisition algorithm
urls = discover_source_urls(source_type, start_date, end_date)
for url in urls:
    raw_content = download_with_retry(url, max_retries=3)
    decompressed = decompress(raw_content)  # zstd, gzip
    text = extract_text(decompressed, extractor="justext")
    yield Document(text=text, url=url, source=source_type)
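
The `decompress` and `extract_text` steps in the pseudo-code can be made concrete in a small sketch. Everything here is an assumption for illustration: the format is detected from magic bytes, the zstd branch relies on the third-party `zstandard` package being installed, and the extractor is a naive tag stripper built on the standard library's `html.parser`, swapped in for a real boilerplate-removal algorithm such as jusText.

```python
import gzip
from html.parser import HTMLParser

GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def decompress(raw: bytes) -> bytes:
    """Decompress gzip or zstd payloads, detected by their magic bytes."""
    if raw.startswith(GZIP_MAGIC):
        return gzip.decompress(raw)
    if raw.startswith(ZSTD_MAGIC):
        # Assumption: the third-party `zstandard` package is available.
        import zstandard
        return zstandard.ZstdDecompressor().decompressobj().decompress(raw)
    return raw  # treat anything else as already uncompressed


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    _SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html_bytes: bytes) -> str:
    """Naive HTML-to-text conversion: one line per text node, no boilerplate removal."""
    parser = _TextCollector()
    parser.feed(html_bytes.decode("utf-8", errors="replace"))
    parser.close()
    return "\n".join(parser.chunks)
```

Unlike a boilerplate-removal extractor, this sketch keeps navigation menus, footers, and other page chrome, so it only shows where the extraction step fits, not how a production extractor scores and filters content.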