Implementation:NVIDIA NeMo Curator CommonCrawl Extractor
| Knowledge Sources | |
|---|---|
| Domains | Text Extraction, Web Crawl, Common Crawl, NLP |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
CommonCrawlHTMLExtractor extracts clean text from HTML content in Common Crawl WARC records, performing language detection and delegating to a pluggable boilerplate-removal algorithm.
Description
The CommonCrawlHTMLExtractor class extends DocumentExtractor and serves as the central text extraction component for the Common Crawl pipeline. It provides language-aware boilerplate removal with multiple algorithm choices.
The extraction pipeline for each record proceeds as follows:
- HTML decoding: Raw bytes from the WARC record are decoded using decode_html, which attempts UTF-8 first and falls back to charset detection via charset_normalizer.
- Language detection: The decoded HTML is passed through lang_detect (backed by pycld2) to determine the document language.
- Stop list lookup: The detected language is used to look up a language-specific stop word set from the _stop_lists dictionary. If no stop list exists for the detected language, the record is skipped.
- Text extraction: The chosen HTMLExtractorAlgorithm is called with the HTML, stop words, and language to produce a list of text paragraphs.
- Output assembly: Extracted paragraphs are joined with double newlines and returned with metadata (url, warc_id, source_id, language).
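The five steps above can be sketched in plain Python. This is a minimal illustration of the control flow only, not the library's implementation: decode_html_sketch and extract_sketch are hypothetical names, the language detector and extraction algorithm are passed in as plain callables, and the charset-detection fallback is deliberately left out.

```python
from __future__ import annotations

from typing import Any, Callable


def decode_html_sketch(raw: bytes) -> str | None:
    """Hypothetical stand-in for decode_html: tries UTF-8 only.

    The real helper is described as falling back to charset detection;
    this sketch gives up instead.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def extract_sketch(
    record: dict[str, Any],
    detect_language: Callable[[str], str],
    stop_lists: dict[str, frozenset[str]],
    algorithm: Callable[[str, frozenset[str], str], "list[str] | None"],
) -> dict[str, Any] | None:
    # 1. HTML decoding: empty or undecodable content skips the record.
    content = record.get("content")
    html = decode_html_sketch(content) if content else None
    if html is None:
        return None

    # 2. Language detection (the caller supplies the detector).
    language = detect_language(html)

    # 3. Stop list lookup: no stop list for this language skips the record.
    stop_words = stop_lists.get(language)
    if stop_words is None:
        return None

    # 4. Text extraction: the algorithm returns a list of paragraphs.
    paragraphs = algorithm(html, stop_words, language)
    if not paragraphs:
        return None

    # 5. Output assembly: join paragraphs and attach the metadata.
    return {
        "url": record["url"],
        "warc_id": record["warc_id"],
        "source_id": record["source_id"],
        "language": language,
        "text": "\n\n".join(paragraphs),
    }
```

Passing the detector and algorithm as arguments keeps the sketch runnable without pycld2 or any extraction backend installed.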
Three built-in algorithm choices are supported:
- justext (default) - JusText boilerplate removal
- resiliparse - Resiliparse fast rule-based extractor
- trafilatura - Trafilatura cascading extraction with fallbacks
The algorithm can be specified as a string name, an HTMLExtractorAlgorithm instance, or left as None to use the default (JusText).
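One way to picture how the three accepted forms resolve to a single algorithm object is the sketch below. The stub classes, the ALGORITHMS registry, and resolve_algorithm are all hypothetical names used for illustration. The ValueError for an unknown name, the exact-match lookup, and ignoring algorithm_kwargs when no name is given are assumptions rather than documented behavior.

```python
from __future__ import annotations

from typing import Any


class _StubAlgorithm:
    """Hypothetical stand-in for an HTMLExtractorAlgorithm; records its kwargs."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


class JusTextStub(_StubAlgorithm):
    pass


class ResiliparseStub(_StubAlgorithm):
    pass


class TrafilaturaStub(_StubAlgorithm):
    pass


# Hypothetical registry mapping the documented string names to classes.
ALGORITHMS: dict[str, type[_StubAlgorithm]] = {
    "justext": JusTextStub,
    "resiliparse": ResiliparseStub,
    "trafilatura": TrafilaturaStub,
}


def resolve_algorithm(
    algorithm: _StubAlgorithm | str | None = None,
    algorithm_kwargs: dict[str, Any] | None = None,
) -> _StubAlgorithm:
    # None selects the default algorithm (JusText).
    if algorithm is None:
        return JusTextStub()
    # A string name is looked up and constructed with algorithm_kwargs.
    if isinstance(algorithm, str):
        if algorithm not in ALGORITHMS:
            # Assumption: the error type for unknown names is not documented.
            raise ValueError(
                f"Unknown algorithm {algorithm!r}; expected one of {sorted(ALGORITHMS)}"
            )
        return ALGORITHMS[algorithm](**(algorithm_kwargs or {}))
    # An already-built instance is used unchanged.
    return algorithm
```

Because an instance carries its own configuration, algorithm_kwargs only matters in the string case, which matches the parameter description in the I/O Contract.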
Usage
Use this class when you need to extract readable text from Common Crawl HTML content. It is typically used after CommonCrawlWarcIterator has parsed WARC files into individual records containing raw HTML bytes.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/common_crawl/extract.py
- Lines: 1-104
Signature
class CommonCrawlHTMLExtractor(DocumentExtractor):
    def __init__(
        self,
        algorithm: HTMLExtractorAlgorithm | str | None = None,
        algorithm_kwargs: dict | None = None,
        stop_lists: dict[str, frozenset[str]] | None = None,
    ): ...

    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None: ...

    def input_columns(self) -> list[str]: ...

    def output_columns(self) -> list[str]: ...
Import
from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| algorithm | HTMLExtractorAlgorithm or str or None | No | The HTML extraction algorithm to use. Accepts "justext", "resiliparse", "trafilatura" as strings, or an HTMLExtractorAlgorithm instance. Defaults to JusTextExtractor if None |
| algorithm_kwargs | dict or None | No | Keyword arguments passed to the algorithm constructor when algorithm is specified as a string |
| stop_lists | dict[str, frozenset[str]] or None | No | Dictionary mapping language names to frozensets of stop words. If None, loads the default stop list dictionary |
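A custom stop_lists value has to match the documented type, dict[str, frozenset[str]]. The helper below is a hypothetical convenience for building one from ordinary word lists. Upper-casing the language keys is an assumption, based on the uppercase language names (for example "ENGLISH") shown in the Outputs table; check the default stop list dictionary for the exact key format before relying on it.

```python
from __future__ import annotations

from typing import Iterable


def build_stop_lists(
    words_by_language: dict[str, Iterable[str]],
) -> dict[str, frozenset[str]]:
    """Hypothetical helper: normalize word lists into the stop_lists shape."""
    return {
        # Assumption: keys are uppercase language names such as "ENGLISH".
        language.upper(): frozenset(
            word.strip() for word in words if word.strip()
        )
        for language, words in words_by_language.items()
    }


custom_stop_lists = build_stop_lists(
    {
        "english": ["the", "and", "of", " to ", ""],
        "german": ["der", "die", "und"],
    }
)
```

The resulting dictionary could then be passed as CommonCrawlHTMLExtractor(stop_lists=custom_stop_lists). Since records without a matching stop list are skipped, a dictionary like this one would presumably restrict extraction to the languages it contains.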
The extract method accepts a record dict with the following input columns:
| Name | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | The URL of the original web page |
| warc_id | str | Yes | The WARC record identifier |
| source_id | str | Yes | The source filename identifier |
| content | bytes | Yes | Raw HTML content bytes from the WARC record |
Outputs
| Name | Type | Description |
|---|---|---|
| url | str | The URL of the original web page |
| warc_id | str | The WARC record identifier |
| source_id | str | The source filename identifier |
| language | str | Detected language of the document (uppercase, e.g. "ENGLISH") |
| text | str | Extracted clean text with paragraphs separated by double newlines |
The extract method returns None if the HTML content is empty or cannot be decoded, if no stop list exists for the detected language, or if no text is extracted.
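Because a skipped record is signalled by None rather than an exception, callers that drive the extractor by hand need to filter the results. The sketch below shows one way to do that. SupportsExtract and extract_all are hypothetical names, and the helper works with any object that exposes an extract(record) method with the signature shown above.

```python
from __future__ import annotations

from typing import Any, Iterable, Iterator, Protocol


class SupportsExtract(Protocol):
    """Anything with an extract(record) method that may return None."""

    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None: ...


def extract_all(
    extractor: SupportsExtract,
    records: Iterable[dict[str, Any]],
) -> Iterator[dict[str, Any]]:
    """Yield only the successfully extracted documents, dropping None results."""
    for record in records:
        result = extractor.extract(record)
        if result is not None:
            yield result
```

With a real extractor this would be called as list(extract_all(CommonCrawlHTMLExtractor(), records)). When the extractor runs inside the full pipeline stage, this filtering is presumably handled for you.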
Usage Examples
Basic Usage with Default Algorithm
from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor

# Uses JusText by default
extractor = CommonCrawlHTMLExtractor()

record = {
    "url": "https://example.com/page",
    "warc_id": "some-warc-id",
    "source_id": "CC-MAIN-2024-01.warc.gz",
    "content": b"<html><body><p>Hello world</p></body></html>",
}

# result is a dict with url, warc_id, source_id, language, and text,
# or None if the record was skipped (see the I/O Contract above)
result = extractor.extract(record)
Using a Specific Algorithm
from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor
# Use Trafilatura with custom settings
extractor = CommonCrawlHTMLExtractor(
    algorithm="trafilatura",
    algorithm_kwargs={"required_stopword_density": 0.25},
)
Using a Custom Algorithm Instance
from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor
from nemo_curator.stages.text.download.html_extractors.resiliparse import ResiliparseExtractor
extractor = CommonCrawlHTMLExtractor(
    algorithm=ResiliparseExtractor(
        required_stopword_density=0.30,
        main_content=True,
    ),
)
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_WARC_Iterator - Provides the record dicts consumed by this extractor
- NVIDIA_NeMo_Curator_JusText_Extractor - Default HTML extraction algorithm
- NVIDIA_NeMo_Curator_Resiliparse_Extractor - Alternative fast extraction algorithm
- NVIDIA_NeMo_Curator_Trafilatura_Extractor - Alternative high-quality extraction algorithm
- NVIDIA_NeMo_Curator_Download_Utils - Provides decode_html and lang_detect utilities
- NVIDIA_NeMo_Curator_CommonCrawlDownloadExtractStage - Orchestrates the full Common Crawl pipeline