Implementation:NVIDIA NeMo Curator CommonCrawl Extractor

Knowledge Sources	NVIDIA NeMo Curator
Domains	Text Extraction, Web Crawl, Common Crawl, NLP
Last Updated	2026-02-14 00:00 GMT

Overview

CommonCrawlHTMLExtractor extracts clean text from HTML content in Common Crawl WARC records, performing language detection and delegating to a pluggable boilerplate-removal algorithm.

Description

The CommonCrawlHTMLExtractor class extends DocumentExtractor and serves as the central text extraction component for the Common Crawl pipeline. It provides language-aware boilerplate removal with multiple algorithm choices.

The extraction pipeline for each record proceeds as follows:

HTML decoding: Raw bytes from the WARC record are decoded using decode_html, which attempts UTF-8 first and falls back to charset detection via charset_normalizer.
Language detection: The decoded HTML is passed through lang_detect (backed by pycld2) to determine the document language.
Stop list lookup: The detected language is used to look up a language-specific stop word set from the _stop_lists dictionary. If no stop list exists for the detected language, extract returns None and the record is skipped.
Text extraction: The chosen HTMLExtractorAlgorithm is called with the HTML, stop words, and language to produce a list of text paragraphs.
Output assembly: Extracted paragraphs are joined with double newlines and returned with metadata (url, warc_id, source_id, language).
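
The five steps above can be sketched in plain Python. This is a minimal illustration of the control flow, not the library's code: decode_sketch and extract_sketch are hypothetical names, the Latin-1 fallback stands in for charset detection via charset_normalizer, and language detection and paragraph extraction are passed in as callables in place of lang_detect and an HTMLExtractorAlgorithm.

```python
from __future__ import annotations

from typing import Any, Callable


def decode_sketch(raw: bytes) -> str:
    """Try UTF-8 first; Latin-1 is a placeholder for real charset detection."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def extract_sketch(
    record: dict[str, Any],
    detect_language: Callable[[str], str],
    extract_paragraphs: Callable[[str, frozenset[str], str], list[str]],
    stop_lists: dict[str, frozenset[str]],
) -> dict[str, Any] | None:
    """Hypothetical walk-through of the per-record pipeline described above."""
    raw = record["content"]
    if not raw:
        return None  # empty content: nothing to extract

    html = decode_sketch(raw)                # 1. HTML decoding
    language = detect_language(html)         # 2. Language detection
    stop_words = stop_lists.get(language)    # 3. Stop list lookup
    if stop_words is None:
        return None  # no stop list for this language: skip the record

    paragraphs = extract_paragraphs(html, stop_words, language)  # 4. Text extraction
    if not paragraphs:
        return None  # the algorithm found no usable text

    return {                                 # 5. Output assembly
        "url": record["url"],
        "warc_id": record["warc_id"],
        "source_id": record["source_id"],
        "language": language,
        "text": "\n\n".join(paragraphs),
    }


demo_record = {
    "url": "https://example.com/page",
    "warc_id": "some-warc-id",
    "source_id": "CC-MAIN-2024-01.warc.gz",
    "content": b"<p>Hello world</p><p>Second paragraph</p>",
}
demo_result = extract_sketch(
    demo_record,
    detect_language=lambda html: "ENGLISH",  # stand-in for lang_detect
    extract_paragraphs=lambda html, stops, lang: ["Hello world", "Second paragraph"],
    stop_lists={"ENGLISH": frozenset({"the", "and"})},
)
```

Passing the detector and the extraction algorithm as arguments mirrors the pluggable design described above and keeps the sketch runnable without NeMo Curator installed.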

Three built-in algorithm choices are supported:

justext (default) - JusText boilerplate removal
resiliparse - Resiliparse fast rule-based extractor
trafilatura - Trafilatura cascading extraction with fallbacks

The algorithm can be specified as a string name, an HTMLExtractorAlgorithm instance, or left as None to use the default (JusText).
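
The three accepted forms of the algorithm argument could be resolved along the following lines. This is a sketch under stated assumptions: the stub classes and resolve_algorithm are hypothetical stand-ins for the real extractor classes and constructor logic, and raising ValueError for an unknown name is an assumed behavior, not something this page confirms.

```python
from __future__ import annotations

from typing import Any


class AlgorithmStub:
    """Hypothetical stand-in for HTMLExtractorAlgorithm."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


class JusTextStub(AlgorithmStub):
    """Stand-in for the default JusText extractor."""


class ResiliparseStub(AlgorithmStub):
    """Stand-in for the Resiliparse extractor."""


class TrafilaturaStub(AlgorithmStub):
    """Stand-in for the Trafilatura extractor."""


# Maps the documented string names to (stub) algorithm classes.
ALGORITHMS: dict[str, type[AlgorithmStub]] = {
    "justext": JusTextStub,
    "resiliparse": ResiliparseStub,
    "trafilatura": TrafilaturaStub,
}


def resolve_algorithm(
    algorithm: AlgorithmStub | str | None = None,
    algorithm_kwargs: dict[str, Any] | None = None,
) -> AlgorithmStub:
    """Turn a name, an instance, or None into an algorithm instance."""
    if algorithm is None:
        return JusTextStub()  # default when nothing is specified
    if isinstance(algorithm, str):
        if algorithm not in ALGORITHMS:
            # Assumed error handling for unknown names.
            raise ValueError(f"Unknown algorithm: {algorithm!r}")
        # Keyword arguments apply only when the algorithm is given by name.
        return ALGORITHMS[algorithm](**(algorithm_kwargs or {}))
    return algorithm  # already an instance: use it as-is
```

In this sketch, algorithm_kwargs has no effect when an instance is passed, since an instance is already configured.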

Usage

Use this class when you need to extract readable text from Common Crawl HTML content. It is typically used after CommonCrawlWarcIterator has parsed WARC files into individual records containing raw HTML bytes.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/common_crawl/extract.py
Lines: 1-104

Signature

class CommonCrawlHTMLExtractor(DocumentExtractor):
    def __init__(
        self,
        algorithm: HTMLExtractorAlgorithm | str | None = None,
        algorithm_kwargs: dict | None = None,
        stop_lists: dict[str, frozenset[str]] | None = None,
    ): ...

    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None: ...

    def input_columns(self) -> list[str]: ...

    def output_columns(self) -> list[str]: ...

Import

from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor

I/O Contract

Inputs

Name	Type	Required	Description
algorithm	HTMLExtractorAlgorithm or str or None	No	The HTML extraction algorithm to use. Accepts "justext", "resiliparse", "trafilatura" as strings, or an HTMLExtractorAlgorithm instance. Defaults to JusTextExtractor if None
algorithm_kwargs	dict or None	No	Keyword arguments passed to the algorithm constructor when algorithm is specified as a string
stop_lists	dict[str, frozenset[str]] or None	No	Dictionary mapping language names to frozensets of stop words. If None, loads the default stop list dictionary
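
A custom stop_lists mapping might be assembled as shown below. This is an illustrative sketch: build_stop_lists is a hypothetical helper, the word lists are toy examples, and upper-casing the keys is an assumption based on the uppercase language names (e.g. "ENGLISH") listed in the Outputs table.

```python
from __future__ import annotations


def build_stop_lists(words_by_language: dict[str, list[str]]) -> dict[str, frozenset[str]]:
    """Convert plain word lists into the dict[str, frozenset[str]] shape."""
    return {
        # Assumption: keys must match the detector's uppercase language names.
        language.upper(): frozenset(words)
        for language, words in words_by_language.items()
    }


custom_stop_lists = build_stop_lists(
    {
        "english": ["the", "and", "of", "to"],
        "german": ["der", "die", "und", "das"],
    }
)

# With NeMo Curator installed, the mapping would be passed as:
# extractor = CommonCrawlHTMLExtractor(stop_lists=custom_stop_lists)
```

Because records whose detected language has no stop list are skipped, a mapping restricted to a few languages also acts as a language filter.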

The extract method accepts a record dict with the following input columns:

Name	Type	Required	Description
url	str	Yes	The URL of the original web page
warc_id	str	Yes	The WARC record identifier
source_id	str	Yes	The source filename identifier
content	bytes	Yes	Raw HTML content bytes from the WARC record

Outputs

Name	Type	Description
url	str	The URL of the original web page
warc_id	str	The WARC record identifier
source_id	str	The source filename identifier
language	str	Detected language of the document (uppercase, e.g. "ENGLISH")
text	str	Extracted clean text with paragraphs separated by double newlines

The extract method returns None if the HTML content is empty or cannot be decoded, if no stop list exists for the detected language, or if no text is extracted.
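
Since None is a normal outcome, callers need to filter it out. The following sketch shows one way to do that; extract_all and FakeExtractor are hypothetical names used only for illustration, and any object with a compatible extract method (such as CommonCrawlHTMLExtractor) could take the stub's place.

```python
from __future__ import annotations

from typing import Any, Iterable, Iterator, Protocol


class SupportsExtract(Protocol):
    """Anything exposing the extract(record) contract described above."""

    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None: ...


def extract_all(
    extractor: SupportsExtract, records: Iterable[dict[str, Any]]
) -> Iterator[dict[str, Any]]:
    """Yield only the records for which extraction produced a result."""
    for record in records:
        result = extractor.extract(record)
        if result is not None:  # None means the record was skipped
            yield result


class FakeExtractor:
    """Hypothetical stub: returns None for empty content, a result otherwise."""

    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None:
        if not record["content"]:
            return None
        return {
            "url": record["url"],
            "language": "ENGLISH",
            "text": record["content"].decode("utf-8"),
        }


records = [
    {"url": "https://example.com/a", "content": b"Some text"},
    {"url": "https://example.com/b", "content": b""},
]
kept = list(extract_all(FakeExtractor(), records))
```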

Usage Examples

Basic Usage with Default Algorithm

from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor

# Uses JusText by default
extractor = CommonCrawlHTMLExtractor()

record = {
    "url": "https://example.com/page",
    "warc_id": "some-warc-id",
    "source_id": "CC-MAIN-2024-01.warc.gz",
    "content": b"<html><body><p>Hello world</p></body></html>",
}
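# extract returns None when the record is skipped (see the I/O Contract),
# so check the result before reading its fields
result = extractor.extract(record)
if result is not None:
    print(result["language"], result["text"])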

Using a Specific Algorithm

from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor

# Use Trafilatura with custom settings
extractor = CommonCrawlHTMLExtractor(
    algorithm="trafilatura",
    algorithm_kwargs={"required_stopword_density": 0.25},
)

Using a Custom Algorithm Instance

from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor
from nemo_curator.stages.text.download.html_extractors.resiliparse import ResiliparseExtractor

extractor = CommonCrawlHTMLExtractor(
    algorithm=ResiliparseExtractor(
        required_stopword_density=0.30,
        main_content=True,
    ),
)

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_WARC_Iterator - Provides the record dicts consumed by this extractor
NVIDIA_NeMo_Curator_JusText_Extractor - Default HTML extraction algorithm
NVIDIA_NeMo_Curator_Resiliparse_Extractor - Alternative fast extraction algorithm
NVIDIA_NeMo_Curator_Trafilatura_Extractor - Alternative high-quality extraction algorithm
NVIDIA_NeMo_Curator_Download_Utils - Provides decode_html and lang_detect utilities
NVIDIA_NeMo_Curator_CommonCrawlDownloadExtractStage - Orchestrates the full Common Crawl pipeline
