Implementation:NVIDIA NeMo Curator CommonCrawl Extractor

From Leeroopedia
Revision as of 13:20, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/NVIDIA_NeMo_Curator_CommonCrawl_Extractor.md)
Knowledge Sources
Domains: Text Extraction, Web Crawl, Common Crawl, NLP
Last Updated: 2026-02-14 00:00 GMT

Overview

CommonCrawlHTMLExtractor extracts clean text from HTML content in Common Crawl WARC records, performing language detection and delegating to a pluggable boilerplate-removal algorithm.

Description

The CommonCrawlHTMLExtractor class extends DocumentExtractor and serves as the central text extraction component for the Common Crawl pipeline. It provides language-aware boilerplate removal with multiple algorithm choices.

The extraction pipeline for each record proceeds as follows:

  1. HTML decoding: Raw bytes from the WARC record are decoded using decode_html, which attempts UTF-8 first and falls back to charset detection via charset_normalizer.
  2. Language detection: The decoded HTML is passed through lang_detect (backed by pycld2) to determine the document language.
  3. Stop list lookup: The detected language is used to look up a language-specific stop word set from the _stop_lists dictionary. If no stop list exists for the detected language, the record is skipped.
  4. Text extraction: The chosen HTMLExtractorAlgorithm is called with the HTML, stop words, and language to produce a list of text paragraphs.
  5. Output assembly: Extracted paragraphs are joined with double newlines and returned with metadata (url, warc_id, source_id, language).
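The five steps above can be sketched in plain Python. This is a minimal illustration of the control flow, not the library's code: decode_html_sketch, extract_sketch, and the detect_language and algorithm callables are hypothetical stand-ins, and the charset-detection fallback is omitted so the sketch needs no third-party packages.

```python
from __future__ import annotations

from typing import Any, Callable


def decode_html_sketch(raw: bytes) -> str | None:
    """Step 1: try UTF-8 first."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the real helper falls back to charset detection here;
        # this sketch simply gives up.
        return None


def extract_sketch(
    record: dict[str, Any],
    detect_language: Callable[[str], str],
    algorithm: Callable[[str, frozenset[str], str], list[str]],
    stop_lists: dict[str, frozenset[str]],
) -> dict[str, Any] | None:
    if not record["content"]:
        return None  # empty HTML
    html = decode_html_sketch(record["content"])
    if html is None:
        return None  # undecodable bytes
    language = detect_language(html)  # step 2
    stop_words = stop_lists.get(language)  # step 3
    if stop_words is None:
        return None  # no stop list for this language: skip the record
    paragraphs = algorithm(html, stop_words, language)  # step 4
    if not paragraphs:
        return None  # nothing extracted
    return {  # step 5
        "url": record["url"],
        "warc_id": record["warc_id"],
        "source_id": record["source_id"],
        "language": language,
        "text": "\n\n".join(paragraphs),
    }
```

Passing the language detector and the algorithm in as callables keeps the sketch testable without pycld2 or any extraction backend installed.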

Three built-in algorithm choices are supported:

  • justext (default) - JusText boilerplate removal
  • resiliparse - Resiliparse fast rule-based extractor
  • trafilatura - Trafilatura cascading extraction with fallbacks

The algorithm can be specified as a string name, an HTMLExtractorAlgorithm instance, or left as None to use the default (JusText).
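One way to picture how the three accepted forms could be resolved is shown below. This is a sketch under stated assumptions: the stub classes, the _REGISTRY mapping, and resolve_algorithm are hypothetical names invented for illustration, and raising ValueError for an unknown name is an assumed behavior rather than something this page confirms.

```python
from __future__ import annotations

from typing import Any


class AlgorithmStub:
    """Hypothetical stand-in for an HTMLExtractorAlgorithm implementation."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


class JusTextStub(AlgorithmStub): ...
class ResiliparseStub(AlgorithmStub): ...
class TrafilaturaStub(AlgorithmStub): ...


# Hypothetical name-to-class registry for the three built-in choices.
_REGISTRY: dict[str, type[AlgorithmStub]] = {
    "justext": JusTextStub,
    "resiliparse": ResiliparseStub,
    "trafilatura": TrafilaturaStub,
}


def resolve_algorithm(
    algorithm: AlgorithmStub | str | None = None,
    algorithm_kwargs: dict[str, Any] | None = None,
) -> AlgorithmStub:
    if algorithm is None:
        return JusTextStub()  # default choice
    if isinstance(algorithm, str):
        if algorithm not in _REGISTRY:
            # Assumed behavior for unknown names.
            raise ValueError(f"Unknown algorithm: {algorithm!r}")
        # Keyword arguments only apply when the algorithm is given by name.
        return _REGISTRY[algorithm](**(algorithm_kwargs or {}))
    return algorithm  # an already-constructed instance is used as-is
```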

Usage

Use this class when you need to extract readable text from Common Crawl HTML content. It is typically used after CommonCrawlWarcIterator has parsed WARC files into individual records containing raw HTML bytes.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/common_crawl/extract.py
  • Lines: 1-104

Signature

class CommonCrawlHTMLExtractor(DocumentExtractor):
    def __init__(
        self,
        algorithm: HTMLExtractorAlgorithm | str | None = None,
        algorithm_kwargs: dict | None = None,
        stop_lists: dict[str, frozenset[str]] | None = None,
    ): ...

    def extract(self, record: dict[str, Any]) -> dict[str, Any] | None: ...

    def input_columns(self) -> list[str]: ...

    def output_columns(self) -> list[str]: ...

Import

from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor

I/O Contract

Inputs

Name | Type | Required | Description
algorithm | HTMLExtractorAlgorithm or str or None | No | The HTML extraction algorithm to use. Accepts "justext", "resiliparse", "trafilatura" as strings, or an HTMLExtractorAlgorithm instance. Defaults to JusTextExtractor if None.
algorithm_kwargs | dict or None | No | Keyword arguments passed to the algorithm constructor when algorithm is specified as a string.
stop_lists | dict[str, frozenset[str]] or None | No | Dictionary mapping language names to frozensets of stop words. If None, loads the default stop list dictionary.
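A custom stop_lists mapping can be assembled with ordinary Python containers. The sketch below is illustrative only: build_stop_lists is a hypothetical helper, the word lists are toy data, and uppercase keys are an assumption inferred from the uppercase language names the extractor reports (for example "ENGLISH"). Because records whose language has no stop list are skipped, a restricted mapping would also restrict which languages produce output.

```python
from __future__ import annotations


def build_stop_lists(
    words_by_language: dict[str, list[str]],
) -> dict[str, frozenset[str]]:
    """Uppercase each language name and freeze its word list (duplicates collapse)."""
    return {
        language.upper(): frozenset(words)
        for language, words in words_by_language.items()
    }


# Toy word lists for illustration; real stop lists are much longer.
stop_lists = build_stop_lists(
    {
        "english": ["the", "and", "of", "the"],
        "german": ["der", "die", "das"],
    }
)

# With nemo_curator installed, the mapping would be passed as:
#   extractor = CommonCrawlHTMLExtractor(stop_lists=stop_lists)
```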

The extract method accepts a record dict with the following input columns:

Name | Type | Required | Description
url | str | Yes | The URL of the original web page
warc_id | str | Yes | The WARC record identifier
source_id | str | Yes | The source filename identifier
content | bytes | Yes | Raw HTML content bytes from the WARC record
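Before calling extract, it can help to check that a record matches this contract. The helper below is a hypothetical convenience, not part of NeMo Curator; REQUIRED_INPUT_COLUMNS and find_record_problems are names introduced here, built only from the column names and types listed in the table.

```python
from __future__ import annotations

from typing import Any

# Column names and types taken from the input contract above.
REQUIRED_INPUT_COLUMNS: dict[str, type] = {
    "url": str,
    "warc_id": str,
    "source_id": str,
    "content": bytes,
}


def find_record_problems(record: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    problems: list[str] = []
    for name, expected in REQUIRED_INPUT_COLUMNS.items():
        if name not in record:
            problems.append(f"missing column: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name} should be {expected.__name__}, got {actual}")
    return problems
```

A common mistake this catches is passing content as an already-decoded str rather than the raw bytes from the WARC record.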

Outputs

Name | Type | Description
url | str | The URL of the original web page
warc_id | str | The WARC record identifier
source_id | str | The source filename identifier
language | str | Detected language of the document (uppercase, e.g. "ENGLISH")
text | str | Extracted clean text with paragraphs separated by double newlines

The extract method returns None if the HTML content is empty or cannot be decoded, if no stop list exists for the detected language, or if no text is extracted.

Usage Examples

Basic Usage with Default Algorithm

from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor

# Uses JusText by default
extractor = CommonCrawlHTMLExtractor()

record = {
    "url": "https://example.com/page",
    "warc_id": "some-warc-id",
    "source_id": "CC-MAIN-2024-01.warc.gz",
    "content": b"<html><body><p>Hello world</p></body></html>",
}
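# extract() returns a dict on success, or None when the record is skipped
# (for example, when no stop list exists for the detected language).
result = extractor.extract(record)
if result is not None:
    print(result["language"], result["text"])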

Using a Specific Algorithm

from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor

# Use Trafilatura with custom settings
extractor = CommonCrawlHTMLExtractor(
    algorithm="trafilatura",
    algorithm_kwargs={"required_stopword_density": 0.25},
)

Using a Custom Algorithm Instance

from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor
from nemo_curator.stages.text.download.html_extractors.resiliparse import ResiliparseExtractor

extractor = CommonCrawlHTMLExtractor(
    algorithm=ResiliparseExtractor(
        required_stopword_density=0.30,
        main_content=True,
    ),
)
