Implementation:NVIDIA NeMo Curator JusText Extractor

From Leeroopedia
Knowledge Sources
Domains Text Extraction, Boilerplate Removal, HTML Processing, NLP
Last Updated 2026-02-14 00:00 GMT

Overview

JusTextExtractor implements HTML text extraction using the jusText boilerplate removal algorithm, which classifies HTML text blocks as content or boilerplate based on text length, stopword density, and link density.
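
To make the three signals concrete, here is a minimal sketch of how stopword density and link density could be computed for one text block. It is not the jusText source: the helper names `stopword_density` and `link_density` are hypothetical, and whitespace tokenization is an assumption.

```python
def stopword_density(text: str, stop_words: frozenset) -> float:
    """Fraction of whitespace-separated words found in the stop word set."""
    words = text.split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word.lower() in stop_words)
    return hits / len(words)


def link_density(text: str, chars_in_links: int) -> float:
    """Fraction of the block's characters that sit inside link text."""
    if not text:
        return 0.0
    return chars_in_links / len(text)


block = "This is the main content of the page."
stops = frozenset({"the", "is", "of"})

# Text length, stopword density, and link density for a block with no links.
print(len(block), stopword_density(block, stops), link_density(block, 0))
# prints: 37 0.5 0.0
```

Four of the eight words are stopwords, so the density is 0.5, but at 37 characters the block is still short by the default length_low of 70.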

Description

The JusTextExtractor class extends HTMLExtractorAlgorithm and is the default HTML extraction algorithm used by CommonCrawlHTMLExtractor. It wraps the jusText library to remove boilerplate from HTML pages.

The jusText algorithm operates in several stages:

  1. Segmentation: The HTML document is split into text blocks based on HTML tags that define separate sections (e.g., <div>, <p>, <table>).
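  2. Preprocessing: Contents of <header>, <style>, and <script> tags are removed. Blocks inside <select> elements and blocks containing a copyright symbol (©) are immediately classified as boilerplate.
  3. Context-Free Classification: Each block is assigned one of four classes:
    • Bad (boilerplate) if its link density exceeds max_link_density, or if it is long enough to judge but its stopword density falls below stopwords_low
    • Short if it is shorter than length_low and therefore too small to classify reliably
    • Near-Good if its stopword density is at least stopwords_low but it does not qualify as good
    • Good (main content) if it is longer than length_high and its stopword density is at least stopwords_high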
  4. Context-Sensitive Classification: Short and near-good blocks are reclassified based on surrounding blocks, under the assumption that content and boilerplate tend to cluster together.
  5. Headings Processing: Heading elements (e.g., <h1>, <h2>) are treated separately so that useful headings near content are preserved.
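
The context-free stage (step 3) can be sketched as a small decision function. This is a simplified illustration, not the library's code: the function name `classify_block` is hypothetical, the defaults mirror the constructor signature documented on this page, and the special cases from preprocessing (<select> blocks, copyright symbols) are deliberately left out.

```python
def classify_block(
    length: int,
    stopword_density: float,
    link_density: float,
    length_low: int = 70,
    length_high: int = 200,
    stopwords_low: float = 0.30,
    stopwords_high: float = 0.32,
    max_link_density: float = 0.2,
) -> str:
    """Return 'bad', 'short', 'neargood', or 'good' for a single text block."""
    if link_density > max_link_density:
        return "bad"  # mostly link text, e.g. a navigation menu
    if length < length_low:
        return "short"  # too little text to judge on its own
    if stopword_density >= stopwords_high:
        # Stopword-rich blocks read like prose; only long ones are "good".
        return "good" if length > length_high else "neargood"
    if stopword_density >= stopwords_low:
        return "neargood"
    return "bad"  # long enough to judge, but few stopwords


print(classify_block(37, 0.5, 0.0))    # short
print(classify_block(250, 0.4, 0.0))   # good
print(classify_block(120, 0.4, 0.0))   # neargood
print(classify_block(250, 0.1, 0.0))   # bad
```

Blocks labelled short or neargood are the ones the context-sensitive stage later resolves by looking at their neighbours.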

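For non-spaced languages (Thai, Chinese, Japanese, Korean), the boilerplate check is disabled automatically when is_boilerplate is left as None, because stopword density is unreliable for text that does not separate words with spaces. A warning is logged in that case, and a class-level set (_logged_languages) records which languages have already triggered it so that each language is logged only once.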

Usage

Use this class to strip boilerplate from HTML pages, especially when building linguistic resources such as web corpora. It is best suited to pages whose main content consists of full sentences, because the classifier relies on stopword density. This is the default algorithm used by CommonCrawlHTMLExtractor.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/html_extractors/justext.py
  • Lines: 1-141

Signature

class JusTextExtractor(HTMLExtractorAlgorithm):
    _logged_languages: ClassVar[set[str]] = set()

    def __init__(
        self,
        length_low: int = 70,
        length_high: int = 200,
        stopwords_low: float = 0.30,
        stopwords_high: float = 0.32,
        max_link_density: float = 0.2,
        max_heading_distance: int = 200,
        no_headings: bool = False,
        is_boilerplate: bool | None = None,
    ): ...

    def extract_text(
        self,
        html: str,
        stop_words: frozenset[str],
        language: str,
    ) -> list[str] | None: ...

Import

from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor

I/O Contract

Inputs

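Name | Type | Required | Description
length_low | int | No | Length threshold, in characters, below which a block is classified as short. Defaults to 70
length_high | int | No | Length threshold, in characters, above which a block with high stopword density is classified as good. Defaults to 200
stopwords_low | float | No | Lower stopword density threshold; blocks at or above it can be near-good. Defaults to 0.30
stopwords_high | float | No | Upper stopword density threshold; sufficiently long blocks at or above it are classified as good. Defaults to 0.32
max_link_density | float | No | Maximum allowed link density in a text block; blocks above it are classified as boilerplate. Defaults to 0.2
max_heading_distance | int | No | Maximum distance, in characters, between a heading and following good content for the heading to be kept during context-sensitive classification. Defaults to 200
no_headings | bool | No | If True, headings receive no special treatment during extraction. Defaults to False
is_boilerplate | bool or None | No | Controls boilerplate filtering. True returns only non-boilerplate paragraphs, False keeps all paragraphs. None (default) selects automatically: filtering is disabled for non-spaced languages and enabled otherwise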

The extract_text method accepts:

Name | Type | Required | Description
html | str | Yes | Decoded HTML content string
stop_words | frozenset[str] | Yes | Language-specific stop word set
language | str | Yes | Detected language name (uppercase, e.g., "ENGLISH")
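
As a rough sketch of how raw inputs might be brought into the shape this method expects, the helper below decodes bytes, builds the stop word set, and uppercases the language name. `prepare_inputs` is a hypothetical helper, not part of NeMo Curator, and lowercasing the stop words is an assumption made for consistent matching.

```python
from typing import FrozenSet, Iterable, Tuple


def prepare_inputs(
    raw_html: bytes,
    encoding: str,
    words: Iterable[str],
    language: str,
) -> Tuple[str, FrozenSet[str], str]:
    """Return (html, stop_words, language) ready to pass to extract_text."""
    # Replace undecodable bytes rather than failing on a bad page.
    html = raw_html.decode(encoding, errors="replace")
    # Drop blank entries and normalize case (assumption, see lead-in).
    stop_words = frozenset(w.strip().lower() for w in words if w.strip())
    return html, stop_words, language.strip().upper()


html, stop_words, language = prepare_inputs(
    b"<p>caf\xc3\xa9</p>", "utf-8", ["The", " is ", ""], "english"
)
print(html, sorted(stop_words), language)
```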

Outputs

Name | Type | Description
return value | list[str] or None | List of extracted text paragraphs (non-boilerplate only, unless boilerplate filtering is disabled), or None if the HTML cannot be parsed
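
Because the return value can be None or an empty list, callers usually need to guard before joining paragraphs. The helper below is a minimal sketch of one way to do that; `join_paragraphs` is a hypothetical name, and treating an empty result the same as a parse failure is a design assumption rather than documented behaviour.

```python
from typing import Optional, Sequence


def join_paragraphs(
    paragraphs: Optional[Sequence[str]],
    separator: str = "\n\n",
) -> Optional[str]:
    """Join extracted paragraphs, or return None when there is no usable text."""
    if paragraphs is None:
        return None  # the HTML could not be parsed
    cleaned = [p.strip() for p in paragraphs if p and p.strip()]
    if not cleaned:
        return None  # every block was filtered out or empty
    return separator.join(cleaned)


print(join_paragraphs(None))                   # None
print(join_paragraphs([]))                     # None
print(repr(join_paragraphs([" a ", "", "b"])))  # 'a\n\nb'
```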

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor

extractor = JusTextExtractor()

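html = "<html><body><p>This is the main content of the page.</p></body></html>"
stop_words = frozenset(["the", "is", "of"])

# Returns a list of paragraph strings, or None if the HTML cannot be parsed.
# This block is shorter than length_low (70), so the result may well be empty.
paragraphs = extractor.extract_text(html, stop_words, "ENGLISH")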

Custom Parameters

from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor

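# Mostly stricter thresholds: longer blocks, a higher stopwords_high, and a
# lower link density are required. Note that lowering stopwords_low to 0.25
# makes the near-good class slightly more permissive.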
extractor = JusTextExtractor(
    length_low=100,
    length_high=300,
    stopwords_low=0.25,
    stopwords_high=0.40,
    max_link_density=0.1,
)

Disabling Boilerplate Check

from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor

# Keep all paragraphs regardless of boilerplate classification
extractor = JusTextExtractor(is_boilerplate=False)
