Implementation:NVIDIA NeMo Curator JusText Extractor

Knowledge Sources	NVIDIA NeMo Curator
Domains	Text Extraction, Boilerplate Removal, HTML Processing, NLP
Last Updated	2026-02-14 00:00 GMT

Overview

JusTextExtractor implements HTML text extraction using the jusText boilerplate removal algorithm, which classifies HTML text blocks as content or boilerplate based on text length, stopword density, and link density.
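As a rough illustration of two of the features named above, the sketch below computes stopword density and link density for a single text block. The function names are hypothetical and the whitespace tokenization is an assumption; neither jusText nor NeMo Curator is claimed to expose these helpers.

```python
def stopword_density(text: str, stop_words: frozenset[str]) -> float:
    """Fraction of whitespace-separated words that are in the stop word set."""
    words = text.split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word.lower() in stop_words)
    return hits / len(words)


def link_density(text_length: int, chars_in_links: int) -> float:
    """Fraction of a block's characters that sit inside link (<a>) elements."""
    if text_length == 0:
        return 0.0
    return chars_in_links / text_length


density = stopword_density(
    "This is the main content of the page.",
    frozenset(["the", "is", "of"]),
)
```

Whitespace splitting is exactly what breaks down for languages written without spaces, which is why stopword density is a poor signal there.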

Description

The JusTextExtractor class extends HTMLExtractorAlgorithm and is the default HTML extraction algorithm used by CommonCrawlHTMLExtractor. It wraps the jusText library to remove boilerplate from HTML pages.

The jusText algorithm operates in several stages:

1. Segmentation: The HTML document is split into text blocks based on HTML tags that define separate sections (e.g., <div>, <p>, <table>).
2. Preprocessing: Contents of <header>, <style>, and <script> tags are removed. Certain elements (e.g., <select>, copyright symbols) are immediately classified as boilerplate.
3. Context-Free Classification: Each block is classified as:
   - Bad (boilerplate) if it has high link density
   - Short if it is too small to classify reliably
   - Near-Good if it has moderate stopword density
   - Good (main content) if it is long and contains many stopwords
4. Context-Sensitive Classification: Short and near-good blocks are reclassified based on surrounding blocks, under the assumption that content and boilerplate tend to cluster together.
5. Headings Processing: Heading elements (e.g., <h1>, <h2>) are treated separately to preserve useful headings near content.
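The two classification stages can be sketched in a few lines of Python. This is a simplified illustration, not the jusText or NeMo Curator source: the function names are hypothetical, the default thresholds are copied from the constructor signature on this page, and special cases (copyright symbols, <select> elements, headings, and the finer rules for short blocks with mixed neighbours) are left out.

```python
from __future__ import annotations


def classify_context_free(
    length: int,
    stopword_density: float,
    link_density: float,
    length_low: int = 70,
    length_high: int = 200,
    stopwords_low: float = 0.30,
    stopwords_high: float = 0.32,
    max_link_density: float = 0.2,
) -> str:
    """Label one block from its own features only (simplified)."""
    if link_density > max_link_density:
        return "bad"  # mostly link text
    if length < length_low:
        return "short"  # too little text to decide reliably
    if stopword_density >= stopwords_high:
        # Many stopwords: good only if the block is also long.
        return "good" if length > length_high else "neargood"
    if stopword_density >= stopwords_low:
        return "neargood"  # moderate stopword density
    return "bad"


def _nearest_decided(labels: list[str], i: int, step: int) -> str:
    """Nearest 'good' or 'bad' label from i in direction step; page edges count as 'bad'."""
    j = i + step
    while 0 <= j < len(labels):
        if labels[j] in ("good", "bad"):
            return labels[j]
        j += step
    return "bad"


def revise(labels: list[str]) -> list[str]:
    """Resolve 'short' and 'neargood' labels from neighbouring blocks (simplified)."""
    out = list(labels)
    # Short blocks: decide all of them first, then apply the updates together.
    updates = {}
    for i, label in enumerate(out):
        if label == "short":
            both_good = (
                _nearest_decided(out, i, -1) == "good"
                and _nearest_decided(out, i, 1) == "good"
            )
            updates[i] = "good" if both_good else "bad"
    for i, label in updates.items():
        out[i] = label
    # Near-good blocks: kept unless there is boilerplate on both sides.
    for i, label in enumerate(out):
        if label == "neargood":
            both_bad = (
                _nearest_decided(out, i, -1) == "bad"
                and _nearest_decided(out, i, 1) == "bad"
            )
            out[i] = "bad" if both_bad else "good"
    return out
```

For example, revise(["good", "short", "good"]) promotes the short block because it sits between two content blocks, while an isolated near-good block with nothing around it is dropped because page edges are treated as boilerplate.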

For languages written without spaces between words (Thai, Chinese, Japanese, Korean), boilerplate filtering is disabled automatically when is_boilerplate is left at None, because stopword density is unreliable for these languages. A class-level set (_logged_languages) records which languages have already triggered a warning, so each language is logged only once instead of flooding the log.
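The language-dependent default and the once-per-language warning could look roughly like the sketch below. BoilerplatePolicy, should_filter, and NON_SPACED_LANGUAGES are hypothetical names chosen for illustration, and the uppercase language names are an assumption based on the "ENGLISH" convention used for the language argument; the actual NeMo Curator code may be organized differently.

```python
from __future__ import annotations

import logging
from typing import ClassVar

logger = logging.getLogger(__name__)

# Assumed spelling of the language names; only the four languages come from this page.
NON_SPACED_LANGUAGES = frozenset({"THAI", "CHINESE", "JAPANESE", "KOREAN"})


class BoilerplatePolicy:
    """Hypothetical helper that mirrors the behaviour described above."""

    # Shared by all instances so a language is only warned about once per process.
    _logged_languages: ClassVar[set[str]] = set()

    def __init__(self, is_boilerplate: bool | None = None) -> None:
        self.is_boilerplate = is_boilerplate

    def should_filter(self, language: str) -> bool:
        """Return True if boilerplate paragraphs should be dropped for this language."""
        if self.is_boilerplate is not None:
            return self.is_boilerplate  # an explicit setting always wins
        if language in NON_SPACED_LANGUAGES:
            if language not in self._logged_languages:
                logger.warning("Boilerplate check disabled for %s", language)
                self._logged_languages.add(language)
            return False
        return True
```

Keeping the set on the class rather than on the instance means that many extractor instances created in the same process share one record of emitted warnings.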

Usage

Use this class when you need high-quality boilerplate removal from HTML pages, especially for building linguistic resources such as web corpora. It is particularly well-suited for preserving text containing full sentences. This is the default algorithm used by CommonCrawlHTMLExtractor.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/html_extractors/justext.py
Lines: 1-141

Signature

class JusTextExtractor(HTMLExtractorAlgorithm):
    _logged_languages: ClassVar[set[str]] = set()

    def __init__(
        self,
        length_low: int = 70,
        length_high: int = 200,
        stopwords_low: float = 0.30,
        stopwords_high: float = 0.32,
        max_link_density: float = 0.2,
        max_heading_distance: int = 200,
        no_headings: bool = False,
        is_boilerplate: bool | None = None,
    ): ...

    def extract_text(
        self,
        html: str,
        stop_words: frozenset[str],
        language: str,
    ) -> list[str] | None: ...

Import

from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor

I/O Contract

Inputs

Name	Type	Required	Description
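length_low	int	No	Length in characters below which a block is classified as short during context-free classification. Defaults to 70
length_high	int	No	Length in characters above which a block with high stopword density is classified as good during context-free classification. Defaults to 200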
stopwords_low	float	No	Lower stopword density threshold. Defaults to 0.30
stopwords_high	float	No	Upper stopword density threshold. Defaults to 0.32
max_link_density	float	No	Maximum allowed link density in a text block. Defaults to 0.2
max_heading_distance	int	No	Maximum distance, in characters, between a heading and the following good block for the heading to be kept during context-sensitive classification. Defaults to 200
no_headings	bool	No	If True, ignores headings during extraction. Defaults to False
is_boilerplate	bool or None	No	Controls boilerplate filtering. True filters boilerplate, False keeps all paragraphs. None (default) auto-selects based on language

The extract_text method accepts:

Name	Type	Required	Description
html	str	Yes	Decoded HTML content string
stop_words	frozenset[str]	Yes	Language-specific stop word set
language	str	Yes	Detected language name (uppercase, e.g., "ENGLISH")

Outputs

Name	Type	Description
return value	list[str] or None	List of extracted text paragraphs (non-boilerplate), or None if the HTML cannot be parsed
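Because the return value can be None (unparseable HTML) or an empty list (every paragraph filtered out), callers may want to keep the two cases apart. The helper below is a minimal sketch of one way to do that; paragraphs_to_document is a hypothetical name and is not part of NeMo Curator.

```python
from __future__ import annotations


def paragraphs_to_document(
    paragraphs: list[str] | None,
    separator: str = "\n\n",
) -> str | None:
    """Join extracted paragraphs into one string.

    Returns None when extraction failed, and an empty string when the page
    parsed but no paragraph survived filtering.
    """
    if paragraphs is None:
        return None
    cleaned = [p.strip() for p in paragraphs if p.strip()]
    return separator.join(cleaned)
```

A pipeline could then skip records that yield None and, separately, count records that yield an empty string as pages consisting entirely of boilerplate.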

Usage Examples

Basic Usage

from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor

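# Default thresholds; is_boilerplate=None selects filtering based on the language
extractor = JusTextExtractor()

html = "<html><body><p>This is the main content of the page.</p></body></html>"
stop_words = frozenset(["the", "is", "of"])

# Returns a list of paragraph strings, or None if the HTML cannot be parsed.
# This paragraph is shorter than length_low (70 characters), so with the
# default thresholds it may be classified as short and filtered out.
paragraphs = extractor.extract_text(html, stop_words, "ENGLISH")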

Custom Parameters

from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor

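# Stricter thresholds for block length, upper stopword density, and link density.
# Note: lowering stopwords_low to 0.25 loosens that one check.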
extractor = JusTextExtractor(
    length_low=100,
    length_high=300,
    stopwords_low=0.25,
    stopwords_high=0.40,
    max_link_density=0.1,
)

Disabling Boilerplate Check

from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor

# Keep all paragraphs regardless of boilerplate classification
extractor = JusTextExtractor(is_boilerplate=False)

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_CommonCrawl_Extractor - Uses this as the default HTML extraction algorithm
NVIDIA_NeMo_Curator_Resiliparse_Extractor - Alternative fast extraction algorithm
NVIDIA_NeMo_Curator_Trafilatura_Extractor - Alternative high-quality extraction algorithm
