Implementation:NVIDIA NeMo Curator JusText Extractor
| Knowledge Sources | |
|---|---|
| Domains | Text Extraction, Boilerplate Removal, HTML Processing, NLP |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
JusTextExtractor implements HTML text extraction using the jusText boilerplate removal algorithm, which classifies HTML text blocks as content or boilerplate based on text length, stopword density, and link density.
Description
The JusTextExtractor class extends HTMLExtractorAlgorithm and is the default HTML extraction algorithm used by CommonCrawlHTMLExtractor. It wraps the jusText library to remove boilerplate such as navigation links, headers, and footers from HTML pages.
The jusText algorithm operates in several stages:
- Segmentation: The HTML document is split into text blocks based on HTML tags that define separate sections (e.g., <div>, <p>, <table>).
- Preprocessing: Contents of <header>, <style>, and <script> tags are removed. Certain elements (e.g., <select>, copyright symbols) are immediately classified as boilerplate.
- Context-Free Classification: Each block is classified as:
  - Bad (boilerplate) if it has high link density
  - Short if it is too small to classify reliably
  - Near-Good if it has moderate stopword density
  - Good (main content) if it is long and contains many stopwords
- Context-Sensitive Classification: Short and near-good blocks are reclassified based on surrounding blocks, under the assumption that content and boilerplate tend to cluster together.
- Headings Processing: Heading elements (e.g., <h1>, <h2>) are treated separately to preserve useful headings near content.
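The context-free stage can be illustrated with a minimal, standard-library-only sketch. This is a simplification written for this page, not the jusText or NeMo Curator source: the Block type and all function names are hypothetical, block length is assumed to be measured in characters, and the threshold defaults are copied from the JusTextExtractor signature.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: its text and how many characters sit inside links."""

    text: str
    chars_in_links: int = 0


def stopword_density(text: str, stop_words: frozenset[str]) -> float:
    """Fraction of whitespace-separated words that are stop words."""
    words = text.split()
    if not words:
        return 0.0
    return sum(word.lower() in stop_words for word in words) / len(words)


def link_density(block: Block) -> float:
    """Fraction of the block's characters that belong to link text."""
    return block.chars_in_links / len(block.text) if block.text else 0.0


def classify_context_free(
    block: Block,
    stop_words: frozenset[str],
    length_low: int = 70,
    length_high: int = 200,
    stopwords_low: float = 0.30,
    stopwords_high: float = 0.32,
    max_link_density: float = 0.2,
) -> str:
    """Return 'bad', 'short', 'near-good', or 'good' for a single block."""
    if link_density(block) > max_link_density:
        return "bad"  # mostly links: navigation, menus, footers
    if len(block.text) < length_low:
        return "short"  # too small to judge without context
    density = stopword_density(block.text, stop_words)
    if density >= stopwords_high:
        # Many stop words: good only if the block is also long
        return "good" if len(block.text) > length_high else "near-good"
    if density >= stopwords_low:
        return "near-good"
    return "bad"  # long enough, but reads like a keyword list
```

In this sketch, a block labeled short or near-good is not yet accepted or rejected; the context-sensitive stage described above decides its final label from the neighboring blocks.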
For non-spaced languages (Thai, Chinese, Japanese, Korean), the boilerplate check is automatically disabled since stopword density metrics are unreliable for these languages. A class-level set tracks which languages have already been warned about to avoid log spam.
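The auto-selection and warn-once behavior described above could look roughly like the following sketch. BoilerplatePolicy, should_filter, and NON_SPACED_LANGUAGES are hypothetical names introduced for illustration, and the warning text is invented; only the _logged_languages class attribute and the is_boilerplate semantics come from this page.

```python
from __future__ import annotations

import logging
from typing import ClassVar

logger = logging.getLogger(__name__)

# Languages the text lists as non-spaced; uppercase is assumed to match
# the format of the `language` argument (e.g., "ENGLISH").
NON_SPACED_LANGUAGES = frozenset({"THAI", "CHINESE", "JAPANESE", "KOREAN"})


class BoilerplatePolicy:
    """Hypothetical helper, not part of NeMo Curator."""

    # Shared across instances so each language is warned about only once
    _logged_languages: ClassVar[set[str]] = set()

    def __init__(self, is_boilerplate: bool | None = None) -> None:
        self.is_boilerplate = is_boilerplate

    def should_filter(self, language: str) -> bool:
        """Return True if boilerplate paragraphs should be dropped."""
        if self.is_boilerplate is not None:
            return self.is_boilerplate  # an explicit setting always wins
        if language in NON_SPACED_LANGUAGES:
            if language not in self._logged_languages:
                logger.warning(
                    "Disabling boilerplate check for non-spaced language %s", language
                )
                self._logged_languages.add(language)
            return False
        return True
```

Keeping the set at class level rather than per instance means that creating many extractor instances, as a parallel pipeline might, still produces at most one warning per language in each process.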
Usage
Use this class when you need high-quality boilerplate removal from HTML pages, especially for building linguistic resources such as web corpora. It is particularly well-suited for preserving text containing full sentences. This is the default algorithm used by CommonCrawlHTMLExtractor.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/html_extractors/justext.py
- Lines: 1-141
Signature
class JusTextExtractor(HTMLExtractorAlgorithm):
    _logged_languages: ClassVar[set[str]] = set()

    def __init__(
        self,
        length_low: int = 70,
        length_high: int = 200,
        stopwords_low: float = 0.30,
        stopwords_high: float = 0.32,
        max_link_density: float = 0.2,
        max_heading_distance: int = 200,
        no_headings: bool = False,
        is_boilerplate: bool | None = None,
    ): ...

    def extract_text(
        self,
        html: str,
        stop_words: frozenset[str],
        language: str,
    ) -> list[str] | None: ...
Import
from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
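| length_low | int | No | Lower text length threshold for context-free classification; blocks shorter than this are treated as short. Defaults to 70 |
| length_high | int | No | Upper text length threshold for context-free classification; a block must be longer than this (and rich in stopwords) to be classified as good. Defaults to 200 |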
| stopwords_low | float | No | Lower stopword density threshold. Defaults to 0.30 |
| stopwords_high | float | No | Upper stopword density threshold. Defaults to 0.32 |
| max_link_density | float | No | Maximum allowed link density in a text block. Defaults to 0.2 |
| max_heading_distance | int | No | Maximum distance from a heading to consider for context-sensitive classification. Defaults to 200 |
| no_headings | bool | No | If True, ignores headings during extraction. Defaults to False |
| is_boilerplate | bool or None | No | Controls boilerplate filtering. True filters boilerplate, False keeps all paragraphs. None (default) auto-selects based on language |
The extract_text method accepts:
| Name | Type | Required | Description |
|---|---|---|---|
| html | str | Yes | Decoded HTML content string |
| stop_words | frozenset[str] | Yes | Language-specific stop word set |
| language | str | Yes | Detected language name (uppercase, e.g., "ENGLISH") |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | list[str] or None | List of extracted text paragraphs (non-boilerplate), or None if the HTML cannot be parsed |
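Because extract_text may return None, callers need to handle that case before using the result. The helper below is a minimal sketch of one way to do so; paragraphs_to_document is a hypothetical name, and the blank-line separator is an assumption rather than how NeMo Curator joins paragraphs.

```python
from __future__ import annotations


def paragraphs_to_document(
    paragraphs: list[str] | None,
    separator: str = "\n\n",
) -> str | None:
    """Join extracted paragraphs into one string, or return None if nothing usable."""
    if paragraphs is None:
        return None  # the HTML could not be parsed
    # Drop empty or whitespace-only paragraphs and trim the rest
    cleaned = [p.strip() for p in paragraphs if p and p.strip()]
    return separator.join(cleaned) if cleaned else None
```

Returning None for both an unparseable page and a page with no surviving paragraphs lets a downstream step skip the record with a single check; a caller that needs to distinguish the two cases would test the extractor's return value directly.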
Usage Examples
Basic Usage
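from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor

# Use the default thresholds
extractor = JusTextExtractor()
html = "<html><body><p>This is the main content of the page.</p></body></html>"
# Tiny stop word set for illustration; real pipelines pass a full language-specific set
stop_words = frozenset(["the", "is", "of"])
# Returns a list of paragraphs, or None if the HTML cannot be parsed.
# The paragraph here is shorter than length_low (70), so it may be dropped as short.
paragraphs = extractor.extract_text(html, stop_words, "ENGLISH")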
Custom Parameters
from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor
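# Mostly stricter thresholds: blocks must be longer and contain fewer links to be kept.
# Note that lowering stopwords_low admits more near-good blocks.
extractor = JusTextExtractor(
    length_low=100,
    length_high=300,
    stopwords_low=0.25,
    stopwords_high=0.40,
    max_link_density=0.1,
)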
Disabling Boilerplate Check
from nemo_curator.stages.text.download.html_extractors.justext import JusTextExtractor
# Keep all paragraphs regardless of boilerplate classification
extractor = JusTextExtractor(is_boilerplate=False)
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_CommonCrawl_Extractor - Uses this as the default HTML extraction algorithm
- NVIDIA_NeMo_Curator_Resiliparse_Extractor - Alternative fast extraction algorithm
- NVIDIA_NeMo_Curator_Trafilatura_Extractor - Alternative high-quality extraction algorithm