Implementation:Huggingface Datatrove ReadabilityInscriptis

From Leeroopedia
Knowledge Sources
Domains Text Extraction, Data Processing
Last Updated 2026-02-14 17:00 GMT

Overview

Extracts clean text from HTML documents using a two-stage pipeline that combines the readability library for content identification with inscriptis for HTML-to-text conversion.

Description

ReadabilityInscriptis is a concrete text extractor that extends BaseExtractor to provide HTML-to-text conversion using two complementary libraries. The first stage uses readability-lxml (a fork maintained by Hugging Face with performance optimizations) to identify the main content of an HTML page, stripping navigation bars, sidebars, advertisements, and other boilerplate elements. The second stage uses inscriptis with a strict CSS profile to render the cleaned HTML as plain text, preserving meaningful formatting like paragraph breaks while removing visual-only markup.

The extractor applies a final cleanup step that collapses excessive consecutive newlines down to a configurable maximum (default: 2). This handles cases where the HTML-to-text conversion produces overly sparse output due to empty block elements.
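The newline cleanup described above can be sketched with a single regular expression. This is a minimal illustration using only the standard library, not the extractor's actual code; the helper name collapse_newlines is hypothetical.

```python
import re


def collapse_newlines(text: str, max_new_lines: int = 2) -> str:
    """Collapse runs longer than `max_new_lines` newlines down to exactly that many."""
    keep = "\n" * max_new_lines
    # Match the allowed run followed by at least one extra newline.
    excessive = re.compile("(?:" + keep + ")\n+")
    return excessive.sub(keep, text)


sample = "Title\n\n\n\n\nFirst paragraph.\n\nSecond paragraph.\n\n\nEnd"
print(repr(collapse_newlines(sample)))
```

Runs at or below the limit are left untouched, so ordinary paragraph breaks survive while gaps left by empty block elements shrink to the configured maximum.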

The constructor configures two readability thresholds. min_text_length (default: 25) is the minimum character count for a text block to be considered content. min_text_score (default: 20) is the threshold for the sum of sqrt(block_length - min_text_length) across all text blocks. Together these parameters control how aggressively readability filters out short or sparse content. The default extraction timeout is 0.1 seconds per document, reflecting the typically fast execution of these lightweight libraries.
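To make the scoring rule concrete, the sketch below computes the sum of sqrt(block_length - min_text_length) for a list of block lengths. It rests on two assumptions that the text above does not state: blocks shorter than min_text_length contribute nothing, and the total must be strictly greater than min_text_score. The function name text_score is hypothetical, and the real logic lives inside readability.

```python
import math


def text_score(block_lengths, min_text_length=25, min_text_score=20):
    """Return (total score, whether the total clears the threshold)."""
    total = sum(
        math.sqrt(length - min_text_length)
        for length in block_lengths
        # Assumption: blocks below the minimum length are skipped entirely.
        if length >= min_text_length
    )
    # Assumption: the comparison against the threshold is strict.
    return total, total > min_text_score


# Two substantial blocks plus one short fragment: 10 + 20 = 30
print(text_score([125, 425, 10]))
# Two short-ish blocks: 5 + 6 = 11, below the default threshold of 20
print(text_score([50, 61]))
```

Because of the square root, one long block scores less than several medium blocks of the same combined length, so raising min_text_score mostly penalizes pages with only a little text.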

The class declares its dependencies via _requires_dependencies: the inscriptis package and a Hugging Face fork of readability-lxml with performance improvements.
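The entries in _requires_dependencies are either a bare module name or a (module name, pip install target) pair. How datatrove checks them is not shown here; the sketch below is one plausible way to report missing entries using importlib, with the hypothetical helper name missing_dependencies.

```python
from importlib.util import find_spec


def missing_dependencies(requirements):
    """Return the pip install targets for requirements whose module is not importable."""
    missing = []
    for req in requirements:
        # A bare string names both the module and the pip package;
        # a tuple is (module name, pip install target).
        module, pip_target = (req, req) if isinstance(req, str) else req
        if find_spec(module) is None:
            missing.append(pip_target)
    return missing


requirements = [
    "inscriptis",
    ("readability", "readability-lxml @ git+https://github.com/huggingface/python-readability.git@speedup"),
]
# Prints the install targets that are absent from the current environment.
print(missing_dependencies(requirements))
```

The tuple form matters for readability because the importable module name (readability) differs from the install target, which points at the Hugging Face fork.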

Usage

Use ReadabilityInscriptis as an alternative to Trafilatura for HTML-to-text extraction. It is well-suited for article-style web pages where the main content is clearly distinguishable from boilerplate. It tends to be faster than Trafilatura but may produce less refined output for complex page layouts.

Code Reference

Source Location

Signature

class ReadabilityInscriptis(BaseExtractor):
    _requires_dependencies = [
        "inscriptis",
        ("readability", "readability-lxml @ git+https://github.com/huggingface/python-readability.git@speedup"),
    ]

    def __init__(
        self,
        max_new_lines: int = 2,
        min_text_length: int = 25,
        min_text_score: int = 20,
        timeout: float = 0.1,
    ): ...

    def extract(self, text: str) -> str: ...

Import

from datatrove.pipeline.extractors.modular import ReadabilityInscriptis

I/O Contract

Inputs

Name | Type | Required | Description
max_new_lines | int | No | Maximum consecutive newlines to keep in output (default: 2)
min_text_length | int | No | Minimum character length for a text block to be considered content (default: 25)
min_text_score | int | No | Minimum aggregate score for text blocks; documents below this are considered empty (default: 20)
timeout | float | No | Extraction timeout per document in seconds (default: 0.1)
text | str | Yes (extract) | Raw HTML string to extract text from

Outputs

Name | Type | Description
extracted text | str | Clean plain text extracted from the HTML, with excessive newlines collapsed

Usage Examples

Basic Usage

from datatrove.pipeline.extractors.modular import ReadabilityInscriptis
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.readers.warc import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

# Basic pipeline: read WARC files, extract text, write JSONL
executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader("input/warc_files/"),
        ReadabilityInscriptis(
            max_new_lines=2,
            min_text_length=25,
            min_text_score=20,
            timeout=0.2,  # raised from the 0.1 s default
        ),
        JsonlWriter("output/extracted/"),
    ],
    tasks=8,
    workers=4,
    logging_dir="logs/extraction",
)
stats = executor.run()
