Implementation:Huggingface Datatrove ReadabilityInscriptis

Knowledge Sources	Huggingface_Datatrove
Domains	Text Extraction, Data Processing
Last Updated	2026-02-14 17:00 GMT

Overview

Extracts clean text from HTML documents using a two-stage pipeline that combines the readability library for content identification with inscriptis for HTML-to-text conversion.

Description

ReadabilityInscriptis is a concrete text extractor that extends BaseExtractor to provide HTML-to-text conversion using two complementary libraries. The first stage uses readability-lxml (a fork maintained by Hugging Face with performance optimizations) to identify the main content of an HTML page, stripping navigation bars, sidebars, advertisements, and other boilerplate elements. The second stage uses inscriptis with a strict CSS profile to render the cleaned HTML as plain text, preserving meaningful formatting like paragraph breaks while removing visual-only markup.
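As a rough illustration of the two stages, the sketch below calls the two libraries directly. It assumes readability-lxml and inscriptis are installed. The function name html_to_text is hypothetical, the trailing strip() is an illustrative choice, and the sketch is not the extractor's source code.

```python
def html_to_text(html: str) -> str:
    """Two-stage sketch: isolate the main content, then render it as text."""
    # Imports are local so this module loads even without the optional packages.
    from inscriptis import get_text
    from inscriptis.css_profiles import CSS_PROFILES
    from inscriptis.model.config import ParserConfig
    from readability import Document

    # Stage 1: readability keeps the main content and drops boilerplate.
    main_html = Document(html).summary(html_partial=True)

    # Stage 2: inscriptis renders the remaining HTML with the strict CSS profile.
    config = ParserConfig(css=CSS_PROFILES["strict"])
    return get_text(main_html, config).strip()
```

Calling html_to_text(raw_html) on an article-style page should return the article body as plain text. Within a Datatrove pipeline the extractor class itself is the intended entry point.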

The extractor applies a final cleanup step that collapses excessive consecutive newlines down to a configurable maximum (default: 2). This handles cases where the HTML-to-text conversion produces overly sparse output due to empty block elements.
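The cleanup step can be approximated with a single regular expression. This is a minimal sketch of the idea, not the extractor's actual code: the helper name collapse_newlines is hypothetical, and it only matches runs of bare newline characters, so lines containing spaces or tabs are left alone.

```python
import re


def collapse_newlines(text: str, max_new_lines: int = 2) -> str:
    """Replace any run of more than max_new_lines newlines with exactly max_new_lines."""
    # Match max_new_lines + 1 or more consecutive newline characters.
    pattern = re.compile(r"\n{%d,}" % (max_new_lines + 1))
    return pattern.sub("\n" * max_new_lines, text)


sparse = "Title\n\n\n\n\nFirst paragraph.\n\nSecond paragraph."
print(repr(collapse_newlines(sparse)))
```

With the default of 2, paragraph breaks (one blank line) survive while longer gaps shrink to a single blank line.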

The constructor configures readability's min_text_length (minimum character count for a text block to be considered content, default: 25) and min_text_score (threshold for the sum of sqrt(block_length - min_text_length) across all text blocks, default: 20). These parameters control how aggressively readability filters out short or sparse content. The default extraction timeout is 0.1 seconds, reflecting the typically fast execution of these lightweight libraries.
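To make the scoring rule concrete, the sketch below computes the sum described above for a list of text-block lengths. It is an illustration of the formula as stated here, not readability's implementation: the function name text_score is hypothetical, and skipping blocks shorter than min_text_length is an assumption based on the parameter's description.

```python
import math


def text_score(block_lengths, min_text_length=25):
    """Sum sqrt(length - min_text_length) over blocks that are long enough."""
    return sum(
        math.sqrt(length - min_text_length)
        for length in block_lengths
        if length >= min_text_length  # assumed: shorter blocks contribute nothing
    )


# Three blocks of 125, 50 and 10 characters: sqrt(100) + sqrt(25) = 15.
score = text_score([125, 50, 10])
print(score, score < 20)
```

Under these assumptions, a page whose only substantial blocks are 125 and 50 characters long scores 15, which falls short of the default min_text_score of 20. Raising min_text_length or min_text_score makes the filter stricter.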

The class declares its dependencies via _requires_dependencies: the inscriptis package and the readability module, which is installed from the speedup branch of Hugging Face's python-readability fork of readability-lxml.

Usage

Use ReadabilityInscriptis as an alternative to Trafilatura for HTML-to-text extraction. It is well-suited for article-style web pages where the main content is clearly distinguishable from boilerplate. It tends to be faster than Trafilatura but may produce less refined output for complex page layouts.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/extractors/modular.py
Lines: 1-51

Signature

class ReadabilityInscriptis(BaseExtractor):
    _requires_dependencies = [
        "inscriptis",
        ("readability", "readability-lxml @ git+https://github.com/huggingface/python-readability.git@speedup"),
    ]

    def __init__(
        self,
        max_new_lines: int = 2,
        min_text_length: int = 25,
        min_text_score: int = 20,
        timeout: float = 0.1,
    ): ...

    def extract(self, text: str) -> str: ...

Import

from datatrove.pipeline.extractors.modular import ReadabilityInscriptis

I/O Contract

Inputs

Name	Type	Required	Description
max_new_lines	int	No	Maximum consecutive newlines to keep in output (default: 2)
min_text_length	int	No	Minimum character length for a text block to be considered content (default: 25)
min_text_score	int	No	Minimum aggregate score for text blocks; documents below this are considered empty (default: 20)
timeout	float	No	Extraction timeout per document in seconds (default: 0.1)
text	str	Yes (extract)	Raw HTML string to extract text from

Outputs

Name	Type	Description
extracted text	str	Clean plain text extracted from the HTML, with excessive newlines collapsed

Usage Examples

Basic Usage

from datatrove.pipeline.extractors.modular import ReadabilityInscriptis
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.readers.warc import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

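# Basic pipeline: read WARC files, extract text, write JSONL
executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader("input/warc_files/"),
        ReadabilityInscriptis(
            max_new_lines=2,     # default: keep at most 2 consecutive newlines
            min_text_length=25,  # default
            min_text_score=20,   # default
            timeout=0.2,         # twice the 0.1 s default
        ),
        JsonlWriter("output/extracted/"),
    ],
    tasks=8,    # total number of tasks the job is split into
    workers=4,  # number of tasks run concurrently
    logging_dir="logs/extraction",
)
stats = executor.run()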

Related Pages

Principle:Huggingface_Datatrove_HTML_Text_Extraction
