Implementation:Huggingface Datatrove ReadabilityInscriptis
| Knowledge Sources | |
|---|---|
| Domains | Text Extraction, Data Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Extracts clean text from HTML documents using a two-stage pipeline that combines the readability library for content identification with inscriptis for HTML-to-text conversion.
Description
ReadabilityInscriptis is a concrete text extractor that extends BaseExtractor to provide HTML-to-text conversion using two complementary libraries. The first stage uses readability-lxml (a fork maintained by Hugging Face with performance optimizations) to identify the main content of an HTML page, stripping navigation bars, sidebars, advertisements, and other boilerplate elements. The second stage uses inscriptis with a strict CSS profile to render the cleaned HTML as plain text, preserving meaningful formatting like paragraph breaks while removing visual-only markup.
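The two stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the extractor's actual code: the function name `two_stage_extract` is hypothetical, only the `min_text_length` option is passed to readability, and the optional packages are imported inside the function so the snippet loads even when they are not installed.

```python
def two_stage_extract(html: str, min_text_length: int = 25) -> str:
    """Hypothetical helper: isolate the main content, then render it as text."""
    # Lazy imports: both packages are optional third-party dependencies.
    from inscriptis import get_text
    from inscriptis.css_profiles import CSS_PROFILES
    from inscriptis.model.config import ParserConfig
    from readability import Document

    # Stage 1: readability keeps the main content and drops boilerplate.
    # html_partial=True returns a fragment rather than a full <html> document.
    main_html = Document(html, min_text_length=min_text_length).summary(html_partial=True)

    # Stage 2: inscriptis renders the cleaned HTML with its "strict" CSS profile.
    config = ParserConfig(css=CSS_PROFILES["strict"])
    return get_text(main_html, config)


# Example call (requires inscriptis and readability-lxml to be installed):
# plain_text = two_stage_extract(open("page.html", encoding="utf-8").read())
```

Keeping the stages separate makes it easy to inspect the intermediate HTML when the extracted text looks wrong.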
The extractor applies a final cleanup step that collapses excessive consecutive newlines down to a configurable maximum (default: 2). This handles cases where the HTML-to-text conversion produces overly sparse output due to empty block elements.
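As a rough sketch of this cleanup step, the snippet below collapses any run of more than `max_new_lines` newlines. The helper name `collapse_newlines` and the exact regular expression are assumptions made for illustration; the extractor's own pattern may, for example, treat whitespace-only lines differently.

```python
import re


def collapse_newlines(text: str, max_new_lines: int = 2) -> str:
    """Hypothetical helper: cap runs of consecutive newlines at max_new_lines."""
    # Match runs that are strictly longer than the allowed maximum.
    pattern = re.compile(r"\n{" + str(max_new_lines + 1) + r",}")
    return pattern.sub("\n" * max_new_lines, text)


sample = "Title\n\n\n\n\nFirst paragraph.\n\nSecond paragraph.\n\n\nEnd."
print(repr(collapse_newlines(sample)))
# → 'Title\n\nFirst paragraph.\n\nSecond paragraph.\n\nEnd.'
```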
The constructor configures readability's min_text_length (minimum character count for a text block to be considered content, default: 25) and min_text_score (threshold for the sum of sqrt(block_length - min_text_length) across all text blocks, default: 20). These parameters control how aggressively readability filters out short or sparse content. The default extraction timeout is 0.1 seconds, reflecting the typically fast execution of these lightweight libraries.
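To make the scoring rule concrete, here is a small worked example of the sum described above. It is a sketch under assumptions: the names `text_score` and `passes_threshold` are hypothetical, blocks no longer than `min_text_length` are assumed to contribute nothing, and the comparison against `min_text_score` is assumed to be inclusive.

```python
import math


def text_score(block_lengths, min_text_length=25):
    """Hypothetical helper: sum of sqrt(block_length - min_text_length)."""
    # Assumption: blocks at or below the minimum length contribute nothing.
    return sum(
        math.sqrt(length - min_text_length)
        for length in block_lengths
        if length > min_text_length
    )


def passes_threshold(block_lengths, min_text_length=25, min_text_score=20):
    # Assumption: a document passes when its score reaches the threshold.
    return text_score(block_lengths, min_text_length) >= min_text_score


# sqrt(125 - 25) + sqrt(50 - 25) = 10 + 5 = 15, below the default threshold of 20
print(text_score([125, 50, 10]), passes_threshold([125, 50, 10]))
# sqrt(425 - 25) + sqrt(125 - 25) = 20 + 10 = 30, above the threshold
print(text_score([425, 125]), passes_threshold([425, 125]))
```

Because of the square root, one long block scores less than several medium blocks of the same total length, so raising `min_text_score` mainly removes pages with only a little substantial text.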
The class declares its dependencies via _requires_dependencies: the inscriptis package and the readability module, which is installed as readability-lxml from the speedup branch of Hugging Face's python-readability fork.
Usage
Use ReadabilityInscriptis as an alternative to Trafilatura for HTML-to-text extraction. It is well-suited for article-style web pages where the main content is clearly distinguishable from boilerplate. It tends to be faster than Trafilatura but may produce less refined output for complex page layouts.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/extractors/modular.py
- Lines: 1-51
Signature
class ReadabilityInscriptis(BaseExtractor):
    _requires_dependencies = [
        "inscriptis",
        ("readability", "readability-lxml @ git+https://github.com/huggingface/python-readability.git@speedup"),
    ]

    def __init__(
        self,
        max_new_lines: int = 2,
        min_text_length: int = 25,
        min_text_score: int = 20,
        timeout: float = 0.1,
    ): ...

    def extract(self, text: str) -> str: ...
Import
from datatrove.pipeline.extractors.modular import ReadabilityInscriptis
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| max_new_lines | int | No | Maximum consecutive newlines to keep in output (default: 2) |
| min_text_length | int | No | Minimum character length for a text block to be considered content (default: 25) |
| min_text_score | int | No | Minimum aggregate score for text blocks; documents below this are considered empty (default: 20) |
| timeout | float | No | Extraction timeout per document in seconds (default: 0.1) |
| text | str | Yes (extract) | Raw HTML string to extract text from |
Outputs
| Name | Type | Description |
|---|---|---|
| extracted text | str | Clean plain text extracted from the HTML, with excessive newlines collapsed |
Usage Examples
Basic Usage
from datatrove.pipeline.extractors.modular import ReadabilityInscriptis
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.readers.warc import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

# Basic pipeline: read WARC files, extract text, write JSONL
executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader("input/warc_files/"),
        ReadabilityInscriptis(
            max_new_lines=2,
            min_text_length=25,
            min_text_score=20,
            timeout=0.2,
        ),
        JsonlWriter("output/extracted/"),
    ],
    tasks=8,
    workers=4,
    logging_dir="logs/extraction",
)
stats = executor.run()