Implementation: Huggingface Datatrove Trafilatura
| Knowledge Sources | |
|---|---|
| Domains | Text_Extraction, NLP |
| Type | Wrapper Doc (wraps external trafilatura library) |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Wrapper around the Trafilatura library for HTML text extraction in the datatrove pipeline. This implementation provides a thin integration layer that passes HTML content to the trafilatura.extract() function and returns the resulting plain text, while the parent class BaseExtractor handles sandboxed process isolation, timeout management, and pipeline iteration.
Description
The Trafilatura class extends BaseExtractor and delegates the actual HTML-to-text conversion to the external trafilatura Python package. The class configures Trafilatura with sensible defaults for web-scale data processing:
- favour_precision (default: True) -- Prefers less text but more accurate extraction, reducing noise in the output corpus.
- deduplicate (default: True) -- Uses Trafilatura's built-in deduplication to remove repeated text blocks within a single document.
- include_comments -- Always set to False, stripping user-generated comment sections.
- timeout (default: 1 second) -- Per-document extraction time limit enforced by the parent class's ExtractorSandbox.
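To make the option mapping concrete, here is an illustrative stub (not the datatrove source) showing how the constructor options plausibly translate into keyword arguments for trafilatura.extract(). The helper name build_extract_kwargs is hypothetical; note that trafilatura's own keyword uses the American spelling favor_precision, while the datatrove parameter is spelled favour_precision.

```python
# Hypothetical helper illustrating the parameter mapping; the real
# delegation happens inside Trafilatura.extract() in datatrove.
def build_extract_kwargs(favour_precision=True, deduplicate=True, **kwargs):
    """Assemble keyword arguments for trafilatura.extract().

    Note the spelling shift: trafilatura's keyword is 'favor_precision',
    and comments are unconditionally excluded.
    """
    return {
        "favor_precision": favour_precision,  # prefer accuracy over recall
        "include_comments": False,            # always stripped
        "deduplicate": deduplicate,           # drop repeated text blocks
        **kwargs,                             # extra options pass through
    }
```

Any additional **kwargs (e.g. trafilatura's include_tables) simply flow through unchanged, which matches the I/O contract below.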
The trafilatura dependency is lazily imported inside the extract() method, so it is only required at runtime, not at import time.
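The deferred-import pattern described above can be sketched as follows. The empty-string fallback on ImportError is an assumption for illustration only; datatrove's own dependency handling (via _requires_dependencies) may surface a clearer error instead.

```python
# Minimal sketch of the lazy-import pattern: the optional dependency is
# resolved inside extract(), so importing this module never requires
# trafilatura to be installed.
class LazyTrafilaturaExtractor:
    def extract(self, text: str) -> str:
        try:
            import trafilatura  # imported only when extract() first runs
        except ImportError:
            # Illustrative fallback; real code would raise a helpful error.
            return ""
        return trafilatura.extract(text) or ""
```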
Usage
Use Trafilatura as the extraction step in a datatrove pipeline, typically immediately after reading raw HTML from WARC archives and before applying content or language filters.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/extractors/trafilatura.py
- Lines: 4-55
Signature
class Trafilatura(BaseExtractor):
    name = "Trafilatura"
    _requires_dependencies = ["trafilatura"]

    def __init__(
        self,
        favour_precision: bool = True,
        include_images: bool = False,
        timeout: float = 1,
        deduplicate: bool = True,
        **kwargs,
    ):
        ...

    def extract(self, text: str) -> str:
        ...
Import
from datatrove.pipeline.extractors import Trafilatura
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| favour_precision | bool | No (default: True) | Prefer less text but correct extraction |
| include_images | bool | No (default: False) | Include image references (not currently implemented; raises NotImplementedError) |
| timeout | float | No (default: 1) | Per-document extraction timeout in seconds, enforced by the ExtractorSandbox |
| deduplicate | bool | No (default: True) | Enable Trafilatura's internal text deduplication |
| **kwargs | dict | No | Additional keyword arguments passed directly to trafilatura.extract() |
Pipeline Input: A Document object with raw HTML content in its .text field.
Outputs
| Name | Type | Description |
|---|---|---|
| Document.text | str | Plain text extracted from the HTML, with boilerplate and markup removed |
Documents that produce empty text after extraction or that time out are dropped from the pipeline (not yielded to downstream steps). Statistics are tracked for: extracted, timeout, broken_process, clean_error.
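The drop-and-count behaviour described above can be sketched as a simple generator step. Document here is a stand-in dataclass and the counter names mirror the statistics listed above; this is not the datatrove implementation, which additionally runs extraction in a sandboxed process with timeouts.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Document:
    text: str

def run_extraction(docs, extract):
    """Yield only documents whose extraction produced non-empty text."""
    stats = Counter()
    for doc in docs:
        try:
            new_text = extract(doc.text)
        except Exception:
            stats["clean_error"] += 1
            continue  # extraction failed: document dropped
        if not new_text:
            continue  # empty result: dropped, not yielded downstream
        stats["extracted"] += 1
        doc.text = new_text
        yield doc

# A fake extractor stands in for trafilatura for illustration.
docs = [Document("<p>keep</p>"), Document("")]
kept = list(run_extraction(docs, lambda html: "extracted text" if "<p>" in html else ""))
```

Downstream steps therefore never see empty documents, which keeps filters such as LanguageFilter from operating on degenerate inputs.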
Usage Examples
Basic Pipeline Usage
from datatrove.pipeline.extractors import Trafilatura
# Default settings: favour_precision=True, timeout=1s, deduplicate=True
extractor = Trafilatura()
Custom Configuration
from datatrove.pipeline.extractors import Trafilatura
# Longer timeout for complex pages, with recall-favoring extraction
extractor = Trafilatura(
    favour_precision=False,
    timeout=5.0,
    deduplicate=True,
)
Full Pipeline Example
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import LanguageFilter, URLFilter
from datatrove.pipeline.readers import WarcReader
pipeline = LocalPipelineExecutor(
    pipeline=[
        WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2024-10/"),
        Trafilatura(favour_precision=True, timeout=1),
        URLFilter(),
        LanguageFilter(languages=["en"], language_threshold=0.65),
    ],
    tasks=100,
)
pipeline.run()
Related Pages
- Huggingface_Datatrove_HTML_Text_Extraction (principle) -- The principle this implementation realizes
- Huggingface_Datatrove_URLFilter (downstream filter) -- URL-based filtering applied after extraction
- Huggingface_Datatrove_LanguageFilter (downstream filter) -- Language identification applied to extracted text