
Implementation:Huggingface Datatrove Trafilatura

From Leeroopedia
Domains: Text_Extraction, NLP
Type: Wrapper Doc (wraps the external trafilatura library)
Last Updated: 2026-02-14 00:00 GMT

Overview

Wrapper around the Trafilatura library for HTML text extraction in the datatrove pipeline. This implementation provides a thin integration layer that passes HTML content to the trafilatura.extract() function and returns the resulting plain text, while the parent class BaseExtractor handles sandboxed process isolation, timeout management, and pipeline iteration.

Description

The Trafilatura class extends BaseExtractor and delegates the actual HTML-to-text conversion to the external trafilatura Python package. The class configures Trafilatura with sensible defaults for web-scale data processing:

  • favour_precision (default: True) -- Prefers less text but more accurate extraction, reducing noise in the output corpus.
  • deduplicate (default: True) -- Uses Trafilatura's built-in deduplication to remove repeated text blocks within a single document.
  • include_comments -- Always set to False, stripping user-generated comment sections.
  • timeout (default: 1 second) -- Per-document extraction time limit enforced by the parent class's ExtractorSandbox.

The trafilatura dependency is lazily imported inside the extract() method, so it is only required at runtime, not at import time.
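The lazy-import pattern can be sketched as follows. This is an illustrative simplification rather than the exact datatrove source; it assumes trafilatura's extract() accepts the favor_precision, include_comments, and deduplicate keyword arguments (note that trafilatura itself uses the American spelling favor_precision, while datatrove's constructor uses favour_precision):

```python
def extract_text(html: str, favour_precision: bool = True,
                 deduplicate: bool = True, **kwargs) -> str:
    # The import lives inside the function body: trafilatura is only
    # needed when extraction actually runs, not when this module loads.
    from trafilatura import extract

    return extract(
        html,
        favor_precision=favour_precision,  # trafilatura's own spelling
        include_comments=False,            # comments are always stripped
        deduplicate=deduplicate,
        **kwargs,
    )
```

Because the import is deferred, defining (or importing a module containing) extract_text succeeds even on a machine without trafilatura installed; an ImportError surfaces only on the first call.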

Usage

Use Trafilatura as the extraction step in a datatrove pipeline, typically immediately after reading raw HTML from WARC archives and before applying content or language filters.

Code Reference

Source Location

Signature

class Trafilatura(BaseExtractor):
    name = "Trafilatura"
    _requires_dependencies = ["trafilatura"]

    def __init__(
        self,
        favour_precision: bool = True,
        include_images: bool = False,
        timeout: float = 1,
        deduplicate: bool = True,
        **kwargs,
    ):
        ...

    def extract(self, text: str) -> str:
        ...

Import

from datatrove.pipeline.extractors import Trafilatura

I/O Contract

Inputs

Name | Type | Required | Description
favour_precision | bool | No (default: True) | Prefer less text in exchange for more accurate extraction
include_images | bool | No (default: False) | Include image references (not currently implemented; raises NotImplementedError)
timeout | float | No (default: 1) | Per-document extraction timeout in seconds, enforced by the ExtractorSandbox
**kwargs** is replaced below
deduplicate | bool | No (default: True) | Enable Trafilatura's internal text deduplication
**kwargs | dict | No | Additional keyword arguments passed directly to trafilatura.extract()

Pipeline Input: A Document object with raw HTML content in its .text field.

Outputs

Name | Type | Description
Document.text | str | Plain text extracted from the HTML, with boilerplate and markup removed

Documents that produce empty text after extraction or that time out are dropped from the pipeline (not yielded to downstream steps). Statistics are tracked for: extracted, timeout, broken_process, clean_error.
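The dropping behaviour described above can be illustrated with a simplified stand-in for the extractor's run loop. This is hypothetical code: it uses plain dicts in place of datatrove Document objects and a Counter in place of the pipeline's real statistics machinery, but it mirrors the contract that timed-out, failing, or empty-result documents are not yielded downstream:

```python
from collections import Counter

def run_extraction(extract_fn, documents):
    # Simplified sketch of the extractor's run loop: documents whose
    # extraction times out, raises, or returns empty text are dropped.
    stats = Counter()
    for doc in documents:
        try:
            text = extract_fn(doc["text"])
        except TimeoutError:
            stats["timeout"] += 1
            continue  # dropped: extraction exceeded the time limit
        except Exception:
            stats["clean_error"] += 1
            continue  # dropped: extraction failed
        if not text:
            continue  # dropped: empty extraction result
        doc["text"] = text
        stats["extracted"] += 1
        yield doc  # only successfully extracted documents flow on
```

For example, feeding this loop one document with content and one whose extraction comes back empty yields only the first document downstream.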

Usage Examples

Basic Pipeline Usage

from datatrove.pipeline.extractors import Trafilatura

# Default settings: favour_precision=True, timeout=1s, deduplicate=True
extractor = Trafilatura()

Custom Configuration

from datatrove.pipeline.extractors import Trafilatura

# Longer timeout for complex pages, with recall-favoring extraction
extractor = Trafilatura(
    favour_precision=False,
    timeout=5.0,
    deduplicate=True,
)

Full Pipeline Example

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import LanguageFilter, URLFilter
from datatrove.pipeline.readers import WarcReader

pipeline = LocalPipelineExecutor(
    pipeline=[
        WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2024-10/"),
        Trafilatura(favour_precision=True, timeout=1),
        URLFilter(),
        LanguageFilter(languages=["en"], language_threshold=0.65),
    ],
    tasks=100,
)
pipeline.run()

Related Pages

Principle:Huggingface_Datatrove_HTML_Text_Extraction
