Implementation: Huggingface Datatrove Trafilatura
| Knowledge Sources | |
|---|---|
| Domains | Text_Extraction, NLP |
| Type | Wrapper Doc (wraps external trafilatura library) |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Wrapper around the Trafilatura library for HTML text extraction in the datatrove pipeline. This implementation provides a thin integration layer that passes HTML content to the trafilatura.extract() function and returns the resulting plain text, while the parent class BaseExtractor handles sandboxed process isolation, timeout management, and pipeline iteration.
Description
The Trafilatura class extends BaseExtractor and delegates the actual HTML-to-text conversion to the external trafilatura Python package. The class configures Trafilatura with sensible defaults for web-scale data processing:
- favour_precision (default: True) -- Prefers less text but more accurate extraction, reducing noise in the output corpus.
- deduplicate (default: True) -- Uses Trafilatura's built-in deduplication to remove repeated text blocks within a single document.
- include_comments -- Always set to False, stripping user-generated comment sections.
- timeout (default: 1 second) -- Per-document extraction time limit enforced by the parent class's ExtractorSandbox.
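To make the option mapping concrete, here is an illustrative stub (not the datatrove source) showing how the constructor options plausibly translate into keyword arguments for trafilatura.extract(). The helper name build_extract_kwargs is hypothetical; note that trafilatura's own keyword uses the American spelling favor_precision, while the datatrove parameter is spelled favour_precision.

```python
# Hypothetical helper illustrating the parameter mapping; the real
# delegation happens inside Trafilatura.extract() in datatrove.
def build_extract_kwargs(favour_precision=True, deduplicate=True, **kwargs):
    """Assemble keyword arguments for trafilatura.extract().

    Note the spelling shift: trafilatura's keyword is 'favor_precision',
    and comments are unconditionally excluded.
    """
    return {
        "favor_precision": favour_precision,  # prefer accuracy over recall
        "include_comments": False,            # always stripped
        "deduplicate": deduplicate,           # drop repeated text blocks
        **kwargs,                             # extra options pass through
    }
```

Any additional **kwargs (e.g. trafilatura's include_tables) simply flow through unchanged, which matches the I/O contract below.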
The trafilatura dependency is lazily imported inside the extract() method, so it is only required at runtime, not at import time.
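The deferred-import pattern described above can be sketched as follows. The empty-string fallback on ImportError is an assumption for illustration only; datatrove's own dependency handling (via _requires_dependencies) may surface a clearer error instead.

```python
# Minimal sketch of the lazy-import pattern: the optional dependency is
# resolved inside extract(), so importing this module never requires
# trafilatura to be installed.
class LazyTrafilaturaExtractor:
    def extract(self, text: str) -> str:
        try:
            import trafilatura  # imported only when extract() first runs
        except ImportError:
            # Illustrative fallback; real code would raise a helpful error.
            return ""
        return trafilatura.extract(text) or ""
```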
Usage
Use Trafilatura as the extraction step in a datatrove pipeline, typically immediately after reading raw HTML from WARC archives and before applying content or language filters.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/extractors/trafilatura.py
- Lines: 4-55
Signature
class Trafilatura(BaseExtractor):
    name = "Trafilatura"
    _requires_dependencies = ["trafilatura"]

    def __init__(
        self,
        favour_precision: bool = True,
        include_images: bool = False,
        timeout: float = 1,
        deduplicate: bool = True,
        **kwargs,
    ):
        ...

    def extract(self, text: str) -> str:
        ...
Import
from datatrove.pipeline.extractors import Trafilatura
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| favour_precision | bool | No (default: True) | Prefer less text but correct extraction |
| include_images | bool | No (default: False) | Include image references (not currently implemented; raises NotImplementedError) |
| timeout | float | No (default: 1) | Per-document extraction timeout in seconds, enforced by the ExtractorSandbox |
| deduplicate | bool | No (default: True) | Enable Trafilatura's internal text deduplication |
| **kwargs | dict | No | Additional keyword arguments passed directly to trafilatura.extract() |
Pipeline Input: A Document object with raw HTML content in its .text field.
Outputs
| Name | Type | Description |
|---|---|---|
| Document.text | str | Plain text extracted from the HTML, with boilerplate and markup removed |
Documents that produce empty text after extraction or that time out are dropped from the pipeline (not yielded to downstream steps). Statistics are tracked for: extracted, timeout, broken_process, clean_error.
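The drop-and-count behaviour described above can be sketched as a simple generator step. Document here is a stand-in dataclass and the counter names mirror the statistics listed above; this is not the datatrove implementation, which additionally runs extraction in a sandboxed process with timeouts.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Document:
    text: str

def run_extraction(docs, extract):
    """Yield only documents whose extraction produced non-empty text."""
    stats = Counter()
    for doc in docs:
        try:
            new_text = extract(doc.text)
        except Exception:
            stats["clean_error"] += 1
            continue  # extraction failed: document dropped
        if not new_text:
            continue  # empty result: dropped, not yielded downstream
        stats["extracted"] += 1
        doc.text = new_text
        yield doc

# A fake extractor stands in for trafilatura for illustration.
docs = [Document("<p>keep</p>"), Document("")]
kept = list(run_extraction(docs, lambda html: "extracted text" if "<p>" in html else ""))
```

Downstream steps therefore never see empty documents, which keeps filters such as LanguageFilter from operating on degenerate inputs.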
Usage Examples
Basic Pipeline Usage
from datatrove.pipeline.extractors import Trafilatura
# Default settings: favour_precision=True, timeout=1s, deduplicate=True
extractor = Trafilatura()
Custom Configuration
from datatrove.pipeline.extractors import Trafilatura
# Longer timeout for complex pages, with recall-favoring extraction
extractor = Trafilatura(
    favour_precision=False,
    timeout=5.0,
    deduplicate=True,
)
Full Pipeline Example
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import LanguageFilter, URLFilter
from datatrove.pipeline.readers import WarcReader
pipeline = LocalPipelineExecutor(
    pipeline=[
        WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2024-10/"),
        Trafilatura(favour_precision=True, timeout=1),
        URLFilter(),
        LanguageFilter(languages=["en"], language_threshold=0.65),
    ],
    tasks=100,
)
pipeline.run()
Related Pages
- Huggingface_Datatrove_HTML_Text_Extraction (principle) -- The principle this implementation realizes
- Huggingface_Datatrove_URLFilter (downstream filter) -- URL-based filtering applied after extraction
- Huggingface_Datatrove_LanguageFilter (downstream filter) -- Language identification applied to extracted text