Environment:Huggingface Datatrove Processing Dependencies
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
NLP processing dependency group providing language identification, text extraction, hashing, tokenization, and pattern matching for quality filtering and deduplication pipelines.
Description
This environment provides the packages needed for core text processing operations: language identification (fastText), HTML text extraction (trafilatura, inscriptis), word tokenization (nltk, tokenizers), text encoding fixes (ftfy), domain extraction (tldextract), efficient hashing (xxhash), pattern matching (pyahocorasick), file locking (fasteners), and regex support. It is required for all quality filtering, deduplication, and text extraction pipeline steps.
Usage
Use this environment when running quality filtering (GopherQualityFilter, GopherRepetitionFilter, C4QualityFilter, FineWebQualityFilter), language filtering (LanguageFilter), text extraction (Trafilatura), URL filtering (URLFilter), deduplication (MinHash, Bloom filter, exact dedup, sentence dedup), tokenization (DocumentTokenizer), and PII removal (PIIFormatter) pipelines.
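These pipeline steps compose as a sequence of document filters: each document flows through the chain and is dropped at the first filter it fails. A minimal stdlib sketch of that composition pattern (the `Doc`, `run_filters`, and toy filter names here are hypothetical stand-ins, not datatrove's actual API):

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

# Hypothetical stand-in for a pipeline document; datatrove's own
# document type similarly carries text plus a metadata dict.
@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)

def run_filters(docs: Iterable[Doc], filters: list[Callable[[Doc], bool]]) -> list[Doc]:
    """Keep only documents that pass every filter, applied in order."""
    kept = []
    for doc in docs:
        if all(f(doc) for f in filters):
            kept.append(doc)
    return kept

# Toy stand-ins for quality/language filters.
min_words = lambda d: len(d.text.split()) >= 3
ascii_only = lambda d: d.text.isascii()

docs = [Doc("short"), Doc("a perfectly fine sentence"), Doc("café au lait non-ascii")]
survivors = run_filters(docs, [min_words, ascii_only])
```

Real filters such as GopherQualityFilter work the same way conceptually, but compute their pass/fail decision from the tokenized text (hence the nltk/tokenizers dependencies above).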
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Some NLP libraries may have platform-specific wheels |
| Disk | ~500MB | FastText language model downloads on first use |
Dependencies
Python Packages
- `fasttext-numpy2-wheel` — Language identification via fastText (numpy 2.0 compatible wheel; the module still imports as `fasttext`)
- `nltk` — Natural language toolkit (punkt tokenizer required)
- `inscriptis` — HTML to text conversion
- `tldextract` — Domain/TLD extraction from URLs
- `trafilatura` >= 1.8.0, < 1.12.0 — Web content extraction (version range constrained)
- `tokenizers` — HuggingFace tokenizers library
- `ftfy` — Fix text encoding issues
- `fasteners` — File-based locking for concurrent access
- `regex` — Advanced regular expressions
- `xxhash` — Fast non-cryptographic hashing
- `pyahocorasick` — Aho-Corasick multi-pattern matching
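`pyahocorasick` exists in this list because filters need to scan text for many blocklist patterns in a single pass. A minimal stdlib stand-in for that kind of multi-pattern scan, using a compiled `re` alternation instead of an Aho-Corasick automaton (the blocklist and helper names are hypothetical):

```python
import re

# Hypothetical blocklist; pyahocorasick builds an automaton for this
# kind of scan, while this sketch compiles one alternation pattern.
patterns = ["casino", "lottery", "viagra"]
matcher = re.compile("|".join(map(re.escape, sorted(patterns, key=len, reverse=True))))

def find_all(text: str) -> list[tuple[int, str]]:
    """Return (start_offset, matched_pattern) pairs, left to right."""
    return [(m.start(), m.group()) for m in matcher.finditer(text)]

hits = find_all("online casino and lottery ads")
```

For large pattern sets, the automaton-based approach in pyahocorasick scales far better than regex alternation, which is why the real dependency is used rather than this stand-in.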
Data Downloads
- NLTK punkt tokenizer data must be downloaded: `python -m nltk.downloader punkt`
- FastText language identification model is downloaded automatically on first use
Credentials
No credentials are required for these processing dependencies.
Quick Install
# Install datatrove with processing dependencies
pip install "datatrove[processing]"
# Download required NLTK data
python -m nltk.downloader punkt
# Or install packages individually
pip install fasttext-numpy2-wheel nltk inscriptis tldextract "trafilatura>=1.8.0,<1.12.0" tokenizers ftfy fasteners regex xxhash pyahocorasick
Code Evidence
Processing dependency group from `pyproject.toml:51-63`:
processing = [
"fasttext-numpy2-wheel",
"nltk",
"inscriptis",
"tldextract",
"trafilatura>=1.8.0,<1.12.0",
"tokenizers",
"ftfy",
"fasteners",
"regex",
"xxhash",
"pyahocorasick",
]
FastText dependency check with pip name mapping from `src/datatrove/pipeline/filters/language_filter.py:11`:
_requires_dependencies = [("fasttext", "fasttext-numpy2-wheel"), "fasteners"]
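Entries in `_requires_dependencies` pair an import name with a pip distribution name when the two differ (the `fasttext` module is installed by the `fasttext-numpy2-wheel` distribution). A hedged sketch of how such a check can be implemented with the stdlib; this is an illustrative helper, not datatrove's actual code:

```python
from importlib.util import find_spec

def check_dependencies(deps, feature="this feature"):
    """Each dep is an import name, or an (import_name, pip_name) pair
    when the installable distribution name differs from the module name."""
    missing = []
    for dep in deps:
        import_name, pip_name = dep if isinstance(dep, tuple) else (dep, dep)
        if find_spec(import_name) is None:  # None means module not importable
            missing.append(pip_name)
    if missing:
        raise ImportError(f"Please install {' and '.join(missing)} to use {feature}")

# Mirrors the LanguageFilter declaration above: import "fasttext",
# but tell the user to install "fasttext-numpy2-wheel".
deps = [("fasttext", "fasttext-numpy2-wheel"), "fasteners"]
```

Reporting the pip name rather than the import name is what makes the error messages in the Common Errors table below directly actionable.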
Trafilatura version constraint rationale: the upper bound `<1.12.0` prevents breaking changes from newer trafilatura versions that may alter extraction behavior.
Kiwipiepy version pin comment from `pyproject.toml:74`:
"kiwipiepy<0.22.0", # korean - v0.22.0 introduced issues, pin to previous version
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Please install fasttext and fasteners to use Language ID` | fasttext or fasteners not installed | `pip install fasttext-numpy2-wheel fasteners` |
| `ImportError: Please install trafilatura to use Trafilatura` | trafilatura not installed | `pip install "trafilatura>=1.8.0,<1.12.0"` |
| `LookupError: Resource punkt not found` | NLTK punkt data not downloaded | `python -m nltk.downloader punkt` |
| `ImportError: Please install xxhash to use MinhashDedupSignature` | xxhash not installed | `pip install xxhash` |
Compatibility Notes
- trafilatura >= 1.8.0, < 1.12.0: The upper bound blocks newer releases whose extraction behavior may differ, which would silently change pipeline output.
- fasttext-numpy2-wheel: This is a special numpy 2.0-compatible wheel of fastText. Using the standard `fasttext` package may cause issues with numpy >= 2.0.
- kiwipiepy < 0.22.0: Korean tokenizer pinned below 0.22.0 due to regressions introduced in that version (multilingual group).
Related Pages
- Implementation:Huggingface_Datatrove_Trafilatura
- Implementation:Huggingface_Datatrove_URLFilter
- Implementation:Huggingface_Datatrove_LanguageFilter
- Implementation:Huggingface_Datatrove_GopherRepetitionFilter
- Implementation:Huggingface_Datatrove_GopherQualityFilter
- Implementation:Huggingface_Datatrove_C4QualityFilter
- Implementation:Huggingface_Datatrove_FineWebQualityFilter
- Implementation:Huggingface_Datatrove_MinhashDedupSignature
- Implementation:Huggingface_Datatrove_MinhashDedupBuckets
- Implementation:Huggingface_Datatrove_MinhashDedupCluster
- Implementation:Huggingface_Datatrove_MinhashDedupFilter
- Implementation:Huggingface_Datatrove_PIIFormatter
- Implementation:Huggingface_Datatrove_DocumentTokenizer