Environment:Huggingface Datatrove Processing Dependencies

From Leeroopedia
Knowledge Sources
Domains: NLP, Data_Processing
Last Updated: 2026-02-14 17:00 GMT

Overview

NLP processing dependency group providing language identification, text extraction, hashing, tokenization, and pattern matching for quality filtering and deduplication pipelines.

Description

This environment provides the packages needed for core text processing operations: language identification (fastText), HTML text extraction (trafilatura, inscriptis), word tokenization (nltk, tokenizers), text encoding fixes (ftfy), domain extraction (tldextract), efficient hashing (xxhash), pattern matching (pyahocorasick), file locking (fasteners), and advanced regular expressions (regex). It is required for all quality filtering, deduplication, and text extraction pipeline steps.

Usage

Use this environment when running quality filtering (GopherQualityFilter, GopherRepetitionFilter, C4QualityFilter, FineWebQualityFilter), language filtering (LanguageFilter), text extraction (Trafilatura), URL filtering (URLFilter), deduplication (MinHash, Bloom filter, exact dedup, sentence dedup), tokenization (DocumentTokenizer), and PII removal (PIIFormatter) pipelines.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (Ubuntu recommended) | Some NLP libraries may have platform-specific wheels |
| Disk | ~500MB | FastText language model downloads on first use |

Dependencies

Python Packages

  • `fasttext-numpy2-wheel` — Language identification (numpy 2.0 compatible wheel)
  • `nltk` — Natural language toolkit (punkt tokenizer required)
  • `inscriptis` — HTML to text conversion
  • `tldextract` — Domain/TLD extraction from URLs
  • `trafilatura` >= 1.8.0, < 1.12.0 — Web content extraction (version range constrained)
  • `tokenizers` — HuggingFace tokenizers library
  • `ftfy` — Fix text encoding issues
  • `fasteners` — File-based locking for concurrent access
  • `regex` — Advanced regular expressions
  • `xxhash` — Fast non-cryptographic hashing
  • `pyahocorasick` — Aho-Corasick multi-pattern matching

Data Downloads

  • NLTK punkt tokenizer data must be downloaded: `python -m nltk.downloader punkt`
  • FastText language identification model is downloaded automatically on first use

Credentials

No specific credentials required for processing dependencies.

Quick Install

# Install datatrove with processing dependencies
pip install "datatrove[processing]"

# Download required NLTK data
python -m nltk.downloader punkt

# Or install packages individually
pip install fasttext-numpy2-wheel nltk inscriptis tldextract "trafilatura>=1.8.0,<1.12.0" tokenizers ftfy fasteners regex xxhash pyahocorasick
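
After installing, it can be useful to confirm that every package in the group is importable. The following is a minimal sketch, not part of datatrove: the `PROCESSING_IMPORTS` mapping and the `missing_packages` helper are hypothetical names, and the import names are assumptions based on each package's usual top-level module (for example, `pyahocorasick` is imported as `ahocorasick`).

```python
from importlib.util import find_spec

# Assumed mapping of pip package name -> top-level import name.
# Hypothetical helper data; not taken from datatrove itself.
PROCESSING_IMPORTS = {
    "fasttext-numpy2-wheel": "fasttext",
    "nltk": "nltk",
    "inscriptis": "inscriptis",
    "tldextract": "tldextract",
    "trafilatura": "trafilatura",
    "tokenizers": "tokenizers",
    "ftfy": "ftfy",
    "fasteners": "fasteners",
    "regex": "regex",
    "xxhash": "xxhash",
    "pyahocorasick": "ahocorasick",
}


def missing_packages(imports=PROCESSING_IMPORTS):
    """Return the sorted pip names whose import name cannot be located."""
    return sorted(
        pip_name
        for pip_name, module_name in imports.items()
        if find_spec(module_name) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All processing dependencies are importable.")
```

`find_spec` only locates a module without importing it, so the check stays fast and does not trigger model or data downloads.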

Code Evidence

Processing dependency group from `pyproject.toml:51-63`:

processing = [
  "fasttext-numpy2-wheel",
  "nltk",
  "inscriptis",
  "tldextract",
  "trafilatura>=1.8.0,<1.12.0",
  "tokenizers",
  "ftfy",
  "fasteners",
  "regex",
  "xxhash",
  "pyahocorasick",
]

FastText dependency check with pip name mapping from `src/datatrove/pipeline/filters/language_filter.py:11`:

_requires_dependencies = [("fasttext", "fasttext-numpy2-wheel"), "fasteners"]
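
The list above mixes two shapes: a bare string when the import name and the pip name match, and an `(import_name, pip_name)` tuple when they differ. As an illustration of how such a list could be consumed, here is a hedged sketch. The `check_dependencies` function and its error message are hypothetical and are not datatrove's actual implementation.

```python
from importlib.util import find_spec


def check_dependencies(step_name, requirements):
    """Raise ImportError naming the pip packages that are not importable.

    Each requirement is either a string (import name == pip name) or an
    (import_name, pip_name) tuple. Hypothetical helper for illustration.
    """
    missing = []
    for req in requirements:
        import_name, pip_name = req if isinstance(req, tuple) else (req, req)
        if find_spec(import_name) is None:
            missing.append(pip_name)
    if missing:
        raise ImportError(
            f"{step_name} needs: {', '.join(missing)} "
            f"(try `pip install {' '.join(missing)}`)"
        )


if __name__ == "__main__":
    try:
        check_dependencies(
            "LanguageFilter",
            [("fasttext", "fasttext-numpy2-wheel"), "fasteners"],
        )
        print("LanguageFilter dependencies are available.")
    except ImportError as err:
        print(err)
```

Keeping the pip name alongside the import name means the error can suggest the right install command even when the two differ, as with `fasttext` and `fasttext-numpy2-wheel`.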

Rationale for the trafilatura version constraint: the upper bound `<1.12.0` guards against changes in newer trafilatura releases that may alter extraction behavior.
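
To check whether an installed trafilatura falls inside the range above, versions have to be compared numerically rather than as strings, because `"1.12.0" < "1.8.0"` is true for plain string comparison. The sketch below uses only the standard library. `release_tuple` and `in_supported_range` are hypothetical helper names, and pre-release suffixes are ignored for simplicity; `packaging.version` is the more robust choice when it is available.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string, width=3):
    """Parse the leading numeric release, e.g. '1.11.0' -> (1, 11, 0).

    Short versions are zero-padded to `width` parts so '1.8' equals '1.8.0'.
    """
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"Cannot parse version: {version_string!r}")
    parts = [int(p) for p in match.group().split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def in_supported_range(version_string, lower="1.8.0", upper="1.12.0"):
    """Return True if lower <= version < upper (numeric comparison)."""
    current = release_tuple(version_string)
    return release_tuple(lower) <= current < release_tuple(upper)


if __name__ == "__main__":
    try:
        installed = version("trafilatura")
    except PackageNotFoundError:
        print("trafilatura is not installed")
    else:
        status = "supported" if in_supported_range(installed) else "outside range"
        print(f"trafilatura {installed}: {status}")
```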

Kiwipiepy version pin comment from `pyproject.toml:74` (this pin belongs to the multilingual group, not the processing group):

"kiwipiepy<0.22.0", # korean - v0.22.0 introduced issues, pin to previous version

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `ImportError: Please install fasttext and fasteners to use Language ID` | fasttext or fasteners not installed | `pip install fasttext-numpy2-wheel fasteners` |
| `ImportError: Please install trafilatura to use Trafilatura` | trafilatura not installed | `pip install "trafilatura>=1.8.0,<1.12.0"` |
| `LookupError: Resource punkt not found` | NLTK punkt data not downloaded | `python -m nltk.downloader punkt` |
| `ImportError: Please install xxhash to use MinhashDedupSignature` | xxhash not installed | `pip install xxhash` |

Compatibility Notes

  • trafilatura >= 1.8.0, < 1.12.0: The range is capped because versions >= 1.12.0 may change extraction behavior.
  • fasttext-numpy2-wheel: This is a special numpy 2.0-compatible wheel of fastText. Using the standard `fasttext` package may cause issues with numpy >= 2.0.
  • kiwipiepy < 0.22.0: Korean tokenizer pinned below 0.22.0 due to regressions introduced in that version (multilingual group).
