Environment:Huggingface Datatrove Python Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Data_Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Python 3.10+ runtime environment with core dependencies for running the Datatrove data processing pipeline.
Description
This environment provides the base Python runtime and the core dependencies required by all Datatrove operations: serialization (`dill`), filesystem abstraction (`fsspec`), HuggingFace Hub integration (`huggingface-hub`), numerical computing (`numpy` 2.0+), logging (`loguru`), multiprocessing support (`multiprocess`), human-readable formatting (`humanize`), and progress bars (`tqdm`). This is the minimum environment needed to define and execute any pipeline.
Usage
Use this environment for any Datatrove operation. All pipeline steps, executors, and utilities depend on this base runtime. It is the prerequisite for every workflow including Common Crawl Processing, MinHash Deduplication, FineWeb Dataset Creation, Dataset Tokenization, Synthetic Data Generation, and Summary Statistics.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Tested on Ubuntu; macOS and Windows may work but are not CI-tested |
| Python | >= 3.10.0 | Tested with 3.10, 3.11, 3.12 |
| Disk | Varies by dataset | Large-scale web data processing requires significant storage (TB-scale for Common Crawl) |
Dependencies
Core Python Packages
- `dill` >= 0.3.0
- `fsspec` >= 2023.12.2
- `huggingface-hub` >= 0.32.0
- `humanize`
- `loguru` >= 0.7.0
- `multiprocess`
- `numpy` >= 2.0.0
- `tqdm`
Credentials
No credentials are required for the base runtime. However, individual pipeline steps may require:
- `HF_TOKEN`: HuggingFace API token (for accessing gated datasets or models via `huggingface-hub`)
Quick Install
# Install base datatrove package
pip install datatrove
# Or install only the core dependencies (specifiers are quoted so the shell does not treat ">" as a redirect)
pip install "dill>=0.3.0" "fsspec>=2023.12.2" "huggingface-hub>=0.32.0" humanize "loguru>=0.7.0" multiprocess "numpy>=2.0.0" tqdm
Code Evidence
Python version requirement from `pyproject.toml:23`:
requires-python = ">=3.10.0"
Core dependencies from `pyproject.toml:24-33`:
dependencies = [
    "dill>=0.3.0",
    "fsspec>=2023.12.2",
    "huggingface-hub>=0.32.0",  # this version switches from hf-transfer to hf-xet
    "humanize",
    "loguru>=0.7.0",
    "multiprocess",
    "numpy>=2.0.0",
    "tqdm",
]
Dependency checking mechanism from `src/datatrove/utils/_import_utils.py:10-33`:
def check_required_dependencies(step_name: str, required_dependencies: list[str] | list[tuple[str, str]]):
    missing_dependencies: dict[str, str] = {}
    for dependency in required_dependencies:
        dependency = dependency if isinstance(dependency, tuple) else (dependency, dependency)
        package_name, pip_name = dependency
        if not _is_package_available(package_name):
            missing_dependencies[package_name] = pip_name
        if not _is_distribution_available(pip_name):
            missing_dependencies[package_name] = pip_name
    if missing_dependencies:
        _raise_error_for_missing_dependencies(step_name, missing_dependencies)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| Installation fails on the `requires-python = ">=3.10.0"` constraint | Python version below 3.10 | Upgrade to Python 3.10, 3.11, or 3.12 |
| `ImportError: Please install X to use Y` | Missing optional dependency | Run `pip install "datatrove[group]"`, where `group` matches the needed feature (io, processing, inference, etc.); the quotes keep shells such as zsh from interpreting the brackets |
| `huggingface-hub` version error | Old huggingface-hub version | Upgrade with `pip install "huggingface-hub>=0.32.0"` (v0.32.0 switched from hf-transfer to hf-xet) |
Compatibility Notes
- Python 3.10+: Required due to use of `X | Y` type union syntax and other 3.10 features.
- numpy 2.0+: Required. The ruff linting config enforces `NPY201` (numpy 2.0 compatibility).
- CI Tested: Python 3.10, 3.11, 3.12 are tested in GitHub Actions CI.
Related Pages
- Implementation:Huggingface_Datatrove_WarcReader
- Implementation:Huggingface_Datatrove_JsonlReader
- Implementation:Huggingface_Datatrove_HuggingFaceDatasetReader
- Implementation:Huggingface_Datatrove_JsonlWriter
- Implementation:Huggingface_Datatrove_ParquetWriter
- Implementation:Huggingface_Datatrove_DocumentTokenizer
- Implementation:Huggingface_Datatrove_DocumentTokenizerMerger
- Implementation:Huggingface_Datatrove_TokensCounter
- Implementation:Huggingface_Datatrove_StatsMerger
- Implementation:Huggingface_Datatrove_SamplerFilter
- Implementation:Huggingface_Datatrove_WordStats
- Implementation:Huggingface_Datatrove_LineStats
- Implementation:Huggingface_Datatrove_DocStats