Environment:Huggingface Datatrove Python Runtime

From Leeroopedia
Knowledge Sources
Domains: Infrastructure, Data_Processing
Last Updated: 2026-02-14 17:00 GMT

Overview

Python 3.10+ runtime environment with core dependencies for running the Datatrove data processing pipeline.

Description

This environment provides the base Python runtime and the core dependencies required by all Datatrove operations: serialization (`dill`), filesystem abstraction (`fsspec`), HuggingFace Hub integration (`huggingface-hub`), numerical computing (`numpy` 2.0+), logging (`loguru`), multiprocessing support (`multiprocess`), human-readable formatting (`humanize`), and progress bars (`tqdm`). It is the minimum environment needed to define and execute any pipeline.

Usage

Use this environment for any Datatrove operation. All pipeline steps, executors, and utilities depend on this base runtime. It is the prerequisite for every workflow, including Common Crawl Processing, MinHash Deduplication, FineWeb Dataset Creation, Dataset Tokenization, Synthetic Data Generation, and Summary Statistics.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (Ubuntu recommended) | Tested on Ubuntu; macOS and Windows may work but are not CI-tested |
| Python | >= 3.10.0 | Tested with 3.10, 3.11, 3.12 |
| Disk | Varies by dataset | Large-scale web data processing requires significant storage (TB-scale for Common Crawl) |
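The Python requirement in the table can be checked before installing anything. The following is a minimal sketch, not part of Datatrove itself; the names `MIN_PYTHON` and `python_is_supported` are hypothetical and chosen only for illustration.

```python
import sys

# Minimum version taken from the requirements table (Python >= 3.10.0).
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


# Example use at the top of a launcher script:
#     if not python_is_supported():
#         raise SystemExit("Python 3.10 or newer is required")
```

Comparing only the major and minor components keeps the check independent of patch releases, which matches how the requirement is stated.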

Dependencies

Core Python Packages

  • `dill` >= 0.3.0
  • `fsspec` >= 2023.12.2
  • `huggingface-hub` >= 0.32.0
  • `humanize`
  • `loguru` >= 0.7.0
  • `multiprocess`
  • `numpy` >= 2.0.0
  • `tqdm`
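One way to confirm that an existing environment satisfies the minimums listed above is to query installed package metadata. This is a rough sketch using only the standard library; `CORE_REQUIREMENTS`, `version_tuple`, and `find_problems` are hypothetical names, and the version parser is deliberately naive (it keeps only leading numeric components and ignores pre-release ordering), so a dedicated version-parsing library would be a better choice for anything stricter.

```python
from importlib import metadata

# Minimum versions copied from the list above; None means no lower bound is stated.
CORE_REQUIREMENTS = {
    "dill": "0.3.0",
    "fsspec": "2023.12.2",
    "huggingface-hub": "0.32.0",
    "humanize": None,
    "loguru": "0.7.0",
    "multiprocess": None,
    "numpy": "2.0.0",
    "tqdm": None,
}


def version_tuple(version: str) -> tuple:
    """Naive parser: keep the leading numeric components of each dotted piece."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def find_problems(requirements=CORE_REQUIREMENTS, get_version=metadata.version) -> dict:
    """Map each missing or outdated package name to a short explanation."""
    problems = {}
    for name, minimum in requirements.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if minimum and version_tuple(installed) < version_tuple(minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems
```

Calling `find_problems()` with no arguments inspects the current environment; an empty dictionary means every listed package is present and meets its stated minimum.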

Credentials

No credentials are required for the base runtime. However, individual pipeline steps may require:

  • `HF_TOKEN`: HuggingFace API token (for accessing gated datasets or models via `huggingface-hub`)
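A pipeline that touches gated resources can fail late if the token is missing, so a pre-flight check may be useful. The sketch below only reads the environment variable named above; `read_hf_token` is a hypothetical helper, not a Datatrove or `huggingface-hub` function.

```python
import os
from typing import Mapping, Optional


def read_hf_token(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the HF_TOKEN value, or None when it is unset or blank."""
    token = environ.get("HF_TOKEN", "").strip()
    return token or None


# Example use before launching a step that needs gated access:
#     if read_hf_token() is None:
#         raise SystemExit("Set HF_TOKEN before running this pipeline")
```

Accepting the environment as a parameter keeps the helper easy to test without modifying the real process environment.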

Quick Install

# Install base datatrove package
pip install datatrove

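# Or install the core dependencies individually
# (quote each version specifier so the shell does not treat ">" as a redirect)
pip install "dill>=0.3.0" "fsspec>=2023.12.2" "huggingface-hub>=0.32.0" humanize "loguru>=0.7.0" multiprocess "numpy>=2.0.0" tqdm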

Code Evidence

Python version requirement from `pyproject.toml:23`:

requires-python = ">=3.10.0"

Core dependencies from `pyproject.toml:24-33`:

dependencies = [
  "dill>=0.3.0",
  "fsspec>=2023.12.2",
  "huggingface-hub>=0.32.0", # this version switches from hf-transfer to hf-xet
  "humanize",
  "loguru>=0.7.0",
  "multiprocess",
  "numpy>=2.0.0",
  "tqdm",
]

Dependency checking mechanism from `src/datatrove/utils/_import_utils.py:10-33`:

def check_required_dependencies(step_name: str, required_dependencies: list[str] | list[tuple[str, str]]):
    missing_dependencies: dict[str, str] = {}
    for dependency in required_dependencies:
        dependency = dependency if isinstance(dependency, tuple) else (dependency, dependency)
        package_name, pip_name = dependency
        if not _is_package_available(package_name):
            missing_dependencies[package_name] = pip_name
        if not _is_distribution_available(pip_name):
            missing_dependencies[package_name] = pip_name
    if missing_dependencies:
        _raise_error_for_missing_dependencies(step_name, missing_dependencies)
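The excerpt above calls helpers whose bodies are not shown. As an illustration of how such checks could be built with the standard library, here is a sketch with stand-in functions; `is_importable`, `is_distribution_installed`, and `collect_missing` are hypothetical names and are not claimed to match Datatrove's private helpers.

```python
import importlib.util
from importlib import metadata


def is_importable(package_name: str) -> bool:
    """True when the import system can locate the module (stand-in helper)."""
    try:
        return importlib.util.find_spec(package_name) is not None
    except (ImportError, ValueError):
        return False


def is_distribution_installed(pip_name: str) -> bool:
    """True when an installed distribution with this pip name exists (stand-in helper)."""
    try:
        metadata.distribution(pip_name)
    except metadata.PackageNotFoundError:
        return False
    return True


def collect_missing(required, importable=is_importable, installed=is_distribution_installed) -> dict:
    """Map import name -> pip name for every dependency that fails either check."""
    missing = {}
    for dependency in required:
        # A bare string means the import name and the pip name are the same.
        package_name, pip_name = dependency if isinstance(dependency, tuple) else (dependency, dependency)
        if not importable(package_name) or not installed(pip_name):
            missing[package_name] = pip_name
    return missing
```

The `(import name, pip name)` tuple form matters because the two often differ, for example a package imported under one name but published under another.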

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| Install or build fails on the Python version check (the package declares `requires-python = ">=3.10.0"`) | Python version below 3.10 | Upgrade to Python 3.10, 3.11, or 3.12 |
| `ImportError: Please install X to use Y` | Missing optional dependency | Run `pip install "datatrove[group]"` where group matches the needed feature (io, processing, inference, etc.) |
| `huggingface-hub` version error | Old `huggingface-hub` version | Upgrade with `pip install "huggingface-hub>=0.32.0"` (v0.32.0 switched from hf-transfer to hf-xet) |

Compatibility Notes

  • Python 3.10+: Required due to use of `X | Y` type union syntax and other 3.10 features.
  • numpy 2.0+: Required. The ruff linting config enforces `NPY201` (numpy 2.0 compatibility).
  • CI Tested: Python 3.10, 3.11, 3.12 are tested in GitHub Actions CI.
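The `X | Y` requirement noted above can be demonstrated directly: on interpreters older than 3.10, applying `|` to two classes raises `TypeError` when the expression is evaluated. The sketch below probes for that behavior; `supports_union_syntax` is a hypothetical name used only for this illustration.

```python
import sys


def supports_union_syntax() -> bool:
    """Return True when `int | str` evaluates without error at runtime."""
    try:
        int | str  # builds a union type on Python 3.10+
    except TypeError:
        return False
    return True


# On a supported interpreter, unions also work with isinstance:
#     isinstance(3, int | str)
```

Because annotations such as `list[str] | list[tuple[str, str]]` are evaluated when a function is defined (unless evaluation is deferred), a module using them fails at import time on older interpreters rather than at call time.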
