Environment: Data-Juicer Python Runtime Environment

Knowledge Sources	Data-Juicer pyproject.toml
Domains	Infrastructure, Data_Processing
Last Updated	2026-02-14 17:00 GMT

Overview

Python 3.10+ environment with core data processing dependencies including datasets, numpy, pandas, spacy, and multimedia libraries for text, image, audio, and video processing.

Description

This environment provides the base runtime for all Data-Juicer operations. It requires Python 3.10 or higher and includes core dependencies for data loading (HuggingFace datasets >= 2.19.0), numerical computation (numpy >= 1.26.4, < 2.0.0), text processing (spacy == 3.8.7), audio and video handling (librosa >= 0.10, av == 13.1.0), and configuration management (jsonargparse, pydantic >= 2.0). The project uses hatchling as its build backend and uv as the default package manager for operator-level dependency isolation.

Usage

Use this environment for any Data-Juicer workflow. It is the mandatory base prerequisite for all pipelines including text data processing, dataset quality analysis, custom operator development, and LLM-powered data generation. All other environments (Ray, GPU, API credentials) build on top of this base.

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu recommended)	macOS supported; the known threading caveat affects only Python 3.8, which is below the required minimum
Python	>= 3.10	Hard requirement in pyproject.toml
Disk	5GB+ free	For package cache, model downloads, and dataset processing
RAM	4GB minimum	16GB+ recommended for large datasets
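
The Python and disk requirements in the table can be checked before installation. The following is a minimal sketch using only the standard library. The helper name `check_base_requirements` and its parameters are illustrative assumptions and are not part of Data-Juicer. RAM is not checked because the standard library has no portable API for it; `psutil`, a listed core dependency, could supply that figure.

```python
import shutil
import sys


def check_base_requirements(path=".", min_python=(3, 10), min_free_gb=5.0,
                            version_info=None, free_bytes=None):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper: the defaults mirror the requirements table.
    version_info and free_bytes can be injected to test without real probes.
    """
    version = tuple(version_info or sys.version_info)[:2]
    if free_bytes is None:
        # Free space on the filesystem that contains `path`.
        free_bytes = shutil.disk_usage(path).free

    problems = []
    if version < tuple(min_python):
        problems.append(
            f"Python {version[0]}.{version[1]} is older than the required "
            f"{min_python[0]}.{min_python[1]}"
        )
    free_gb = free_bytes / 1024**3
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.1f} GB free, {min_free_gb:.0f} GB or more expected"
        )
    return problems


if __name__ == "__main__":
    for problem in check_base_requirements():
        print("WARNING:", problem)
```

An empty result only means these two checks passed; it does not verify the OS or available memory.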

Dependencies

Core Packages

`datasets` >= 2.19.0
`numpy` >= 1.26.4, < 2.0.0
`pandas`
`pydantic` >= 2.0
`jsonargparse[signatures]`
`spacy` == 3.8.7
`loguru`
`tqdm`
`psutil`
`multiprocess` == 0.70.16
`dill` == 0.3.8
`uv`

Multimedia Packages

`av` == 13.1.0 (video/audio container handling)
`librosa` >= 0.10 (audio analysis)
`Pillow` (image processing)
`matplotlib`, `plotly`, `seaborn` (visualization)

Build System

`hatchling`
`uv` >= 0.1.0
`Cython` >= 0.29
`pybind11` >= 2.6
`setuptools` >= 64

Optional Dependency Groups

generic (ML/DL): torch == 2.8.0, transformers == 4.57.1, vllm == 0.11.0
vision: opencv-python, diffusers >= 0.33.0, ultralytics, decord
nlp: nltk == 3.9.1, easyocr == 1.7.1, fasttext-wheel, kenlm, sentencepiece, tiktoken
audio: torchaudio, soundfile, ffmpeg-python, audiomentations
distributed: ray[default] >= 2.51.0, pyspark == 3.5.5, s3fs, boto3
ai_services: dashscope, openai, label-studio == 1.17.0
dev: pytest, coverage, black >= 25.1.0, wandb <= 0.19.0
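
The version pins listed for the core packages can be verified against an existing installation. The sketch below is an assumption-laden illustration, not Data-Juicer code: `satisfies`, `check_installed`, and `PINS` are hypothetical names, and the comparison handles only plain numeric releases (no pre-releases, epochs, or wildcards). For production checks, the `packaging` library's `SpecifierSet` is the more robust choice.

```python
import importlib.metadata
import re

# A few pins copied from the Core Packages list (illustrative subset).
PINS = {
    "datasets": ">=2.19.0",
    "numpy": ">=1.26.4,<2.0.0",
    "pydantic": ">=2.0",
    "spacy": "==3.8.7",
}

_OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
}


def _version_tuple(version):
    """Parse the leading numeric release, e.g. '1.26.4' becomes (1, 26, 4)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def satisfies(version, spec):
    """Check a version against a comma-separated spec such as '>=1.26.4,<2.0.0'."""
    have = _version_tuple(version)
    for clause in spec.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|<|>)\s*([\d.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, wanted = match.groups()
        want = _version_tuple(wanted)
        # Pad with zeros so that (2, 0) and (2, 0, 0) compare as equal.
        width = max(len(have), len(want))
        left = have + (0,) * (width - len(have))
        right = want + (0,) * (width - len(want))
        if not _OPS[op](left, right):
            return False
    return True


def check_installed(pins):
    """Map each package name to 'ok', 'not installed', or a violation note."""
    report = {}
    for name, spec in pins.items():
        try:
            installed = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = satisfies(installed, spec)
        report[name] = "ok" if ok else f"{installed} violates {spec}"
    return report


if __name__ == "__main__":
    for package, status in check_installed(PINS).items():
        print(f"{package}: {status}")
```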

Credentials

The following environment variables configure cache locations, output storage, and multiprocessing and threading behavior:

`CACHE_HOME`: Override default cache directory (default: `~/.cache`)
`DATA_JUICER_CACHE_HOME`: Data-Juicer specific cache (default: `~/.cache/data_juicer`)
`DATA_JUICER_MODELS_CACHE`: Model storage directory
`DATA_JUICER_ASSETS_CACHE`: Assets storage directory
`DJ_PRODUCED_DATA_DIR`: Output directory for processed data
`MP_START_METHOD`: Multiprocessing start method override (fork/forkserver/spawn)
`OMP_NUM_THREADS`: OpenMP thread count (auto-set to 1 on macOS Python 3.8)
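
To illustrate how the cache variables above might combine, here is a minimal sketch that resolves each directory from the environment with a fallback. It is not the Data-Juicer implementation. The function name `resolve_cache_dirs` is hypothetical, and two behaviors are assumptions: that `DATA_JUICER_CACHE_HOME` falls back to a `data_juicer` folder under `CACHE_HOME`, and that the model and asset caches default to `models` and `assets` subfolders.

```python
import os
from pathlib import Path


def resolve_cache_dirs(env=None):
    """Resolve cache directories from environment variables.

    Hypothetical helper. Documented defaults: CACHE_HOME is ~/.cache and
    DATA_JUICER_CACHE_HOME is ~/.cache/data_juicer. The 'models' and
    'assets' subfolder names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    cache_home = Path(env.get("CACHE_HOME", "~/.cache")).expanduser()
    dj_cache = Path(
        env.get("DATA_JUICER_CACHE_HOME", cache_home / "data_juicer")
    ).expanduser()
    models = Path(
        env.get("DATA_JUICER_MODELS_CACHE", dj_cache / "models")
    ).expanduser()
    assets = Path(
        env.get("DATA_JUICER_ASSETS_CACHE", dj_cache / "assets")
    ).expanduser()
    return {
        "cache_home": cache_home,
        "data_juicer": dj_cache,
        "models": models,
        "assets": assets,
    }


if __name__ == "__main__":
    for name, directory in resolve_cache_dirs().items():
        print(f"{name}: {directory}")
```

Passing a plain dictionary as `env` makes the resolution order easy to inspect without touching the real process environment.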

Quick Install

# Install core package
pip install py-data-juicer

# Install with all optional dependencies
pip install "py-data-juicer[all]"

# Install specific extras
pip install "py-data-juicer[generic,vision,nlp]"

# Install for distributed processing
pip install "py-data-juicer[distributed]"

Code Evidence

Python version requirement from `pyproject.toml`:

requires-python = ">=3.10"

Package availability checking from `availability_utils.py:13-28`:

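import importlib.metadata
import importlib.util


def _is_package_available(pkg_name: str, return_version: bool = False):
    # find_spec alone can match a stray local directory, so also require metadata.
    package_exists = importlib.util.find_spec(pkg_name) is not None
    package_version = "N/A"
    if package_exists:
        try:
            package_version = importlib.metadata.version(pkg_name)
        except importlib.metadata.PackageNotFoundError:
            package_exists = False
    # The cited excerpt ends above; this return is an assumed completion.
    return (package_exists, package_version) if return_version else package_exists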

macOS Python 3.8 thread workaround from `availability_utils.py:34-49`:

import os
import platform
import sys

from loguru import logger  # assumed import; the excerpt does not show it

# Cap OpenMP threads before torch is imported on Python 3.8 + macOS.
major, minor = sys.version_info[:2]
system = platform.system()
if major == 3 and minor == 8 and system == "Darwin":
    logger.warning(
        "The torch.set_num_threads function does not "
        "work in python3.8 version on Mac systems. We will set "
        "OMP_NUM_THREADS to 1 manually before importing torch"
    )
    os.environ["OMP_NUM_THREADS"] = str(1)
import torch
torch.set_num_threads(1)  # avoid hanging when calling clip in multiprocessing

Lazy loading pattern from `model_utils.py:31-48`:

torch = LazyLoader("torch")
transformers = LazyLoader("transformers")
fasttext = LazyLoader("fasttext", "fasttext-wheel")  # import name first, then pip package name
kenlm = LazyLoader("kenlm")
vllm = LazyLoader("vllm")
cv2 = LazyLoader("cv2", "opencv-python")
openai = LazyLoader("openai")
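
The general idea behind this pattern can be sketched as follows. This is a simplified stand-in written for illustration, not the Data-Juicer `LazyLoader`: the class name `MiniLazyLoader` is hypothetical, and it only raises an error with an install hint where the real loader is described as able to auto-install missing packages.

```python
import importlib


class MiniLazyLoader:
    """Defer importing a module until one of its attributes is first used.

    Hypothetical sketch. module_name is the import name (e.g. "cv2") and
    package_name is the pip distribution name (e.g. "opencv-python").
    """

    def __init__(self, module_name, package_name=None):
        self._module_name = module_name
        self._package_name = package_name or module_name
        self._module = None  # nothing is imported at construction time

    def _load(self):
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as exc:
                raise ImportError(
                    f"Module {self._module_name!r} is unavailable; "
                    f"try: pip install {self._package_name}"
                ) from exc
        return self._module

    def __getattr__(self, item):
        # Called only for attributes not found on the loader itself,
        # so the first real use of the module triggers the import.
        return getattr(self._load(), item)


if __name__ == "__main__":
    lazy_json = MiniLazyLoader("json")
    print(lazy_json.dumps({"loaded": True}))
```

Deferring the import keeps startup fast and lets the base environment run without optional extras, because a heavy package such as torch is only required once an operator touches it.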

Common Errors

Error Message	Cause	Solution
`Package [X] not found, installing...`	Optional dependency not installed	Run `pip install "py-data-juicer[relevant_extra]"` or let LazyLoader auto-install
`The torch.set_num_threads function does not work in python3.8 version on Mac systems`	Known Python 3.8 + macOS threading bug	Upgrade to Python 3.10+ (the required minimum); on Python 3.8 the code sets `OMP_NUM_THREADS=1` automatically
`Failed to parse requirement from the requirement string`	Malformed requirement in operator env spec	Check requirement format matches PEP 508 syntax
`Backend should be one of ['pip', 'uv']`	Invalid package manager backend specified	Use either `pip` or `uv` (default is `uv`)
`uv not found or failed, falling back to pip`	uv package manager not installed	Install via `pip install uv` or use pip backend
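
The last two rows describe a validate-then-fall-back flow for the package manager backend. The sketch below shows one way such a flow could look. It is not taken from the Data-Juicer source: `resolve_backend` and `SUPPORTED_BACKENDS` are hypothetical names, and detecting uv with `shutil.which` is an assumption about how availability might be tested.

```python
import shutil

SUPPORTED_BACKENDS = ["pip", "uv"]


def resolve_backend(backend="uv", which=shutil.which):
    """Validate the requested backend and fall back to pip if uv is missing.

    Hypothetical helper. `which` is injectable so the lookup can be
    replaced in tests; by default it searches PATH for the executable.
    """
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Backend should be one of {SUPPORTED_BACKENDS}")
    if backend == "uv" and which("uv") is None:
        print("uv not found or failed, falling back to pip")
        return "pip"
    return backend


if __name__ == "__main__":
    print("Using backend:", resolve_backend())
```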

Compatibility Notes

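macOS: The code retains a workaround for a known Python 3.8 threading issue with PyTorch, which sets OMP_NUM_THREADS to 1 automatically. Because this environment requires Python 3.10+, the workaround should not trigger on a supported interpreter.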
Multiprocessing: CUDA operators and unforkable operators automatically use `forkserver` or `spawn` start method instead of `fork`.
numpy: Pinned to < 2.0.0 for compatibility with the HuggingFace datasets library.
fsspec: Pinned to == 2023.5.0 for stability with HuggingFace datasets.
uv Package Manager: Used as default backend for operator-level isolated environments in Ray mode. Falls back to pip if uv is unavailable.
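
The multiprocessing note above can be sketched as a small selection function. This is an illustration under stated assumptions, not the Data-Juicer logic: `choose_start_method` is a hypothetical name, the `MP_START_METHOD` override is assumed to take precedence over the automatic choice, and `forkserver` is assumed to be preferred over `spawn` when both are available.

```python
import multiprocessing
import os


def choose_start_method(uses_cuda=False, unforkable=False, env=None,
                        available=None):
    """Pick a multiprocessing start method, or None for the platform default.

    Hypothetical helper. An explicit MP_START_METHOD wins (assumption);
    otherwise CUDA or unforkable operators avoid 'fork'.
    """
    env = os.environ if env is None else env
    if available is None:
        available = multiprocessing.get_all_start_methods()

    override = env.get("MP_START_METHOD", "").strip().lower()
    if override:
        if override not in available:
            raise ValueError(
                f"MP_START_METHOD={override!r} is not supported here; "
                f"choose one of {list(available)}"
            )
        return override

    if uses_cuda or unforkable:
        # Forked children cannot safely reuse an initialized CUDA context.
        return "forkserver" if "forkserver" in available else "spawn"
    return None


if __name__ == "__main__":
    method = choose_start_method(uses_cuda=True)
    ctx = multiprocessing.get_context(method)
    print("start method:", ctx.get_start_method())
```

Returning `None` for the default case works with `multiprocessing.get_context`, which then yields the platform's default context.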
