Environment:Datajuicer Data juicer Python Runtime Environment

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Data_Processing
Last Updated 2026-02-14 17:00 GMT

Overview

Python 3.10+ environment with core data processing dependencies including datasets, numpy, pandas, spacy, and multimedia libraries for text, image, audio, and video processing.

Description

This environment provides the base runtime context for all Data-Juicer operations. It requires Python 3.10 or higher and includes the core dependencies for data loading (HuggingFace datasets >= 2.19.0), numerical computation (numpy >= 1.26.4, < 2.0.0), text processing (spacy == 3.8.7), audio and video handling (librosa >= 0.10, av == 13.1.0), and configuration management (jsonargparse, pydantic >= 2.0). The project uses hatchling as its build backend and uv as the default package manager for operator-level dependency isolation.

Usage

Use this environment for any Data-Juicer workflow. It is the required base for all pipelines, including text data processing, dataset quality analysis, custom operator development, and LLM-powered data generation. All other environments (Ray, GPU, API credentials) build on top of it.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | macOS supported with caveats (Python 3.8 thread issue) |
| Python | >= 3.10 | Hard requirement in pyproject.toml |
| Disk | 5GB+ free | For package cache, model downloads, and dataset processing |
| RAM | 4GB minimum | 16GB+ recommended for large datasets |
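
The Python and disk rows of the table can be checked before installing. The following is a minimal sketch using only the standard library; `check_runtime` and the two constants are hypothetical names introduced for illustration and are not part of Data-Juicer. RAM is not checked because the standard library has no portable call for it.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)    # "Python >= 3.10" from the table above
MIN_FREE_DISK_GB = 5    # "5GB+ free" from the table above


def check_runtime(path=".", version_info=None):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper, not a Data-Juicer function.
    """
    version_info = sys.version_info if version_info is None else version_info
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} is older than 3.10"
        )
    # Free space on the filesystem that contains `path`, in GiB.
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < MIN_FREE_DISK_GB:
        problems.append(f"only {free_gb:.1f} GB free at {path!r}")
    return problems


if __name__ == "__main__":
    for problem in check_runtime():
        print("WARNING:", problem)
```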

Dependencies

Core Packages

  • `datasets` >= 2.19.0
  • `numpy` >= 1.26.4, < 2.0.0
  • `pandas`
  • `pydantic` >= 2.0
  • `jsonargparse[signatures]`
  • `spacy` == 3.8.7
  • `loguru`
  • `tqdm`
  • `psutil`
  • `multiprocess` == 0.70.16
  • `dill` == 0.3.8
  • `uv`
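
To see whether an existing environment already satisfies these pins, the installed versions can be compared against them. The sketch below copies the pinned packages from the list above and uses a deliberately simple numeric comparison; `PINS`, `satisfies`, and `report` are hypothetical names, and pre-release suffixes such as `rc1` are ignored. For strict PEP 440 handling, the third-party `packaging` library is the more rigorous choice.

```python
import importlib.metadata
import operator
import re

# Pins copied from the Core Packages list; unpinned packages are omitted.
PINS = {
    "datasets": [(">=", "2.19.0")],
    "numpy": [(">=", "1.26.4"), ("<", "2.0.0")],
    "pydantic": [(">=", "2.0")],
    "spacy": [("==", "3.8.7")],
    "multiprocess": [("==", "0.70.16")],
    "dill": [("==", "0.3.8")],
}
_OPS = {">=": operator.ge, "<": operator.lt, "==": operator.eq}


def _key(version):
    """Leading numeric release segment padded to 3 parts: '2.0' -> (2, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    nums = [int(n) for n in match.group().split(".")]
    return tuple(nums + [0] * max(0, 3 - len(nums)))


def satisfies(version, constraints):
    """True if `version` meets every (operator, bound) pair."""
    return all(_OPS[op](_key(version), _key(bound)) for op, bound in constraints)


def report():
    """Print one status line per pinned package."""
    for name, constraints in PINS.items():
        try:
            found = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            print(f"{name}: not installed")
            continue
        status = "ok" if satisfies(found, constraints) else "outside pinned range"
        print(f"{name} {found}: {status}")


if __name__ == "__main__":
    report()
```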

Multimedia Packages

  • `av` == 13.1.0 (video/audio container handling)
  • `librosa` >= 0.10 (audio analysis)
  • `Pillow` (image processing)
  • `matplotlib`, `plotly`, `seaborn` (visualization)

Build System

  • `hatchling`
  • `uv` >= 0.1.0
  • `Cython` >= 0.29
  • `pybind11` >= 2.6
  • `setuptools` >= 64

Optional Dependency Groups

  • generic (ML/DL): torch == 2.8.0, transformers == 4.57.1, vllm == 0.11.0
  • vision: opencv-python, diffusers >= 0.33.0, ultralytics, decord
  • nlp: nltk == 3.9.1, easyocr == 1.7.1, fasttext-wheel, kenlm, sentencepiece, tiktoken
  • audio: torchaudio, soundfile, ffmpeg-python, audiomentations
  • distributed: ray[default] >= 2.51.0, pyspark == 3.5.5, s3fs, boto3
  • ai_services: dashscope, openai, label-studio == 1.17.0
  • dev: pytest, coverage, black >= 25.1.0, wandb <= 0.19.0

Credentials

The following environment variables configure cache and storage locations, multiprocessing, and threading:

  • `CACHE_HOME`: Override default cache directory (default: `~/.cache`)
  • `DATA_JUICER_CACHE_HOME`: Data-Juicer specific cache (default: `~/.cache/data_juicer`)
  • `DATA_JUICER_MODELS_CACHE`: Model storage directory
  • `DATA_JUICER_ASSETS_CACHE`: Assets storage directory
  • `DJ_PRODUCED_DATA_DIR`: Output directory for processed data
  • `MP_START_METHOD`: Multiprocessing start method override (fork/forkserver/spawn)
  • `OMP_NUM_THREADS`: OpenMP thread count (auto-set to 1 on macOS Python 3.8)
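
The cache variables above can be resolved as in the following sketch. It assumes that `DATA_JUICER_CACHE_HOME` defaults to a `data_juicer` directory under `CACHE_HOME`, which is consistent with the two documented defaults but is not confirmed here. The defaults for the models and assets directories are not stated on this page, so the sketch returns `None` when they are unset rather than guessing. `resolve_cache_dirs` is a hypothetical helper, not a Data-Juicer function.

```python
import os
from pathlib import Path


def resolve_cache_dirs(env=None):
    """Map the documented cache variables to paths (hypothetical helper)."""
    env = os.environ if env is None else env

    cache_home = Path(env.get("CACHE_HOME", "~/.cache")).expanduser()
    # Assumption: the Data-Juicer cache nests under CACHE_HOME by default.
    dj_cache_home = Path(
        env.get("DATA_JUICER_CACHE_HOME", cache_home / "data_juicer")
    ).expanduser()

    def optional(name):
        # No default is documented for these, so leave them unset.
        return Path(env[name]).expanduser() if name in env else None

    return {
        "cache_home": cache_home,
        "dj_cache_home": dj_cache_home,
        "models": optional("DATA_JUICER_MODELS_CACHE"),
        "assets": optional("DATA_JUICER_ASSETS_CACHE"),
    }


if __name__ == "__main__":
    for key, value in resolve_cache_dirs().items():
        print(f"{key}: {value}")
```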

Quick Install

# Install core package
pip install py-data-juicer

# Install with all optional dependencies
pip install "py-data-juicer[all]"

# Install specific extras
pip install "py-data-juicer[generic,vision,nlp]"

# Install for distributed processing
pip install "py-data-juicer[distributed]"

Code Evidence

Python version requirement from `pyproject.toml`:

requires-python = ">=3.10"

Package availability checking from `availability_utils.py:13-28`:

def _is_package_available(pkg_name: str, return_version: bool = False):
    package_exists = importlib.util.find_spec(pkg_name) is not None
    package_version = "N/A"
    if package_exists:
        try:
            package_version = importlib.metadata.version(pkg_name)
        except importlib.metadata.PackageNotFoundError:
            package_exists = False
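
The excerpt above ends before the function returns. As a self-contained illustration of the same technique (find the import spec, then confirm that real distribution metadata exists), here is a minimal sketch. `package_status` is a hypothetical name, and the return shape of a `(bool, str)` tuple is an assumption, not the signature of `_is_package_available`.

```python
import importlib.metadata
import importlib.util


def package_status(module_name, dist_name=None):
    """Return (available, version) for a top-level module.

    `dist_name` is the distribution name when it differs from the import
    name, for example module "cv2" and distribution "opencv-python".
    """
    dist_name = dist_name or module_name
    # Step 1: is there anything importable under this name at all?
    if importlib.util.find_spec(module_name) is None:
        return False, "N/A"
    # Step 2: a stray local directory can satisfy find_spec, so require
    # installed distribution metadata before reporting the package as present.
    try:
        return True, importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return False, "N/A"


if __name__ == "__main__":
    for module, dist in [("numpy", None), ("cv2", "opencv-python")]:
        print(module, package_status(module, dist))
```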

macOS Python 3.8 thread workaround from `availability_utils.py:34-49`:

major, minor = sys.version_info[:2]
system = platform.system()
if major == 3 and minor == 8 and system == "Darwin":
    logger.warning(
        "The torch.set_num_threads function does not "
        "work in python3.8 version on Mac systems. We will set "
        "OMP_NUM_THREADS to 1 manually before importing torch"
    )
    os.environ["OMP_NUM_THREADS"] = str(1)
import torch
torch.set_num_threads(1)  # avoid hanging when calling clip in multiprocessing
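
The ordering matters in the excerpt above: `OMP_NUM_THREADS` has to be set before `torch` is imported. The sketch below isolates that decision into testable functions without importing torch. `needs_omp_workaround` and `prepare_threads` are hypothetical names used only for illustration.

```python
import os
import platform
import sys


def needs_omp_workaround(version_info=None, system=None):
    """True only for Python 3.8 on macOS, mirroring the check above."""
    version_info = sys.version_info if version_info is None else version_info
    system = platform.system() if system is None else system
    return tuple(version_info[:2]) == (3, 8) and system == "Darwin"


def prepare_threads(env=None, version_info=None, system=None):
    """Set OMP_NUM_THREADS=1 in `env` when the workaround applies.

    Call this before the first `import torch`.
    """
    env = os.environ if env is None else env
    if needs_omp_workaround(version_info, system):
        env["OMP_NUM_THREADS"] = "1"
    return env


if __name__ == "__main__":
    prepare_threads()
    print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS", "<unset>"))
```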

Lazy loading pattern from `model_utils.py:31-48`:

torch = LazyLoader("torch")
transformers = LazyLoader("transformers")
fasttext = LazyLoader("fasttext", "fasttext-wheel")
kenlm = LazyLoader("kenlm")
vllm = LazyLoader("vllm")
cv2 = LazyLoader("cv2", "opencv-python")
openai = LazyLoader("openai")
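
The general idea behind this pattern is to defer the real import until the first attribute access, so that optional packages are only needed when an operator actually uses them. The following is a minimal sketch of that idea, not Data-Juicer's `LazyLoader`: `LazyModule` is a hypothetical class, and it raises a helpful error instead of auto-installing the missing package.

```python
import importlib


class LazyModule:
    """Import `module_name` on first attribute access (hypothetical sketch)."""

    def __init__(self, module_name, package_name=None):
        self._module_name = module_name
        # Distribution name to suggest when it differs from the import name.
        self._package_name = package_name or module_name
        self._module = None

    def _load(self):
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as exc:
                raise ImportError(
                    f"{self._module_name!r} is missing; "
                    f"try: pip install {self._package_name}"
                ) from exc
        return self._module

    def __getattr__(self, name):
        # Only reached for names not found on the wrapper itself.
        return getattr(self._load(), name)


if __name__ == "__main__":
    json = LazyModule("json")      # nothing imported yet
    print(json.dumps({"a": 1}))    # the import happens here
```

The two-argument form matters when the import name and the distribution name differ, as with `cv2` and `opencv-python` in the excerpt above.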

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `Package [X] not found, installing...` | Optional dependency not installed | Run `pip install "py-data-juicer[relevant_extra]"` or let LazyLoader auto-install |
| `The torch.set_num_threads function does not work in python3.8 version on Mac systems` | Known Python 3.8 + macOS threading bug | Upgrade to Python 3.10+ or set `OMP_NUM_THREADS=1` |
| `Failed to parse requirement from the requirement string` | Malformed requirement in operator env spec | Check that the requirement format matches PEP 508 syntax |
| `Backend should be one of ['pip', 'uv']` | Invalid package manager backend specified | Use either `pip` or `uv` (default is `uv`) |
| `uv not found or failed, falling back to pip` | uv package manager not installed | Install via `pip install uv` or use the pip backend |
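
The last two rows describe a validate-then-fall-back flow for the package manager backend. A rough sketch of that flow is shown below, reusing the two messages from the table. `choose_backend` is a hypothetical function, and detecting uv with `shutil.which` is an assumption about how availability might be checked, not a description of Data-Juicer's implementation.

```python
import shutil

VALID_BACKENDS = ["pip", "uv"]


def choose_backend(requested="uv", which=shutil.which):
    """Validate the backend name and fall back to pip if uv is absent.

    `which` is injectable so the behavior can be tested without uv installed.
    """
    if requested not in VALID_BACKENDS:
        raise ValueError(f"Backend should be one of {VALID_BACKENDS}")
    if requested == "uv" and which("uv") is None:
        print("uv not found or failed, falling back to pip")
        return "pip"
    return requested


if __name__ == "__main__":
    print("using backend:", choose_backend())
```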

Compatibility Notes

  • macOS: On Python 3.8, `torch.set_num_threads` does not work, so `OMP_NUM_THREADS` is set to 1 before torch is imported. Because `pyproject.toml` requires Python 3.10+, supported installs should not hit this path.
  • Multiprocessing: CUDA operators and unforkable operators automatically use `forkserver` or `spawn` start method instead of `fork`.
  • numpy: Pinned to < 2.0.0 for compatibility with the HuggingFace datasets library.
  • fsspec: Pinned to == 2023.5.0 for stability with HuggingFace datasets.
  • uv Package Manager: Used as default backend for operator-level isolated environments in Ray mode. Falls back to pip if uv is unavailable.
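
The multiprocessing note can be read as a small decision rule. The sketch below is one possible reading under stated assumptions: that `MP_START_METHOD` takes precedence when set, and that `forkserver` is preferred over `spawn` when the platform offers it. Neither detail is confirmed on this page, and `pick_start_method` is a hypothetical function.

```python
import multiprocessing
import os


def pick_start_method(uses_cuda=False, unforkable=False, env=None):
    """Choose a multiprocessing start method (hypothetical sketch)."""
    env = os.environ if env is None else env
    available = multiprocessing.get_all_start_methods()

    # Assumption: an explicit override wins over automatic selection.
    override = env.get("MP_START_METHOD")
    if override:
        if override not in available:
            raise ValueError(
                f"MP_START_METHOD={override!r} not in {available}"
            )
        return override

    # CUDA state and unforkable resources do not survive fork().
    if uses_cuda or unforkable:
        return "forkserver" if "forkserver" in available else "spawn"
    return "fork" if "fork" in available else "spawn"


if __name__ == "__main__":
    method = pick_start_method(uses_cuda=True)
    ctx = multiprocessing.get_context(method)
    print("start method:", ctx.get_start_method())
```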
