Environment:Datajuicer Data juicer Python Runtime Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Data_Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Python 3.10+ environment with core data processing dependencies (datasets, numpy, pandas, spacy) plus multimedia libraries for image, audio, and video processing.
Description
This environment provides the base runtime context for all Data-Juicer operations. It requires Python 3.10 or higher and includes core dependencies for data loading (HuggingFace datasets >= 2.19.0), numerical computation (numpy >= 1.26.4, < 2.0.0), text processing (spacy == 3.8.7), audio and video handling (librosa >= 0.10 for audio analysis, av == 13.1.0 for container handling), and configuration management (jsonargparse, pydantic >= 2.0). The project uses hatchling as its build backend and uv as the default package manager for operator-level dependency isolation.
Usage
Use this environment for any Data-Juicer workflow. It is the mandatory base prerequisite for all pipelines including text data processing, dataset quality analysis, custom operator development, and LLM-powered data generation. All other environments (Ray, GPU, API credentials) build on top of this base.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | macOS supported; the known threading caveat affects Python 3.8 only, which is below the required version |
| Python | >= 3.10 | Hard requirement in pyproject.toml |
| Disk | 5GB+ free | For package cache, model downloads, and dataset processing |
| RAM | 4GB minimum | 16GB+ recommended for large datasets |
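The Python and disk rows of the table above can be checked before installation. The following is a minimal preflight sketch using only the standard library; the function name `check_base_requirements` and its injectable parameters are illustrative and not part of Data-Juicer. RAM is left out because the standard library has no portable API for it.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)   # from the System Requirements table
MIN_FREE_DISK_GB = 5   # from the System Requirements table


def check_base_requirements(path=".", version=None, free_bytes=None):
    """Return a list of human-readable problems; an empty list means OK.

    `version` and `free_bytes` can be injected for testing. By default they
    are read from the running interpreter and the filesystem holding `path`.
    """
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free

    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required"
        )
    free_gb = free_bytes / 1024**3
    if free_gb < MIN_FREE_DISK_GB:
        problems.append(
            f"{free_gb:.1f} GB free disk, {MIN_FREE_DISK_GB} GB+ required"
        )
    return problems
```

Calling `check_base_requirements()` with no arguments inspects the current interpreter and working directory; printing the returned list is enough for a quick manual check.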
Dependencies
Core Packages
- `datasets` >= 2.19.0
- `numpy` >= 1.26.4, < 2.0.0
- `pandas`
- `pydantic` >= 2.0
- `jsonargparse[signatures]`
- `spacy` == 3.8.7
- `loguru`
- `tqdm`
- `psutil`
- `multiprocess` == 0.70.16
- `dill` == 0.3.8
- `uv`
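The pins above can be verified against an existing installation. This is a rough sketch, assuming plain numeric versions; the helper names (`release_tuple`, `satisfies`, `check_core_pins`) are hypothetical, and full PEP 440 handling (pre-releases, post-releases) would need the `packaging` library instead.

```python
import importlib.metadata
import re

# Pins copied from the Core Packages list (a subset, for illustration).
CORE_PINS = {
    "datasets": [(">=", "2.19.0")],
    "numpy": [(">=", "1.26.4"), ("<", "2.0.0")],
    "pydantic": [(">=", "2.0")],
    "spacy": [("==", "3.8.7")],
    "multiprocess": [("==", "0.70.16")],
    "dill": [("==", "0.3.8")],
}

_OPS = {
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "==": lambda a, b: a == b,
}


def release_tuple(version, width=3):
    """Parse the leading numeric part of a version string into a tuple.

    Simplification: suffixes such as 'rc1' or '.post1' are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (width - len(parts))  # pad so '2.0' equals '2.0.0'
    return tuple(parts)


def satisfies(installed, constraints):
    """Check an installed version against (operator, version) pairs."""
    have = release_tuple(installed)
    return all(_OPS[op](have, release_tuple(want)) for op, want in constraints)


def check_core_pins(pins=CORE_PINS):
    """Map each package name to 'ok', 'missing', or a violation message."""
    report = {}
    for name, constraints in pins.items():
        try:
            installed = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = satisfies(installed, constraints)
        report[name] = "ok" if ok else f"version {installed} violates pin"
    return report
```

Tuple comparison is used here only to avoid a third-party dependency in the sketch.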
Multimedia Packages
- `av` == 13.1.0 (video/audio container handling)
- `librosa` >= 0.10 (audio analysis)
- `Pillow` (image processing)
- `matplotlib`, `plotly`, `seaborn` (visualization)
Build System
- `hatchling`
- `uv` >= 0.1.0
- `Cython` >= 0.29
- `pybind11` >= 2.6
- `setuptools` >= 64
Optional Dependency Groups
- generic (ML/DL): torch == 2.8.0, transformers == 4.57.1, vllm == 0.11.0
- vision: opencv-python, diffusers >= 0.33.0, ultralytics, decord
- nlp: nltk == 3.9.1, easyocr == 1.7.1, fasttext-wheel, kenlm, sentencepiece, tiktoken
- audio: torchaudio, soundfile, ffmpeg-python, audiomentations
- distributed: ray[default] >= 2.51.0, pyspark == 3.5.5, s3fs, boto3
- ai_services: dashscope, openai, label-studio == 1.17.0
- dev: pytest, coverage, black >= 25.1.0, wandb <= 0.19.0
Credentials
The following environment variables configure cache locations, output storage, and multiprocessing/threading behavior:
- `CACHE_HOME`: Override default cache directory (default: `~/.cache`)
- `DATA_JUICER_CACHE_HOME`: Data-Juicer specific cache (default: `~/.cache/data_juicer`)
- `DATA_JUICER_MODELS_CACHE`: Model storage directory
- `DATA_JUICER_ASSETS_CACHE`: Assets storage directory
- `DJ_PRODUCED_DATA_DIR`: Output directory for processed data
- `MP_START_METHOD`: Multiprocessing start method override (fork/forkserver/spawn)
- `OMP_NUM_THREADS`: OpenMP thread count (auto-set to 1 on macOS Python 3.8)
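The cache-related variables above can be resolved with a small helper. This is a sketch under stated assumptions: the function `resolve_cache_dirs` is hypothetical, the fallback of `DATA_JUICER_CACHE_HOME` to `<CACHE_HOME>/data_juicer` is inferred from the two documented defaults, and the `models` and `assets` sub-directory names are assumed because this page does not state those defaults.

```python
import os
from pathlib import Path


def resolve_cache_dirs(env=None):
    """Resolve cache directories from environment variables.

    Documented defaults: CACHE_HOME -> ~/.cache and
    DATA_JUICER_CACHE_HOME -> ~/.cache/data_juicer.
    ASSUMPTION: the 'models' and 'assets' sub-directory names are
    illustrative only.
    """
    env = os.environ if env is None else env
    cache_home = Path(env.get("CACHE_HOME", "~/.cache")).expanduser()
    dj_home = Path(
        env.get("DATA_JUICER_CACHE_HOME", cache_home / "data_juicer")
    ).expanduser()
    models = Path(
        env.get("DATA_JUICER_MODELS_CACHE", dj_home / "models")
    ).expanduser()
    assets = Path(
        env.get("DATA_JUICER_ASSETS_CACHE", dj_home / "assets")
    ).expanduser()
    return {
        "cache_home": cache_home,
        "dj_cache_home": dj_home,
        "models": models,
        "assets": assets,
    }
```

Passing a plain dictionary as `env` makes it easy to preview how a given set of overrides would resolve without touching the real process environment.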
Quick Install
```bash
# Install core package
pip install py-data-juicer

# Install with all optional dependencies
pip install "py-data-juicer[all]"

# Install specific extras
pip install "py-data-juicer[generic,vision,nlp]"

# Install for distributed processing
pip install "py-data-juicer[distributed]"
```
Code Evidence
Python version requirement from `pyproject.toml`:
```toml
requires-python = ">=3.10"
```
Package availability checking from `availability_utils.py:13-28`:
```python
def _is_package_available(pkg_name: str, return_version: bool = False):
    package_exists = importlib.util.find_spec(pkg_name) is not None
    package_version = "N/A"
    if package_exists:
        try:
            package_version = importlib.metadata.version(pkg_name)
        except importlib.metadata.PackageNotFoundError:
            package_exists = False
```
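The excerpt above stops before the function returns. A self-contained sketch of the same two-step check (import spec first, then distribution metadata) might look like the following; the public name `is_package_available` and the return shape are assumptions, since the excerpt does not show what the original returns.

```python
import importlib.metadata
import importlib.util


def is_package_available(pkg_name, return_version=False):
    """Report whether `pkg_name` is importable and has installed metadata.

    ASSUMPTION: returning a bool, or a (bool, version) tuple when
    `return_version` is true, is illustrative only.
    """
    package_exists = importlib.util.find_spec(pkg_name) is not None
    package_version = "N/A"
    if package_exists:
        try:
            package_version = importlib.metadata.version(pkg_name)
        except importlib.metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name
            # (for example a stray local directory with the same name).
            package_exists = False
    if return_version:
        return package_exists, package_version
    return package_exists
```

The second step matters because `find_spec` alone can succeed for a local folder that happens to share the package's name.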
macOS Python 3.8 thread workaround from `availability_utils.py:34-49`:
```python
major, minor = sys.version_info[:2]
system = platform.system()
if major == 3 and minor == 8 and system == "Darwin":
    logger.warning(
        "The torch.set_num_threads function does not "
        "work in python3.8 version on Mac systems. We will set "
        "OMP_NUM_THREADS to 1 manually before importing torch"
    )
    os.environ["OMP_NUM_THREADS"] = str(1)
import torch
torch.set_num_threads(1)  # avoid hanging when calling clip in multiprocessing
```
Lazy loading pattern from `model_utils.py:31-48`:
```python
torch = LazyLoader("torch")
transformers = LazyLoader("transformers")
fasttext = LazyLoader("fasttext-wheel", "fasttext-wheel")
kenlm = LazyLoader("kenlm")
vllm = LazyLoader("vllm")
cv2 = LazyLoader("cv2", "opencv-python")
openai = LazyLoader("openai")
```
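To illustrate the idea behind the pattern, here is a minimal lazy-import sketch. The class `LazyModule` is hypothetical and is not Data-Juicer's `LazyLoader`; it only shows how an import can be deferred until the first attribute access and how a separate pip package name can be surfaced in the error hint. Unlike the behavior described on this page, it does not auto-install anything.

```python
import importlib


class LazyModule:
    """Hypothetical stand-in for a lazy loader; NOT Data-Juicer's LazyLoader.

    The real module is imported on first attribute access, so importing
    the file that declares the placeholder stays cheap.
    """

    _PRIVATE = ("_module", "_module_name", "_package_name")

    def __init__(self, module_name, package_name=None):
        self._module_name = module_name
        # pip distribution name, used only in the error hint
        self._package_name = package_name or module_name
        self._module = None

    def _load(self):
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as exc:
                raise ImportError(
                    f"Module '{self._module_name}' is not available; "
                    f"try: pip install {self._package_name}"
                ) from exc
        return self._module

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for the real module's API.
        if name in self._PRIVATE:
            raise AttributeError(name)  # guard against recursion
        return getattr(self._load(), name)
```

For example, `json = LazyModule("json")` costs nothing until `json.dumps(...)` is first called, at which point the standard `json` module is imported and cached.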
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Package [X] not found, installing...` | Optional dependency not installed | Run `pip install "py-data-juicer[<extra>]"` with the relevant extra, or let LazyLoader auto-install |
| `The torch.set_num_threads function does not work in python3.8 version on Mac systems` | Known Python 3.8 + macOS threading bug | Upgrade to Python 3.10+ (required by this environment) or set `OMP_NUM_THREADS=1` |
| `Failed to parse requirement from the requirement string` | Malformed requirement in operator env spec | Check that the requirement string follows PEP 508 syntax |
| `Backend should be one of ['pip', 'uv']` | Invalid package manager backend specified | Use either `pip` or `uv` (default is `uv`) |
| `uv not found or failed, falling back to pip` | uv package manager not installed | Install via `pip install uv` or use pip backend |
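The last two rows describe backend validation and the uv-to-pip fallback. A minimal sketch of that decision, assuming a hypothetical helper `build_install_command` (Data-Juicer's own logic may differ), could be:

```python
import shutil
import sys

SUPPORTED_BACKENDS = ("pip", "uv")


def build_install_command(requirement, backend="uv", which=shutil.which):
    """Build an install command, falling back from uv to pip if uv is absent.

    Hypothetical helper for illustration only. `which` is injectable so
    the fallback path can be exercised without changing the system.
    """
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Backend should be one of {list(SUPPORTED_BACKENDS)}"
        )
    if backend == "uv":
        if which("uv") is not None:
            return ["uv", "pip", "install", requirement]
        print("uv not found, falling back to pip", file=sys.stderr)
    # Use the current interpreter's pip so the package lands in this env.
    return [sys.executable, "-m", "pip", "install", requirement]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building the command separately keeps the selection logic testable without installing anything.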
Compatibility Notes
- macOS: On Python 3.8, `torch.set_num_threads` does not work, so `OMP_NUM_THREADS` is set to 1 before importing torch. This path is not reached on the required Python 3.10+.
- Multiprocessing: CUDA operators and unforkable operators automatically use the `forkserver` or `spawn` start method instead of `fork`.
- numpy: Pinned to < 2.0.0 for compatibility with the HuggingFace datasets library.
- fsspec: Pinned to == 2023.5.0 for stability with HuggingFace datasets.
- uv Package Manager: Used as default backend for operator-level isolated environments in Ray mode. Falls back to pip if uv is unavailable.
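The multiprocessing note above can be sketched as a small selection function. Everything here is an assumption made for illustration: the name `choose_start_method`, the rule that an explicit `MP_START_METHOD` wins, and the preference for `forkserver` over `spawn`. The page only states that CUDA and unforkable operators avoid `fork`.

```python
import multiprocessing
import os


def choose_start_method(uses_cuda, forkable=True, env=None, available=None):
    """Pick a multiprocessing start method.

    ASSUMPTIONS (not confirmed by this page): an explicit MP_START_METHOD
    wins; otherwise CUDA or unforkable operators prefer 'forkserver' and
    fall back to 'spawn' where 'forkserver' is unavailable.
    """
    env = os.environ if env is None else env
    if available is None:
        available = multiprocessing.get_all_start_methods()

    override = env.get("MP_START_METHOD", "").strip().lower()
    if override:
        if override not in available:
            raise ValueError(
                f"MP_START_METHOD={override!r} not in {available}"
            )
        return override

    if uses_cuda or not forkable:
        # A forked child inherits CUDA state it cannot safely reuse.
        return "forkserver" if "forkserver" in available else "spawn"
    return "fork" if "fork" in available else "spawn"
```

The result can be handed to `multiprocessing.get_context(method)` to create pools with that start method without changing the global default.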
Related Pages
- Implementation:Datajuicer_Data_juicer_Init_Configs
- Implementation:Datajuicer_Data_juicer_DatasetBuilder_Load_Dataset
- Implementation:Datajuicer_Data_juicer_Load_Ops
- Implementation:Datajuicer_Data_juicer_NestedDataset_Process
- Implementation:Datajuicer_Data_juicer_Exporter_Export
- Implementation:Datajuicer_Data_juicer_Operator_Base_Classes
- Implementation:Datajuicer_Data_juicer_OPEnvSpec_And_LazyLoader