# Data-Juicer GPU CUDA Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning, Computer_Vision |
| Last Updated | 2026-02-14 17:00 GMT |
## Overview
NVIDIA GPU environment with CUDA support, PyTorch 2.8.0, and optional vLLM 0.11.0 for GPU-accelerated operators including ML model inference, image/video processing, and LLM serving.
## Description
This environment provides GPU-accelerated execution for Data-Juicer operators that require CUDA hardware. It includes PyTorch 2.8.0 with CUDA support, transformers 4.57.1 for model loading, and optional vLLM 0.11.0 for high-throughput LLM inference. The system uses multi-level GPU detection (Ray cluster query, PyTorch CUDA API, nvidia-smi fallback) and automatic resource allocation that distributes operators across available GPUs based on memory requirements.
## Usage
Use this environment when running GPU-accelerated operators such as image aesthetics filters, NSFW detectors, text embedding similarity filters, LLM-based mappers, video processing operators, and vLLM inference pipelines. Operators with `accelerator: cuda` in their configuration automatically trigger GPU execution. The default CUDA batch size is 10 (vs 1000 for CPU) to manage GPU memory constraints.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | Windows not officially supported for CUDA operations |
| Hardware | NVIDIA GPU | Minimum 4GB VRAM; 16GB+ recommended for LLM inference |
| Driver | NVIDIA Driver 525+ | Compatible with CUDA 12.x |
| CUDA | CUDA 12.x toolkit | Required for PyTorch 2.8.0 and vLLM |
| RAM | 16GB+ system RAM | GPU operators often need CPU staging memory |
## Dependencies

### System Packages
- NVIDIA CUDA Toolkit 12.x
- NVIDIA cuDNN
- `nvidia-smi` CLI tool (for resource detection)
### Python Packages (generic extra)
- `torch` == 2.8.0
- `transformers` == 4.57.1
- `einops`
- `accelerate`
- `onnxruntime`
- `cudf-cu12` == 25.4.0
### Python Packages (vLLM, for LLM inference)
- `vllm` == 0.11.0
- `uvloop` == 0.21.0
### Computer Vision Packages (vision extra)
- `opencv-python`, `opencv-contrib-python`
- `diffusers` >= 0.33.0
- `ultralytics`
- `rembg`
- `decord`
- `timm` == 1.0.22
### Optional ML Frameworks
- `openmim` (auto-installed for mmpose operators)
- `mmcv` == 2.1.0 (auto-installed via mim)
- `mmdeploy` (auto-installed via mim)
## Environment Variables
- `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible to the process
- `VLLM_WORKER_MULTIPROC_METHOD`: Set to `spawn` automatically for vLLM workers
- `OMP_NUM_THREADS`: Thread count (auto-set to 1 to prevent multiprocessing hangs)
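The variables above must be set before CUDA is initialized to take effect. A minimal sketch of configuring them up front (`setup_cuda_env` is a hypothetical helper, not part of Data-Juicer's API):

```python
import os

def setup_cuda_env(visible_devices: str = "0") -> None:
    # Hypothetical helper mirroring the variables described above.
    # Restrict which GPUs this process can see (set before importing torch).
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices
    # Data-Juicer pins OpenMP to one thread to avoid multiprocessing hangs.
    os.environ["OMP_NUM_THREADS"] = "1"
    # vLLM workers require the spawn start method.
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

setup_cuda_env("0")
```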
## Quick Install

```bash
# Install with generic ML extras
pip install "py-data-juicer[generic]"

# Install with vision extras for image/video processing
pip install "py-data-juicer[generic,vision]"

# Install with all extras
pip install "py-data-juicer[all]"

# Verify CUDA availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}')"
```
## Code Evidence
Multi-level CUDA detection from `resource_utils.py:66-88`:
```python
def _cuda_device_count(cfg=None):
    _torch_available = _is_package_available("torch")
    if check_and_initialize_ray(cfg):
        return int(ray_gpu_count())
    if _torch_available:
        return torch.cuda.device_count()
    try:
        nvidia_smi_output = subprocess.check_output(["nvidia-smi", "-L"], text=True)
        all_devices = nvidia_smi_output.strip().split("\n")
        cuda_visible_devices = os.getenv("CUDA_VISIBLE_DEVICES")
        if cuda_visible_devices is not None:
            logger.warning("CUDA_VISIBLE_DEVICES is ignored when torch is unavailable...")
        return len(all_devices)
    except Exception:
        return 0
```
CUDA batch size default from `base_op.py:378-381`:
```python
if self.accelerator == "cuda":
    self.batch_size = kwargs.get("batch_size", 10)
else:
    self.batch_size = kwargs.get("batch_size", DEFAULT_BATCH_SIZE)  # 1000
```
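The same selection logic can be sketched as a standalone function for illustration (`resolve_batch_size` is a hypothetical name, not Data-Juicer API):

```python
DEFAULT_BATCH_SIZE = 1000       # CPU default
CUDA_DEFAULT_BATCH_SIZE = 10    # CUDA default, to fit GPU memory

def resolve_batch_size(accelerator: str, **kwargs) -> int:
    # Hypothetical sketch of the default-batch-size rule shown above:
    # an explicit batch_size always wins; otherwise the default depends
    # on whether the operator runs on CUDA or CPU.
    if accelerator == "cuda":
        return kwargs.get("batch_size", CUDA_DEFAULT_BATCH_SIZE)
    return kwargs.get("batch_size", DEFAULT_BATCH_SIZE)
```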
GPU memory query from `process_utils.py:82-93`:
```python
def get_min_cuda_memory():
    import torch

    min_cuda_memory = torch.cuda.get_device_properties(0).total_memory / 1024**2
    nvidia_smi_output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"]
    ).decode("utf-8")
    for line in nvidia_smi_output.strip().split("\n"):
        free_memory = int(line)
        min_cuda_memory = min(min_cuda_memory, free_memory)
    return min_cuda_memory
```
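With `--format=csv,noheader,nounits`, nvidia-smi emits one bare MiB integer per GPU per line, so the parsing step reduces to a one-liner. A minimal sketch against sample output (`parse_min_free_memory` is a hypothetical helper):

```python
def parse_min_free_memory(nvidia_smi_output: str) -> int:
    # Hypothetical helper: smallest free-memory value (MiB) across GPUs,
    # given output of:
    #   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
    return min(int(line) for line in nvidia_smi_output.strip().split("\n"))

# Sample output for a 3-GPU machine: the most-loaded GPU has 8192 MiB free.
sample = "10240\n8192\n16384\n"
print(parse_min_free_memory(sample))  # -> 8192
```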
vLLM environment setup from `model_utils.py:1203-1214`:
```python
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
if is_ray_mode():
    tensor_parallel_size = model_params.get("tensor_parallel_size", 1)
else:
    tensor_parallel_size = model_params.get("tensor_parallel_size", torch.cuda.device_count())
model = vllm.LLM(model=check_model_home(pretrained_model_name_or_path),
                 generation_config="auto", **model_params)
```
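The tensor-parallelism default shown above can be isolated for illustration (`resolve_tensor_parallel_size` is a hypothetical name): in Ray mode each worker defaults to a single GPU, while local mode defaults to all visible GPUs.

```python
def resolve_tensor_parallel_size(model_params: dict, ray_mode: bool,
                                 local_gpu_count: int) -> int:
    # Hypothetical sketch: explicit config wins; otherwise Ray workers
    # default to 1 GPU each, and local mode spans every visible GPU.
    if ray_mode:
        return model_params.get("tensor_parallel_size", 1)
    return model_params.get("tensor_parallel_size", local_gpu_count)
```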
GPU device assignment from `model_utils.py:1706-1711`:
```python
if use_cuda and cuda_device_count() > 0:
    rank = rank if rank is not None else 0
    rank = rank % cuda_device_count()
    device = f"cuda:{rank}"
else:
    device = "cpu"
```
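The modulo mapping above wraps worker ranks onto the available devices, so any number of workers can share a fixed GPU pool. A standalone sketch (`assign_device` is a hypothetical name):

```python
def assign_device(use_cuda: bool, gpu_count: int, rank=None) -> str:
    # Hypothetical sketch of rank-to-device assignment: rank wraps
    # around the GPU count, so worker 5 on a 4-GPU box gets cuda:1.
    if use_cuda and gpu_count > 0:
        rank = 0 if rank is None else rank
        return f"cuda:{rank % gpu_count}"
    return "cpu"
```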
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Command nvidia-smi is not found` | No NVIDIA driver or GPU installed | Install NVIDIA drivers and CUDA toolkit |
| `video_camera_calibration_static_deepcalib_mapper currently supports GPU usage only` | Operator requires GPU but running on CPU | Ensure CUDA GPU is available and configured |
| `accelerate not found, using device directly` | HuggingFace accelerate not installed | `pip install accelerate` |
| `The required cuda memory and gpu of Op[X] has not been specified` | Operator missing GPU resource hints | Set `mem_required` and `num_gpus` in operator config |
| `CUDA out of memory` | Insufficient VRAM for operation | Reduce batch_size, use smaller model, or use GPU with more VRAM |
| `Failed to install mmcv` | mmcv build requires CUDA matching PyTorch | Ensure CUDA toolkit version matches PyTorch CUDA version |
## Compatibility Notes
- Batch Size: CUDA operators default to batch_size=10 (vs 1000 for CPU), a 100x reduction to fit GPU memory constraints.
- Multiprocessing: CUDA operators automatically switch to `forkserver` or `spawn` multiprocessing start method (fork is unsafe with CUDA).
- vLLM Tensor Parallelism: In Ray mode, tensor_parallel_size defaults to 1 (single GPU per worker). In local mode, it defaults to all available GPUs.
- Model Caching: GPU models are cached in a global `MODEL_ZOO` dict keyed by model class. Call `torch.cuda.empty_cache()` after clearing the zoo.
- Auto-Install: Some operators auto-install dependencies via `openmim` (for mmcv, mmdeploy). This requires network access and a compatible CUDA toolkit.
- deepcalib Mapper: Uses TensorFlow with explicit GPU memory growth enabled to coexist with PyTorch GPU usage.
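The multiprocessing note above follows from CUDA's fork-unsafety: a forked child inherits an initialized CUDA context in a corrupt state. A minimal sketch of the preference order, assuming a hypothetical `pick_start_method` helper:

```python
import multiprocessing as mp

def pick_start_method(use_cuda: bool) -> str:
    # Hypothetical sketch: fork is unsafe once CUDA has been initialized,
    # so prefer forkserver where available (Linux), else spawn.
    methods = mp.get_all_start_methods()
    if use_cuda:
        return "forkserver" if "forkserver" in methods else "spawn"
    return "fork" if "fork" in methods else "spawn"
```

On Linux this selects `forkserver` for CUDA operators; on platforms without it (e.g. Windows), it falls back to `spawn`.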