# Data-Juicer GPU CUDA Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning, Computer_Vision |
| Last Updated | 2026-02-14 17:00 GMT |
## Overview
NVIDIA GPU environment with CUDA support, PyTorch 2.8.0, and optional vLLM 0.11.0 for GPU-accelerated operators including ML model inference, image/video processing, and LLM serving.
## Description
This environment provides GPU-accelerated execution for Data-Juicer operators that require CUDA hardware. It includes PyTorch 2.8.0 with CUDA support, transformers 4.57.1 for model loading, and optional vLLM 0.11.0 for high-throughput LLM inference. The system uses multi-level GPU detection (Ray cluster query, PyTorch CUDA API, nvidia-smi fallback) and automatic resource allocation that distributes operators across available GPUs based on memory requirements.
## Usage
Use this environment when running GPU-accelerated operators such as image aesthetics filters, NSFW detectors, text embedding similarity filters, LLM-based mappers, video processing operators, and vLLM inference pipelines. Operators with `accelerator: cuda` in their configuration automatically trigger GPU execution. The default CUDA batch size is 10 (vs 1000 for CPU) to manage GPU memory constraints.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | Windows not officially supported for CUDA operations |
| Hardware | NVIDIA GPU | Minimum 4GB VRAM; 16GB+ recommended for LLM inference |
| Driver | NVIDIA Driver 525+ | Compatible with CUDA 12.x |
| CUDA | CUDA 12.x toolkit | Required for PyTorch 2.8.0 and vLLM |
| RAM | 16GB+ system RAM | GPU operators often need CPU staging memory |
## Dependencies

### System Packages
- NVIDIA CUDA Toolkit 12.x
- NVIDIA cuDNN
- `nvidia-smi` CLI tool (for resource detection)
### Python Packages (generic extra)
- `torch` == 2.8.0
- `transformers` == 4.57.1
- `einops`
- `accelerate`
- `onnxruntime`
- `cudf-cu12` == 25.4.0
### Python Packages (vLLM, for LLM inference)
- `vllm` == 0.11.0
- `uvloop` == 0.21.0
### Computer Vision Packages (vision extra)
- `opencv-python`, `opencv-contrib-python`
- `diffusers` >= 0.33.0
- `ultralytics`
- `rembg`
- `decord`
- `timm` == 1.0.22
### Optional ML Frameworks
- `openmim` (auto-installed for mmpose operators)
- `mmcv` == 2.1.0 (auto-installed via mim)
- `mmdeploy` (auto-installed via mim)
## Environment Variables
- `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible to the process
- `VLLM_WORKER_MULTIPROC_METHOD`: Set to `spawn` automatically for vLLM workers
- `OMP_NUM_THREADS`: Thread count (auto-set to 1 to prevent multiprocessing hangs)
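The variables above must be set before CUDA is initialized to take effect. A minimal sketch of configuring them up front (`setup_cuda_env` is a hypothetical helper, not part of Data-Juicer's API):

```python
import os

def setup_cuda_env(visible_devices: str = "0") -> None:
    # Hypothetical helper mirroring the variables described above.
    # Restrict which GPUs this process can see (set before importing torch).
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices
    # Data-Juicer pins OpenMP to one thread to avoid multiprocessing hangs.
    os.environ["OMP_NUM_THREADS"] = "1"
    # vLLM workers require the spawn start method.
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

setup_cuda_env("0")
```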
## Quick Install

```bash
# Install with generic ML extras
pip install "py-data-juicer[generic]"

# Install with vision extras for image/video processing
pip install "py-data-juicer[generic,vision]"

# Install with all extras
pip install "py-data-juicer[all]"

# Verify CUDA availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}')"
```
## Code Evidence
Multi-level CUDA detection from `resource_utils.py:66-88`:
```python
def _cuda_device_count(cfg=None):
    _torch_available = _is_package_available("torch")
    if check_and_initialize_ray(cfg):
        return int(ray_gpu_count())
    if _torch_available:
        return torch.cuda.device_count()
    try:
        nvidia_smi_output = subprocess.check_output(["nvidia-smi", "-L"], text=True)
        all_devices = nvidia_smi_output.strip().split("\n")
        cuda_visible_devices = os.getenv("CUDA_VISIBLE_DEVICES")
        if cuda_visible_devices is not None:
            logger.warning("CUDA_VISIBLE_DEVICES is ignored when torch is unavailable...")
        return len(all_devices)
    except Exception:
        return 0
```
CUDA batch size default from `base_op.py:378-381`:
```python
if self.accelerator == "cuda":
    self.batch_size = kwargs.get("batch_size", 10)
else:
    self.batch_size = kwargs.get("batch_size", DEFAULT_BATCH_SIZE)  # 1000
```
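The same selection logic can be sketched as a standalone function for illustration (`resolve_batch_size` is a hypothetical name, not Data-Juicer API):

```python
DEFAULT_BATCH_SIZE = 1000       # CPU default
CUDA_DEFAULT_BATCH_SIZE = 10    # CUDA default, to fit GPU memory

def resolve_batch_size(accelerator: str, **kwargs) -> int:
    # Hypothetical sketch of the default-batch-size rule shown above:
    # an explicit batch_size always wins; otherwise the default depends
    # on whether the operator runs on CUDA or CPU.
    if accelerator == "cuda":
        return kwargs.get("batch_size", CUDA_DEFAULT_BATCH_SIZE)
    return kwargs.get("batch_size", DEFAULT_BATCH_SIZE)
```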
GPU memory query from `process_utils.py:82-93`:
```python
def get_min_cuda_memory():
    import torch

    min_cuda_memory = torch.cuda.get_device_properties(0).total_memory / 1024**2
    nvidia_smi_output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"]
    ).decode("utf-8")
    for line in nvidia_smi_output.strip().split("\n"):
        free_memory = int(line)
        min_cuda_memory = min(min_cuda_memory, free_memory)
    return min_cuda_memory
```
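With `--format=csv,noheader,nounits`, nvidia-smi emits one bare MiB integer per GPU per line, so the parsing step reduces to a one-liner. A minimal sketch against sample output (`parse_min_free_memory` is a hypothetical helper):

```python
def parse_min_free_memory(nvidia_smi_output: str) -> int:
    # Hypothetical helper: smallest free-memory value (MiB) across GPUs,
    # given output of:
    #   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
    return min(int(line) for line in nvidia_smi_output.strip().split("\n"))

# Sample output for a 3-GPU machine: the most-loaded GPU has 8192 MiB free.
sample = "10240\n8192\n16384\n"
print(parse_min_free_memory(sample))  # -> 8192
```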
vLLM environment setup from `model_utils.py:1203-1214`:
```python
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
if is_ray_mode():
    tensor_parallel_size = model_params.get("tensor_parallel_size", 1)
else:
    tensor_parallel_size = model_params.get("tensor_parallel_size", torch.cuda.device_count())
model = vllm.LLM(model=check_model_home(pretrained_model_name_or_path),
                 generation_config="auto", **model_params)
```
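The tensor-parallelism default shown above can be isolated for illustration (`resolve_tensor_parallel_size` is a hypothetical name): in Ray mode each worker defaults to a single GPU, while local mode defaults to all visible GPUs.

```python
def resolve_tensor_parallel_size(model_params: dict, ray_mode: bool,
                                 local_gpu_count: int) -> int:
    # Hypothetical sketch: explicit config wins; otherwise Ray workers
    # default to 1 GPU each, and local mode spans every visible GPU.
    if ray_mode:
        return model_params.get("tensor_parallel_size", 1)
    return model_params.get("tensor_parallel_size", local_gpu_count)
```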
GPU device assignment from `model_utils.py:1706-1711`:
```python
if use_cuda and cuda_device_count() > 0:
    rank = rank if rank is not None else 0
    rank = rank % cuda_device_count()
    device = f"cuda:{rank}"
else:
    device = "cpu"
```
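The modulo mapping above wraps worker ranks onto the available devices, so any number of workers can share a fixed GPU pool. A standalone sketch (`assign_device` is a hypothetical name):

```python
def assign_device(use_cuda: bool, gpu_count: int, rank=None) -> str:
    # Hypothetical sketch of rank-to-device assignment: rank wraps
    # around the GPU count, so worker 5 on a 4-GPU box gets cuda:1.
    if use_cuda and gpu_count > 0:
        rank = 0 if rank is None else rank
        return f"cuda:{rank % gpu_count}"
    return "cpu"
```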
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Command nvidia-smi is not found` | No NVIDIA driver or GPU installed | Install NVIDIA drivers and CUDA toolkit |
| `video_camera_calibration_static_deepcalib_mapper currently supports GPU usage only` | Operator requires GPU but running on CPU | Ensure CUDA GPU is available and configured |
| `accelerate not found, using device directly` | HuggingFace accelerate not installed | `pip install accelerate` |
| `The required cuda memory and gpu of Op[X] has not been specified` | Operator missing GPU resource hints | Set `mem_required` and `num_gpus` in operator config |
| `CUDA out of memory` | Insufficient VRAM for operation | Reduce batch_size, use smaller model, or use GPU with more VRAM |
| `Failed to install mmcv` | mmcv build requires CUDA matching PyTorch | Ensure CUDA toolkit version matches PyTorch CUDA version |
## Compatibility Notes
- Batch Size: CUDA operators default to batch_size=10 (vs 1000 for CPU), a 100x reduction to fit GPU memory constraints.
- Multiprocessing: CUDA operators automatically switch to `forkserver` or `spawn` multiprocessing start method (fork is unsafe with CUDA).
- vLLM Tensor Parallelism: In Ray mode, tensor_parallel_size defaults to 1 (single GPU per worker). In local mode, it defaults to all available GPUs.
- Model Caching: GPU models are cached in a global `MODEL_ZOO` dict keyed by model class. Call `torch.cuda.empty_cache()` after clearing the zoo.
- Auto-Install: Some operators auto-install dependencies via `openmim` (for mmcv, mmdeploy). This requires network access and a compatible CUDA toolkit.
- deepcalib Mapper: Uses TensorFlow with explicit GPU memory growth enabled to coexist with PyTorch GPU usage.
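The multiprocessing note above follows from CUDA's fork-unsafety: a forked child inherits an initialized CUDA context in a corrupt state. A minimal sketch of the preference order, assuming a hypothetical `pick_start_method` helper:

```python
import multiprocessing as mp

def pick_start_method(use_cuda: bool) -> str:
    # Hypothetical sketch: fork is unsafe once CUDA has been initialized,
    # so prefer forkserver where available (Linux), else spawn.
    methods = mp.get_all_start_methods()
    if use_cuda:
        return "forkserver" if "forkserver" in methods else "spawn"
    return "fork" if "fork" in methods else "spawn"
```

On Linux this selects `forkserver` for CUDA operators; on platforms without it (e.g. Windows), it falls back to `spawn`.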