
Environment:Data-Juicer GPU CUDA Environment

From Leeroopedia
Knowledge Sources
Domains: Infrastructure, Deep_Learning, Computer_Vision
Last Updated: 2026-02-14 17:00 GMT

Overview

NVIDIA GPU environment with CUDA support, PyTorch 2.8.0, and optional vLLM 0.11.0 for GPU-accelerated operators including ML model inference, image/video processing, and LLM serving.

Description

This environment provides GPU-accelerated execution for Data-Juicer operators that require CUDA hardware. It includes PyTorch 2.8.0 with CUDA support, transformers 4.57.1 for model loading, and optional vLLM 0.11.0 for high-throughput LLM inference. The system uses multi-level GPU detection (Ray cluster query, PyTorch CUDA API, nvidia-smi fallback) and automatic resource allocation that distributes operators across available GPUs based on memory requirements.

Usage

Use this environment when running GPU-accelerated operators such as image aesthetics filters, NSFW detectors, text embedding similarity filters, LLM-based mappers, video processing operators, and vLLM inference pipelines. Operators with `accelerator: cuda` in their configuration automatically trigger GPU execution. The default CUDA batch size is 10 (vs 1000 for CPU) to manage GPU memory constraints.
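
As an illustrative sketch (the operator name and memory figure are examples, not verified defaults), a GPU operator might be declared like this in a Data-Juicer YAML config:

```yaml
process:
  - image_aesthetics_filter:   # any operator that supports GPU execution
      accelerator: cuda        # triggers GPU execution for this operator
      mem_required: '1500MB'   # per-instance GPU memory hint for the scheduler
      batch_size: 10           # CUDA default; lower it further if you hit OOM
```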

System Requirements

Category | Requirement | Notes
OS | Linux (Ubuntu 20.04+) | Windows not officially supported for CUDA operations
Hardware | NVIDIA GPU | Minimum 4 GB VRAM; 16 GB+ recommended for LLM inference
Driver | NVIDIA Driver 525+ | Compatible with CUDA 12.x
CUDA | CUDA 12.x toolkit | Required for PyTorch 2.8.0 and vLLM
RAM | 16 GB+ system RAM | GPU operators often need CPU staging memory

Dependencies

System Packages

  • NVIDIA CUDA Toolkit 12.x
  • NVIDIA cuDNN
  • `nvidia-smi` CLI tool (for resource detection)

Python Packages (generic extra)

  • `torch` == 2.8.0
  • `transformers` == 4.57.1
  • `einops`
  • `accelerate`
  • `onnxruntime`
  • `cudf-cu12` == 25.4.0

Python Packages (vLLM for LLM inference)

  • `vllm` == 0.11.0
  • `uvloop` == 0.21.0

Computer Vision Packages (vision extra)

  • `opencv-python`, `opencv-contrib-python`
  • `diffusers` >= 0.33.0
  • `ultralytics`
  • `rembg`
  • `decord`
  • `timm` == 1.0.22

Optional ML Frameworks

  • `openmim` (auto-installed for mmpose operators)
  • `mmcv` == 2.1.0 (auto-installed via mim)
  • `mmdeploy` (auto-installed via mim)

Environment Variables

  • `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible to the process
  • `VLLM_WORKER_MULTIPROC_METHOD`: Set to `spawn` automatically for vLLM workers
  • `OMP_NUM_THREADS`: Thread count (auto-set to 1 to prevent multiprocessing hangs)
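
A minimal sketch of pinning these variables manually; the values are illustrative, and they must be set before torch or vLLM initialize CUDA:

```python
import os

# Restrict this process to GPU 0 (set before any CUDA library is imported).
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")
# One OpenMP thread per worker avoids hangs when combined with multiprocessing.
os.environ.setdefault("OMP_NUM_THREADS", "1")

print(os.environ["OMP_NUM_THREADS"])
```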

Quick Install

# Install with generic ML extras
pip install "py-data-juicer[generic]"

# Install with vision extras for image/video processing
pip install "py-data-juicer[generic,vision]"

# Install with all extras
pip install "py-data-juicer[all]"

# Verify CUDA availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}')"

Code Evidence

Multi-level CUDA detection from `resource_utils.py:66-88`:

def _cuda_device_count(cfg=None):
    _torch_available = _is_package_available("torch")
    if check_and_initialize_ray(cfg):
        return int(ray_gpu_count())
    if _torch_available:
        return torch.cuda.device_count()
    try:
        nvidia_smi_output = subprocess.check_output(["nvidia-smi", "-L"], text=True)
        all_devices = nvidia_smi_output.strip().split("\n")
        cuda_visible_devices = os.getenv("CUDA_VISIBLE_DEVICES")
        if cuda_visible_devices is not None:
            logger.warning("CUDA_VISIBLE_DEVICES is ignored when torch is unavailable...")
        return len(all_devices)
    except Exception:
        return 0

CUDA batch size default from `base_op.py:378-381`:

if self.accelerator == "cuda":
    self.batch_size = kwargs.get("batch_size", 10)
else:
    self.batch_size = kwargs.get("batch_size", DEFAULT_BATCH_SIZE)  # 1000

GPU memory query from `process_utils.py:82-93`:

def get_min_cuda_memory():
    import torch
    min_cuda_memory = torch.cuda.get_device_properties(0).total_memory / 1024**2
    nvidia_smi_output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"]
    ).decode("utf-8")
    for line in nvidia_smi_output.strip().split("\n"):
        free_memory = int(line)
        min_cuda_memory = min(min_cuda_memory, free_memory)
    return min_cuda_memory
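
The parsing step above can be exercised without a GPU. A self-contained sketch (the helper name is ours) of reducing the `nvidia-smi` CSV query output to the minimum free memory across devices:

```python
def min_free_memory_mib(nvidia_smi_output: str) -> int:
    """Smallest free-memory figure (MiB) across all GPUs in the query output."""
    return min(int(line) for line in nvidia_smi_output.strip().split("\n"))

# Simulated `--query-gpu=memory.free --format=csv,noheader,nounits` output:
sample = "10240\n8192\n16384\n"
print(min_free_memory_mib(sample))  # 8192
```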

vLLM environment setup from `model_utils.py:1203-1214`:

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
if is_ray_mode():
    tensor_parallel_size = model_params.get("tensor_parallel_size", 1)
else:
    tensor_parallel_size = model_params.get("tensor_parallel_size", torch.cuda.device_count())
model = vllm.LLM(model=check_model_home(pretrained_model_name_or_path),
                  generation_config="auto", **model_params)

GPU device assignment from `model_utils.py:1706-1711`:

if use_cuda and cuda_device_count() > 0:
    rank = rank if rank is not None else 0
    rank = rank % cuda_device_count()
    device = f"cuda:{rank}"
else:
    device = "cpu"
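
The same round-robin logic can be factored into a small standalone helper (the function name is ours) that handles any rank and GPU count:

```python
def assign_device(rank, gpu_count):
    """Map a worker rank to a CUDA device string, falling back to CPU."""
    if gpu_count > 0:
        rank = 0 if rank is None else rank
        return f"cuda:{rank % gpu_count}"  # wrap ranks beyond the GPU count
    return "cpu"

print(assign_device(5, 4))    # cuda:1
print(assign_device(None, 2)) # cuda:0
print(assign_device(0, 0))    # cpu
```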

Common Errors

Error Message | Cause | Solution
`Command nvidia-smi is not found` | No NVIDIA driver or GPU installed | Install NVIDIA drivers and the CUDA toolkit
`video_camera_calibration_static_deepcalib_mapper currently supports GPU usage only` | Operator requires GPU but is running on CPU | Ensure a CUDA GPU is available and configured
`accelerate not found, using device directly` | HuggingFace accelerate not installed | `pip install accelerate`
`The required cuda memory and gpu of Op[X] has not been specified` | Operator missing GPU resource hints | Set `mem_required` and `num_gpus` in the operator config
`CUDA out of memory` | Insufficient VRAM for the operation | Reduce `batch_size`, use a smaller model, or use a GPU with more VRAM
`Failed to install mmcv` | mmcv build requires CUDA matching PyTorch | Ensure the CUDA toolkit version matches PyTorch's CUDA version
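
For `CUDA out of memory` in particular, a common mitigation is to halve the batch size and retry. A GPU-free sketch of that pattern (using `MemoryError` as a stand-in for `torch.cuda.OutOfMemoryError`; the function names are ours):

```python
def run_with_backoff(run_batch, batch_size, min_batch=1):
    """Retry run_batch with a halved batch size until it fits or min_batch fails."""
    while batch_size >= min_batch:
        try:
            return run_batch(batch_size)
        except MemoryError:
            batch_size //= 2  # halve and retry on out-of-memory failures
    raise MemoryError("even the minimum batch size did not fit")

# Simulated workload that only fits at batch sizes <= 4:
def fake_batch(size):
    if size > 4:
        raise MemoryError
    return f"processed {size} samples"

print(run_with_backoff(fake_batch, 10))  # processed 2 samples
```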

Compatibility Notes

  • Batch Size: CUDA operators default to batch_size=10 (vs 1000 on CPU), a 100x reduction made to respect GPU memory constraints.
  • Multiprocessing: CUDA operators automatically switch to `forkserver` or `spawn` multiprocessing start method (fork is unsafe with CUDA).
  • vLLM Tensor Parallelism: In Ray mode, tensor_parallel_size defaults to 1 (single GPU per worker). In local mode, it defaults to all available GPUs.
  • Model Caching: GPU models are cached in a global `MODEL_ZOO` dict keyed by model class. Call `torch.cuda.empty_cache()` after clearing the zoo.
  • Auto-Install: Some operators auto-install dependencies via `openmim` (for mmcv, mmdeploy). This requires network access and a compatible CUDA toolkit.
  • deepcalib Mapper: Uses TensorFlow with explicit GPU memory growth enabled to coexist with PyTorch GPU usage.
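
The multiprocessing note above can also be applied explicitly in your own driver code; a minimal sketch of requesting a CUDA-safe start method:

```python
import multiprocessing as mp

# fork copies the parent's CUDA context, which is unsafe; spawn starts clean workers.
ctx = mp.get_context("spawn")
print(ctx.get_start_method())  # spawn
```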
