
Environment:NVIDIA DALI PyTorch Environment

From Leeroopedia


Knowledge Sources

  • Domains: Infrastructure, Deep_Learning, Computer_Vision
  • Last Updated: 2026-02-08 16:00 GMT

Overview

PyTorch >= 1.9.0 environment with `nvidia-dali-cuda120` for GPU-accelerated data loading via `DALIClassificationIterator` and `DALIGenericIterator`.

Description

This environment extends the base CUDA GPU environment with PyTorch-specific integration. DALI provides PyTorch iterator wrappers (`DALIClassificationIterator`, `DALIGenericIterator`) that produce `torch.Tensor` outputs directly on GPU, eliminating the CPU-to-GPU transfer bottleneck of standard `torch.utils.data.DataLoader`. The environment also supports PyTorch Distributed Data Parallel (DDP) with NCCL backend for multi-GPU training, and `torch.cuda.amp` for mixed-precision training with DALI data pipelines.
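A minimal sketch of such a GPU-resident pipeline is shown below. The operator choices, batch size, image size, and the `jpeg_pipeline` name are illustrative only, not taken from this environment's spec; the import guard lets the snippet load on machines without DALI installed.

```python
# Sketch only: running this end-to-end requires an NVIDIA GPU plus the
# packages listed under Dependencies below.
try:
    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn
    from nvidia.dali.plugin.pytorch import DALIGenericIterator
    HAVE_DALI = True
except ImportError:
    HAVE_DALI = False

if HAVE_DALI:
    @pipeline_def(batch_size=64, num_threads=4, device_id=0)
    def jpeg_pipeline(data_dir):
        # Read and decode entirely on the GPU ("mixed" = CPU parse, GPU decode),
        # so the iterator hands PyTorch tensors that are already device-resident.
        jpegs, labels = fn.readers.file(file_root=data_dir, name="Reader")
        images = fn.decoders.image(jpegs, device="mixed")
        images = fn.resize(images, resize_x=224, resize_y=224)
        return images, labels
```

Wrapping the built pipeline in `DALIGenericIterator` then yields `torch.Tensor` batches on the GPU with no host-to-device copy in the training loop.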

Usage

Use this environment for any PyTorch training workflow that uses DALI for data loading. This is the mandatory prerequisite for the DALIClassificationIterator, DALIGenericIterator, Train_Function_PyTorch, and PyTorch_Output_Integration implementations.

System Requirements

Category     | Requirement                       | Notes
OS           | Linux (manylinux_2_28 compatible) | Same as CUDA GPU Environment
Hardware     | NVIDIA GPU with CUDA support      | Multi-GPU supported via NCCL
CUDA Toolkit | 12.0+                             | Required for DALI GPU operators
Disk         | 5 GB+ SSD                         | For PyTorch + DALI packages

Dependencies

System Packages

  • CUDA Toolkit >= 12.0
  • NCCL (for multi-GPU distributed training)

Python Packages

  • `torch` >= 1.9.0
  • `nvidia-dali-cuda120` >= 1.48.0
  • `numpy` >= 2.0
  • `pillow` (optional, for non-DALI image ops)

Credentials

The following environment variables are used for distributed training:

  • `WORLD_SIZE`: Total number of distributed processes.
  • `LOCAL_RANK`: Local GPU rank within the node.
  • `RANK`: Global rank of the process.
  • `MASTER_ADDR`: Address of the master node (for `init_method='env://'`).
  • `MASTER_PORT`: Port on the master node.
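As a minimal sketch of reading these variables (the `ddp_env_config` helper name and the single-process fallback values are our own, not part of DALI or PyTorch), a launcher-agnostic script might do:

```python
import os

def ddp_env_config():
    """Read the distributed-training environment variables listed above.

    Falls back to single-process defaults when the variables are unset,
    e.g. when the script is launched without torchrun.
    """
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return {
        "world_size": world_size,
        "distributed": world_size > 1,   # mirrors the check in the DDP excerpt below
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
    }
```

Under `torchrun` all six values are populated automatically, so the fallbacks only matter for plain single-GPU runs.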

Quick Install

# Install PyTorch (CUDA 12.x)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Install DALI for CUDA 12.x
pip install --extra-index-url https://pypi.nvidia.com --upgrade nvidia-dali-cuda120

# Verify integration
python -c "from nvidia.dali.plugin.pytorch import DALIClassificationIterator; print('OK')"

Code Evidence

PyTorch DDP initialization from `docs/examples/use_cases/pytorch/resnet50/main.py:206-268`:

if 'WORLD_SIZE' in os.environ:
    args.distributed = int(os.environ['WORLD_SIZE']) > 1

if args.distributed:
    args.local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    args.world_size = torch.distributed.get_world_size()

Mixed precision (AMP) integration from `docs/examples/use_cases/pytorch/resnet50/main.py:480-601`:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler(
    init_scale=args.loss_scale,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000
)

with autocast(enabled=args.fp16_mode):
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

DALI iterator creation from `docs/examples/use_cases/pytorch/resnet50/main.py:312-331`:

pipe = create_dali_pipeline(
    batch_size=batch_size,
    num_threads=args.workers,
    device_id=args.local_rank,
    seed=12 + args.local_rank,
    shard_id=args.local_rank,
    num_shards=args.world_size,
    pad_last_batch=True,
    is_training=True,
)
pipe.build()
train_loader = DALIClassificationIterator(
    pipe,
    reader_name="Reader",
    last_batch_policy=LastBatchPolicy.PARTIAL,
    auto_reset=True,
)
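Each step of the iterator built above yields a list with one dict per pipeline, keyed `"data"` and `"label"`. A hedged sketch of the consuming loop (the `unpack_dali_batch` and `run_epoch` helper names are our own; the structure of the yielded batch is the iterator's documented output):

```python
def unpack_dali_batch(batch):
    # DALIClassificationIterator yields a list with one dict per pipeline;
    # with a single pipeline, element 0 holds the batch tensors.
    out = batch[0]
    return out["data"], out["label"]

def run_epoch(loader, step_fn):
    # Shape of a typical training loop over the DALI iterator.
    for batch in loader:
        images, labels = unpack_dali_batch(batch)
        step_fn(images, labels)
    # With auto_reset=True the iterator resets itself for the next epoch;
    # otherwise loader.reset() must be called here.
```

With `last_batch_policy=LastBatchPolicy.PARTIAL`, the final batch of an epoch may be smaller than `batch_size`, so `step_fn` should not assume a fixed leading dimension.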

Common Errors

Error Message | Cause | Solution
`ImportError: cannot import name 'DALIClassificationIterator'` | DALI not installed or wrong CUDA version | `pip install --extra-index-url https://pypi.nvidia.com nvidia-dali-cuda120`
`RuntimeError: NCCL error` | NCCL communication failure in distributed training | Verify `MASTER_ADDR` and `MASTER_PORT` are set; check network connectivity
`CUDA out of memory` | Batch size too large for GPU VRAM | Reduce `batch_size` or use gradient accumulation
`StopIteration` not resetting | Iterator exhausted without auto-reset | Set `auto_reset=True` in `DALIClassificationIterator`

Compatibility Notes

  • PyTorch DDP: DALI handles data sharding internally via `shard_id`/`num_shards`. Do not use PyTorch's `DistributedSampler` with DALI iterators.
  • torch.cuda.amp: Fully compatible with DALI output tensors. DALI outputs float32 by default; AMP autocast handles the rest.
  • Learning Rate Scaling: When using DDP, scale learning rate by `batch_size * world_size / 256` (256 is the standard reference batch size).
  • NGC Containers: DALI is preinstalled in NVIDIA NGC PyTorch containers.
  • torchrun: Use `torchrun --nproc_per_node=N` to launch distributed training; environment variables are set automatically.
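The learning-rate scaling rule above can be sketched as follows (the `scaled_lr` helper name and the example base LR are illustrative):

```python
def scaled_lr(base_lr, batch_size, world_size, reference_batch=256):
    # Linear scaling rule: grow the LR in proportion to the effective
    # global batch size relative to the reference batch of 256.
    return base_lr * batch_size * world_size / reference_batch
```

For example, a per-GPU batch of 128 on 4 GPUs gives an effective batch of 512, doubling a base LR of 0.1 to 0.2.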
