Environment: NVIDIA DALI PyTorch Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, Computer_Vision |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
PyTorch >= 1.9.0 environment with `nvidia-dali-cuda120` for GPU-accelerated data loading via `DALIClassificationIterator` and `DALIGenericIterator`.
Description
This environment extends the base CUDA GPU environment with PyTorch-specific integration. DALI provides PyTorch iterator wrappers (`DALIClassificationIterator`, `DALIGenericIterator`) that produce `torch.Tensor` outputs directly on GPU, eliminating the CPU-to-GPU transfer bottleneck of standard `torch.utils.data.DataLoader`. The environment also supports PyTorch Distributed Data Parallel (DDP) with NCCL backend for multi-GPU training, and `torch.cuda.amp` for mixed-precision training with DALI data pipelines.
Usage
Use this environment for any PyTorch training workflow that uses DALI for data loading. This is the mandatory prerequisite for the DALIClassificationIterator, DALIGenericIterator, Train_Function_PyTorch, and PyTorch_Output_Integration implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (manylinux_2_28 compatible) | Same as CUDA GPU Environment |
| Hardware | NVIDIA GPU with CUDA support | Multi-GPU supported via NCCL |
| CUDA Toolkit | 12.0+ | Required for DALI GPU operators |
| Disk | 5GB+ SSD | For PyTorch + DALI packages |
Dependencies
System Packages
- CUDA Toolkit >= 12.0
- NCCL (for multi-GPU distributed training)
Python Packages
- `torch` >= 1.9.0
- `nvidia-dali-cuda120` >= 1.48.0
- `numpy` >= 2.0
- `pillow` (optional, for non-DALI image ops)
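To sanity-check these version floors at runtime, a small stdlib-only helper can compare installed versions against the minimums listed above. This is an illustrative sketch (the helper names are not part of DALI or PyTorch); it only handles plain dotted versions.

```python
from importlib.metadata import version, PackageNotFoundError

def parse_version(v: str) -> tuple:
    """Split '1.48.0' into (1, 48, 0), stopping at any non-numeric token."""
    parts = []
    for token in v.split("."):
        digits = "".join(ch for ch in token if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

# Minimum versions taken from the dependency list above.
MINIMUMS = {
    "torch": "1.9.0",
    "nvidia-dali-cuda120": "1.48.0",
    "numpy": "2.0",
}

def check_minimums(minimums: dict = MINIMUMS) -> dict:
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for pkg, floor in minimums.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            report[pkg] = (None, False)
            continue
        report[pkg] = (installed, parse_version(installed) >= parse_version(floor))
    return report
```

Tuple comparison handles the multi-digit case correctly: `(1, 48, 0)` compares greater than `(1, 9, 0)`, whereas naive string comparison would not.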
Environment Variables
The following environment variables are used for distributed training:
- `WORLD_SIZE`: Total number of distributed processes.
- `LOCAL_RANK`: Local GPU rank within the node.
- `RANK`: Global rank of the process.
- `MASTER_ADDR`: Address of the master node (for `init_method='env://'`).
- `MASTER_PORT`: Port on the master node.
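The resnet50 code evidence in this page reads these variables directly from `os.environ`. As a self-contained illustration, the helper below (a hypothetical name, not a DALI or PyTorch API) derives the distributed settings from any environment-like mapping, falling back to single-process defaults; the fallback values are assumptions for the sketch:

```python
import os

def distributed_config(env=os.environ):
    """Derive DDP settings from WORLD_SIZE / LOCAL_RANK / RANK.

    Training is considered distributed when WORLD_SIZE is set and
    greater than 1; otherwise single-process defaults apply.
    """
    world_size = int(env.get("WORLD_SIZE", 1))
    return {
        "distributed": world_size > 1,
        "world_size": world_size,
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "rank": int(env.get("RANK", 0)),
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", 29500)),
    }
```

Passing a plain dict instead of `os.environ` makes the logic easy to unit-test without touching the process environment.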
Quick Install
# Install PyTorch (CUDA 12.x)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Install DALI for CUDA 12.x
pip install --extra-index-url https://pypi.nvidia.com --upgrade nvidia-dali-cuda120
# Verify integration
python -c "from nvidia.dali.plugin.pytorch import DALIClassificationIterator; print('OK')"
Code Evidence
PyTorch DDP initialization from `docs/examples/use_cases/pytorch/resnet50/main.py:206-268`:
if 'WORLD_SIZE' in os.environ:
    args.distributed = int(os.environ['WORLD_SIZE']) > 1

if args.distributed:
    args.local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    args.world_size = torch.distributed.get_world_size()
Mixed precision (AMP) integration from `docs/examples/use_cases/pytorch/resnet50/main.py:480-601`:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler(
    init_scale=args.loss_scale,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
)

with autocast(enabled=args.fp16_mode):
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
DALI iterator creation from `docs/examples/use_cases/pytorch/resnet50/main.py:312-331`:
pipe = create_dali_pipeline(
    batch_size=batch_size,
    num_threads=args.workers,
    device_id=args.local_rank,
    seed=12 + args.local_rank,
    shard_id=args.local_rank,
    num_shards=args.world_size,
    pad_last_batch=True,
    is_training=True,
)
pipe.build()

train_loader = DALIClassificationIterator(
    pipe,
    reader_name="Reader",
    last_batch_policy=LastBatchPolicy.PARTIAL,
    auto_reset=True,
)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: cannot import name 'DALIClassificationIterator'` | DALI not installed or wrong CUDA version | `pip install --extra-index-url https://pypi.nvidia.com nvidia-dali-cuda120` |
| `RuntimeError: NCCL error` | NCCL communication failure in distributed training | Verify MASTER_ADDR, MASTER_PORT are set; check network connectivity |
| `CUDA out of memory` | Batch size too large for GPU VRAM | Reduce `batch_size` or use gradient accumulation |
| `StopIteration` not resetting | Iterator exhausted without auto-reset | Set `auto_reset=True` in DALIClassificationIterator |
Compatibility Notes
- PyTorch DDP: DALI handles data sharding internally via `shard_id`/`num_shards`. Do not use PyTorch's `DistributedSampler` with DALI iterators.
- torch.cuda.amp: Fully compatible with DALI output tensors. DALI outputs float32 by default; AMP autocast handles the rest.
- Learning Rate Scaling: When using DDP, scale learning rate by `batch_size * world_size / 256` (256 is the standard reference batch size).
- NGC Containers: DALI is preinstalled in NVIDIA NGC PyTorch containers.
- torchrun: Use `torchrun --nproc_per_node=N` to launch distributed training; environment variables are set automatically.
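The learning-rate scaling rule above is simple arithmetic; a one-line helper (the name is illustrative, not a library API) makes the convention explicit:

```python
def scaled_lr(base_lr: float, batch_size: int, world_size: int,
              ref_batch: int = 256) -> float:
    """Linear LR scaling: base_lr * (global batch size / reference batch size).

    batch_size is the per-GPU batch; the global batch is batch_size * world_size.
    """
    return base_lr * batch_size * world_size / ref_batch

# Example: per-GPU batch 256 on 8 GPUs with base LR 0.1
# gives a global batch of 2048 and an LR of about 0.8.
```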