Environment: NVIDIA DALI PyTorch Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, Computer_Vision |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
PyTorch >= 1.9.0 environment with `nvidia-dali-cuda120` for GPU-accelerated data loading via `DALIClassificationIterator` and `DALIGenericIterator`.
Description
This environment extends the base CUDA GPU environment with PyTorch-specific integration. DALI provides PyTorch iterator wrappers (`DALIClassificationIterator`, `DALIGenericIterator`) that produce `torch.Tensor` outputs directly on GPU, eliminating the CPU-to-GPU transfer bottleneck of standard `torch.utils.data.DataLoader`. The environment also supports PyTorch Distributed Data Parallel (DDP) with NCCL backend for multi-GPU training, and `torch.cuda.amp` for mixed-precision training with DALI data pipelines.
Usage
Use this environment for any PyTorch training workflow that uses DALI for data loading. This is the mandatory prerequisite for the DALIClassificationIterator, DALIGenericIterator, Train_Function_PyTorch, and PyTorch_Output_Integration implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (manylinux_2_28 compatible) | Same as CUDA GPU Environment |
| Hardware | NVIDIA GPU with CUDA support | Multi-GPU supported via NCCL |
| CUDA Toolkit | 12.0+ | Required for DALI GPU operators |
| Disk | 5GB+ SSD | For PyTorch + DALI packages |
Dependencies
System Packages
- CUDA Toolkit >= 12.0
- NCCL (for multi-GPU distributed training)
Python Packages
- `torch` >= 1.9.0
- `nvidia-dali-cuda120` >= 1.48.0
- `numpy` >= 2.0
- `pillow` (optional, for non-DALI image ops)
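To sanity-check these version floors at runtime, a small stdlib-only helper can compare installed versions against the minimums listed above. This is an illustrative sketch (the helper names are not part of DALI or PyTorch); it only handles plain dotted versions.

```python
from importlib.metadata import version, PackageNotFoundError

def parse_version(v: str) -> tuple:
    """Split '1.48.0' into (1, 48, 0), stopping at any non-numeric token."""
    parts = []
    for token in v.split("."):
        digits = "".join(ch for ch in token if ch.isdigit())
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

# Minimum versions taken from the dependency list above.
MINIMUMS = {
    "torch": "1.9.0",
    "nvidia-dali-cuda120": "1.48.0",
    "numpy": "2.0",
}

def check_minimums(minimums: dict = MINIMUMS) -> dict:
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for pkg, floor in minimums.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            report[pkg] = (None, False)
            continue
        report[pkg] = (installed, parse_version(installed) >= parse_version(floor))
    return report
```

Tuple comparison handles the multi-digit case correctly: `(1, 48, 0)` compares greater than `(1, 9, 0)`, whereas naive string comparison would not.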
Environment Variables
The following environment variables are used for distributed training:
- `WORLD_SIZE`: Total number of distributed processes.
- `LOCAL_RANK`: Local GPU rank within the node.
- `RANK`: Global rank of the process.
- `MASTER_ADDR`: Address of the master node (for `init_method='env://'`).
- `MASTER_PORT`: Port on the master node.
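The resnet50 code evidence in this page reads these variables directly from `os.environ`. As a self-contained illustration, the helper below (a hypothetical name, not a DALI or PyTorch API) derives the distributed settings from any environment-like mapping, falling back to single-process defaults; the fallback values are assumptions for the sketch:

```python
import os

def distributed_config(env=os.environ):
    """Derive DDP settings from WORLD_SIZE / LOCAL_RANK / RANK.

    Training is considered distributed when WORLD_SIZE is set and
    greater than 1; otherwise single-process defaults apply.
    """
    world_size = int(env.get("WORLD_SIZE", 1))
    return {
        "distributed": world_size > 1,
        "world_size": world_size,
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "rank": int(env.get("RANK", 0)),
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", 29500)),
    }
```

Passing a plain dict instead of `os.environ` makes the logic easy to unit-test without touching the process environment.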
Quick Install
# Install PyTorch (CUDA 12.x)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Install DALI for CUDA 12.x
pip install --extra-index-url https://pypi.nvidia.com --upgrade nvidia-dali-cuda120
# Verify integration
python -c "from nvidia.dali.plugin.pytorch import DALIClassificationIterator; print('OK')"
Code Evidence
PyTorch DDP initialization from `docs/examples/use_cases/pytorch/resnet50/main.py:206-268`:
if 'WORLD_SIZE' in os.environ:
    args.distributed = int(os.environ['WORLD_SIZE']) > 1

if args.distributed:
    args.local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(args.local_rank)
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    args.world_size = torch.distributed.get_world_size()
Mixed precision (AMP) integration from `docs/examples/use_cases/pytorch/resnet50/main.py:480-601`:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler(
    init_scale=args.loss_scale,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
)

with autocast(enabled=args.fp16_mode):
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
DALI iterator creation from `docs/examples/use_cases/pytorch/resnet50/main.py:312-331`:
pipe = create_dali_pipeline(
    batch_size=batch_size,
    num_threads=args.workers,
    device_id=args.local_rank,
    seed=12 + args.local_rank,
    shard_id=args.local_rank,
    num_shards=args.world_size,
    pad_last_batch=True,
    is_training=True,
)
pipe.build()

train_loader = DALIClassificationIterator(
    pipe,
    reader_name="Reader",
    last_batch_policy=LastBatchPolicy.PARTIAL,
    auto_reset=True,
)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: cannot import name 'DALIClassificationIterator'` | DALI not installed or wrong CUDA version | `pip install --extra-index-url https://pypi.nvidia.com nvidia-dali-cuda120` |
| `RuntimeError: NCCL error` | NCCL communication failure in distributed training | Verify MASTER_ADDR, MASTER_PORT are set; check network connectivity |
| `CUDA out of memory` | Batch size too large for GPU VRAM | Reduce `batch_size` or use gradient accumulation |
| `StopIteration` not resetting | Iterator exhausted without auto-reset | Set `auto_reset=True` in DALIClassificationIterator |
Compatibility Notes
- PyTorch DDP: DALI handles data sharding internally via `shard_id`/`num_shards`. Do not use PyTorch's `DistributedSampler` with DALI iterators.
- torch.cuda.amp: Fully compatible with DALI output tensors. DALI outputs float32 by default; AMP autocast handles the rest.
- Learning Rate Scaling: When using DDP, scale learning rate by `batch_size * world_size / 256` (256 is the standard reference batch size).
- NGC Containers: DALI is preinstalled in NVIDIA NGC PyTorch containers.
- torchrun: Use `torchrun --nproc_per_node=N` to launch distributed training; environment variables are set automatically.
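The learning-rate scaling rule above is simple arithmetic; a one-line helper (the name is illustrative, not a library API) makes the convention explicit:

```python
def scaled_lr(base_lr: float, batch_size: int, world_size: int,
              ref_batch: int = 256) -> float:
    """Linear LR scaling: base_lr * (global batch size / reference batch size).

    batch_size is the per-GPU batch; the global batch is batch_size * world_size.
    """
    return base_lr * batch_size * world_size / ref_batch

# Example: per-GPU batch 256 on 8 GPUs with base LR 0.1
# gives a global batch of 2048 and an LR of about 0.8.
```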