Implementation: NVIDIA DALI PyTorch Output Integration
| Knowledge Sources | |
|---|---|
| Domains | Image_Processing, GPU_Computing, Framework_Integration |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete utilities for converting DALI pipeline outputs into PyTorch CUDA tensors, provided by the nvidia.dali.plugin.pytorch module.
Description
NVIDIA DALI provides two complementary mechanisms for delivering GPU-processed data to PyTorch:
1. to_torch_tensor (Direct Conversion):
The to_torch_tensor function converts a single DALI TensorGPU object into a torch.Tensor on the same CUDA device. With copy=False, this is a zero-copy operation where the resulting PyTorch tensor shares the underlying GPU memory with the DALI output buffer. The tensor remains valid until the next pipe.run() call overwrites the buffer.
2. DALIServer + DataLoader (Proxy Integration):
The DALIServer class wraps a built DALI pipeline and exposes a proxy attribute that can be used as a transform callable in a PyTorch Dataset.__getitem__ method. The custom dali_proxy.DataLoader replaces the standard PyTorch DataLoader, internally coordinating data from multiple worker processes through the centralized DALI pipeline. This pattern allows existing PyTorch Dataset classes to gain GPU-accelerated preprocessing by simply wrapping the pipeline in a DALIServer and passing the server's proxy as a transform.
Key behaviors:
- to_torch_tensor(tensor, copy=False) performs zero-copy GPU memory sharing.
- DALIServer is used as a context manager (with statement) for proper GPU resource cleanup.
- dali_proxy.DataLoader accepts the same parameters as torch.utils.data.DataLoader (batch_size, num_workers, drop_last) plus the dali_server reference.
- The proxy pattern is compatible with multi-worker data loading and handles serialization of GPU pipeline access.
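The zero-copy behavior described above means the converted tensor aliases DALI's output buffer rather than owning its memory. Since the aliasing semantics are hard to demonstrate without a GPU, the following is an illustrative CPU-side analogue using `torch.from_numpy` (which shares memory the same way `to_torch_tensor(..., copy=False)` does); the NumPy buffer here stands in for DALI's output buffer:

```python
import numpy as np
import torch

# A NumPy buffer stands in for DALI's GPU output buffer.
buf = np.zeros(4, dtype=np.uint8)

# Shares memory with buf, analogous to to_torch_tensor(tensor, copy=False).
view = torch.from_numpy(buf)

# Simulate the pipeline overwriting its buffer on the next run():
buf[0] = 255
print(view[0].item())  # -> 255: the shared tensor sees the overwrite

# clone() takes ownership of the data, analogous to copy=True.
safe = view.clone()
buf[0] = 7
print(view[0].item())  # -> 7:   the shared view changed again
print(safe[0].item())  # -> 255: the copy is unaffected
```

The practical rule this illustrates: clone (or convert with `copy=True`) any output that must outlive the next `pipe.run()` call.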
Usage
Use to_torch_tensor in dynamic execution pipelines where the caller manually runs the pipeline and processes outputs. Use DALIServer with dali_proxy.DataLoader in training loops that need multi-worker data loading with GPU-accelerated preprocessing integrated into the standard PyTorch DataLoader contract.
Code Reference
Source Location
- Repository: NVIDIA DALI
- File (to_torch_tensor): docs/examples/zoo/images/decode.py (lines 84-89)
- File (DALIServer + DataLoader): docs/examples/zoo/images/decode_and_transform_pytorch.py (lines 132-154)
Signature
```python
# Direct conversion
to_torch_tensor(tensor, copy=False)

# Proxy integration
dali_server = dali_proxy.DALIServer(pipeline)
dataloader = dali_proxy.DataLoader(
    dali_server,
    dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=True,
)
```
Import
```python
# Direct conversion
from nvidia.dali.plugin.pytorch.torch_utils import to_torch_tensor

# Proxy integration
import nvidia.dali.plugin.pytorch.experimental.proxy as dali_proxy
# Provides: dali_proxy.DALIServer, dali_proxy.DataLoader
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tensor (to_torch_tensor) | TensorGPU | Yes | A single DALI GPU tensor from pipeline output, e.g., decoded[0][0] |
| copy | bool | No | If False, zero-copy sharing of GPU memory. If True, allocates new PyTorch tensor and copies data. Default: True |
| pipeline (DALIServer) | Pipeline | Yes | A built DALI Pipeline instance to be wrapped by the server |
| dali_server (DataLoader) | DALIServer | Yes | The DALIServer instance managing the DALI pipeline |
| dataset (DataLoader) | Dataset | Yes | PyTorch Dataset whose transform uses dali_server.proxy |
| batch_size (DataLoader) | int | Yes | Number of samples per batch |
| num_workers (DataLoader) | int | No | Number of PyTorch worker processes for data loading. Default: 0 |
| drop_last (DataLoader) | bool | No | Drop the last incomplete batch. Default: False |
Outputs
| Name | Type | Description |
|---|---|---|
| torch_tensor (to_torch_tensor) | torch.Tensor | PyTorch CUDA tensor on the same device as the DALI output, with the same shape and dtype |
| batch (DataLoader iteration) | tuple(torch.Tensor, ...) | Tuple of PyTorch tensors as yielded by iterating the DataLoader, matching the Dataset's return structure |
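The batch output contract above (iteration yields tensors matching the Dataset's return structure) is inherited from the standard `torch.utils.data.DataLoader` collation behavior. The following CPU-only sketch uses the standard DataLoader with a toy Dataset (both hypothetical names, not from the source) to show how a tuple-returning `__getitem__` maps to a tuple of batched tensors:

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PairDataset(Dataset):
    """Toy dataset whose __getitem__ returns a (tensor, label) tuple."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # A 3-element feature vector plus an integer label.
        return torch.full((3,), float(idx)), idx


# Default collation stacks each tuple position into a batched tensor.
loader = DataLoader(PairDataset(), batch_size=4, drop_last=True)
features, labels = next(iter(loader))
print(features.shape, labels.shape)  # torch.Size([4, 3]) torch.Size([4])
```

`dali_proxy.DataLoader` follows the same contract, with the difference that tensors produced by the proxy transform arrive as CUDA tensors from the DALI pipeline.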
Usage Examples
Example: Direct Zero-Copy Conversion
```python
import numpy as np
from pathlib import Path

from nvidia.dali.pipeline import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch.torch_utils import to_torch_tensor


@pipeline_def(batch_size=4, num_threads=4, device_id=0, exec_dynamic=True)
def decode_pipeline(source_name):
    inputs = fn.external_source(
        device="cpu", name=source_name,
        no_copy=False, blocking=True, dtype=types.UINT8,
    )
    decoded = fn.decoders.image(
        inputs, device="mixed", output_type=types.RGB,
        jpeg_fancy_upsampling=True,
    )
    return decoded


pipe = decode_pipeline("encoded_img", prefetch_queue_depth=1)
pipe.build()

for file_name in Path("images/").iterdir():
    encoded = np.expand_dims(np.fromfile(file_name, dtype=np.uint8), axis=0)
    decoded = pipe.run(encoded_img=encoded)
    # Zero-copy conversion to a PyTorch CUDA tensor;
    # valid only until the next pipe.run() overwrites the buffer
    img_on_gpu = to_torch_tensor(decoded[0][0], copy=False)
    print(img_on_gpu.shape, img_on_gpu.device)
```
Example: Proxy Integration with PyTorch DataLoader
```python
import glob
from pathlib import Path

import numpy as np
from torch.utils.data import Dataset

import nvidia.dali.plugin.pytorch.experimental.proxy as dali_proxy
from nvidia.dali import pipeline_def, fn, types


class ImageDataset(Dataset):
    def __init__(self, images_dir, transform=None):
        self.images_dir = images_dir
        self.transform = transform
        self.image_ids = glob.glob("*.jpeg", root_dir=images_dir)

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_path = Path(self.images_dir) / self.image_ids[idx]
        encoded_img = np.expand_dims(
            np.fromfile(image_path, dtype=np.uint8), axis=0
        )
        if self.transform is not None:
            return self.transform(encoded_img)
        return encoded_img


@pipeline_def
def image_pipe(img_hw=(320, 200)):
    encoded_images = fn.external_source(name="images", no_copy=True)
    decoded = fn.decoders.image(
        encoded_images, device="mixed", output_type=types.RGB,
    )
    images = fn.resize(decoded, size=img_hw, interp_type=types.INTERP_LINEAR)
    return images


with dali_proxy.DALIServer(
    image_pipe(batch_size=4, num_threads=4, device_id=0)
) as dali_server:
    dataset = ImageDataset("images/", transform=dali_server.proxy)
    dataloader = dali_proxy.DataLoader(
        dali_server, dataset,
        batch_size=4, num_workers=2, drop_last=True,
    )
    for images in dataloader:
        print(f"Batch shape: {images.shape}")  # torch.Tensor on CUDA
```