
Implementation:NVIDIA DALI PyTorch Output Integration


Knowledge Sources
Domains Image_Processing, GPU_Computing, Framework_Integration
Last Updated 2026-02-08 00:00 GMT

Overview

Concrete utilities in the nvidia.dali.plugin.pytorch module for converting DALI pipeline outputs into PyTorch CUDA tensors.

Description

NVIDIA DALI provides two complementary mechanisms for delivering GPU-processed data to PyTorch:

1. to_torch_tensor (Direct Conversion):

The to_torch_tensor function converts a single DALI TensorGPU object into a torch.Tensor on the same CUDA device. With copy=False, this is a zero-copy operation where the resulting PyTorch tensor shares the underlying GPU memory with the DALI output buffer. The tensor remains valid until the next pipe.run() call overwrites the buffer.
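Because copy=False aliases DALI's output buffer rather than owning it, any data needed past the next pipe.run() should be copied first. The aliasing semantics can be illustrated without a GPU using NumPy (a conceptual sketch, not the DALI API; the array merely stands in for DALI's reusable output buffer):

```python
import numpy as np

dali_buffer = np.zeros(4, dtype=np.float32)  # stands in for DALI's reusable output buffer
shared = dali_buffer.view()   # like copy=False: aliases the same memory
owned = dali_buffer.copy()    # like copy=True: independent allocation
dali_buffer[:] = 7.0          # simulates the next pipe.run() overwriting the buffer
print(shared[0], owned[0])    # shared sees 7.0; owned still holds 0.0
```

The same discipline applies to the real API: call .clone() on (or request copy=True for) any tensor that must outlive the iteration that produced it.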

2. DALIServer + DataLoader (Proxy Integration):

The DALIServer class wraps a built DALI pipeline and exposes a proxy attribute that can be used as a transform callable in a PyTorch Dataset.__getitem__ method. The custom dali_proxy.DataLoader replaces the standard PyTorch DataLoader, internally coordinating data from multiple worker processes through the centralized DALI pipeline. This pattern allows existing PyTorch Dataset classes to gain GPU-accelerated preprocessing by simply wrapping the pipeline in a DALIServer and passing the server's proxy as a transform.
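Stripped of DALI specifics, the proxy simply slots into the ordinary transform-callable convention of Dataset.__getitem__; a framework-free sketch of that wiring (TinyDataset and str.upper are hypothetical stand-ins for the user's Dataset and dali_server.proxy):

```python
class TinyDataset:
    """Mimics a PyTorch Dataset that defers processing to a transform callable."""

    def __init__(self, items, transform=None):
        self.items = items
        self.transform = transform  # with DALI, this would be dali_server.proxy

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        sample = self.items[idx]
        # dali_server.proxy would not process the sample here; it queues it
        # for the centralized GPU pipeline, and dali_proxy.DataLoader later
        # resolves the result into a CUDA tensor when batches are collated.
        return self.transform(sample) if self.transform else sample

ds = TinyDataset(["a", "b", "c"], transform=str.upper)
print(ds[1])  # "B"
```

Because the transform is just a callable attribute, an existing Dataset gains GPU preprocessing without changes to its own code.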

Key behaviors:

  • to_torch_tensor(tensor, copy=False) performs zero-copy GPU memory sharing.
  • DALIServer is used as a context manager (with statement) for proper GPU resource cleanup.
  • dali_proxy.DataLoader accepts the same parameters as torch.utils.data.DataLoader (batch_size, num_workers, drop_last) plus the dali_server reference.
  • The proxy pattern is compatible with multi-worker data loading and handles serialization of GPU pipeline access.

Usage

Use to_torch_tensor in dynamic execution pipelines where the caller manually runs the pipeline and processes outputs. Use DALIServer with dali_proxy.DataLoader in training loops that need multi-worker data loading with GPU-accelerated preprocessing integrated into the standard PyTorch DataLoader contract.

Code Reference

Source Location

  • Repository: NVIDIA DALI
  • File (to_torch_tensor): docs/examples/zoo/images/decode.py (lines 84-89)
  • File (DALIServer + DataLoader): docs/examples/zoo/images/decode_and_transform_pytorch.py (lines 132-154)

Signature

# Direct conversion
to_torch_tensor(tensor, copy=False)

# Proxy integration
dali_server = dali_proxy.DALIServer(pipeline)
dataloader = dali_proxy.DataLoader(
    dali_server,
    dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=True,
)

Import

# Direct conversion
from nvidia.dali.plugin.pytorch.torch_utils import to_torch_tensor

# Proxy integration
import nvidia.dali.plugin.pytorch.experimental.proxy as dali_proxy
# Provides: dali_proxy.DALIServer, dali_proxy.DataLoader

I/O Contract

Inputs

  • tensor (to_torch_tensor): TensorGPU, required. A single DALI GPU tensor from pipeline output, e.g., decoded[0][0].
  • copy (to_torch_tensor): bool, optional (default: True). If False, zero-copy sharing of GPU memory; if True, allocates a new PyTorch tensor and copies the data.
  • pipeline (DALIServer): Pipeline, required. A built DALI Pipeline instance to be wrapped by the server.
  • dali_server (DataLoader): DALIServer, required. The DALIServer instance managing the DALI pipeline.
  • dataset (DataLoader): Dataset, required. PyTorch Dataset whose transform uses dali_server.proxy.
  • batch_size (DataLoader): int, required. Number of samples per batch.
  • num_workers (DataLoader): int, optional (default: 0). Number of PyTorch worker processes for data loading.
  • drop_last (DataLoader): bool, optional (default: False). Drop the last incomplete batch.

Outputs

  • torch_tensor (to_torch_tensor): torch.Tensor. PyTorch CUDA tensor on the same device as the DALI output, with the same shape and dtype.
  • batch (DataLoader iteration): tuple(torch.Tensor, ...). Tuple of PyTorch tensors yielded by iterating the DataLoader, matching the Dataset's return structure.

Usage Examples

Example: Direct Zero-Copy Conversion

import numpy as np
from nvidia.dali.pipeline import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch.torch_utils import to_torch_tensor

@pipeline_def(batch_size=4, num_threads=4, device_id=0, exec_dynamic=True)
def decode_pipeline(source_name):
    inputs = fn.external_source(
        device="cpu", name=source_name,
        no_copy=False, blocking=True, dtype=types.UINT8,
    )
    decoded = fn.decoders.image(
        inputs, device="mixed", output_type=types.RGB,
        jpeg_fancy_upsampling=True,
    )
    return decoded

pipe = decode_pipeline("encoded_img", prefetch_queue_depth=1)
pipe.build()

from pathlib import Path
for file_name in Path("images/").iterdir():
    encoded = np.expand_dims(np.fromfile(file_name, dtype=np.uint8), axis=0)
    decoded = pipe.run(encoded_img=encoded)
    # Zero-copy conversion to PyTorch CUDA tensor
    img_on_gpu = to_torch_tensor(decoded[0][0], copy=False)
    print(img_on_gpu.shape, img_on_gpu.device)

Example: Proxy Integration with PyTorch DataLoader

import glob
import numpy as np
from pathlib import Path
from torch.utils.data import Dataset

import nvidia.dali.plugin.pytorch.experimental.proxy as dali_proxy
from nvidia.dali import pipeline_def, fn, types

class ImageDataset(Dataset):
    def __init__(self, images_dir, transform=None):
        self.images_dir = images_dir
        self.transform = transform
        self.image_ids = glob.glob("*.jpeg", root_dir=images_dir)

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_path = Path(self.images_dir) / self.image_ids[idx]
        encoded_img = np.expand_dims(
            np.fromfile(image_path, dtype=np.uint8), axis=0
        )
        if self.transform is not None:
            return self.transform(encoded_img)
        return encoded_img

@pipeline_def
def image_pipe(img_hw=(320, 200)):
    encoded_images = fn.external_source(name="images", no_copy=True)
    decoded = fn.decoders.image(
        encoded_images, device="mixed", output_type=types.RGB,
    )
    images = fn.resize(decoded, size=img_hw, interp_type=types.INTERP_LINEAR)
    return images

with dali_proxy.DALIServer(
    image_pipe(batch_size=4, num_threads=4, device_id=0)
) as dali_server:
    dataset = ImageDataset("images/", transform=dali_server.proxy)
    dataloader = dali_proxy.DataLoader(
        dali_server, dataset,
        batch_size=4, num_workers=2, drop_last=True,
    )
    for images in dataloader:
        print(f"Batch shape: {images.shape}")  # torch.Tensor on CUDA

