Workflow: NVIDIA DALI Video Processing Pipeline

Knowledge Sources	NVIDIA DALI, DALI Documentation, DALI Video Reader
Domains	Data_Loading, Video_Processing, Deep_Learning, PyTorch
Last Updated	2026-02-08 17:00 GMT

Overview

End-to-end process for GPU-accelerated video frame sequence loading and preprocessing using NVIDIA DALI pipelines for video-based deep learning models.

Description

This workflow defines the procedure for loading video data (for example, H.264 streams in MP4 containers) using DALI's hardware-accelerated video reader, extracting fixed-length temporal frame sequences, and preprocessing them for models that consume multi-frame inputs, such as video super-resolution networks, temporal shift modules, or action recognition models. The DALI video reader uses NVDEC hardware to decode frames directly on the GPU, so decoded pixel data never has to be copied from host memory to the device. Frame sequences are then cropped, resized if needed, and transposed to match the model's input expectations.

Usage

Execute this workflow when you need to train or evaluate a video-based deep learning model and want to accelerate the data loading pipeline. It is appropriate when working with video files (for example, MP4 containers holding H.264, HEVC, or VP9 streams) and models that require sequences of consecutive frames as input, such as video super-resolution, action recognition, or temporal segmentation models.

Execution Steps

Step 1: Prepare Video Data

Organize video files into the expected directory structure. For training, split source videos into individual scene clips and optionally transcode them to the target resolution and codec. Create separate directories for training and validation sets.

Key considerations:

Videos should use codecs supported by DALI's video reader (such as H.264, VP9, or HEVC); which codecs can be hardware-decoded also depends on the NVDEC capabilities of the GPU
Scene splitting ensures clips contain continuous temporal content
Transcoding to a target resolution reduces decode overhead if the source resolution is much larger than needed
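
The preparation step can be sketched as follows. This is a minimal illustration, not part of the workflow's own tooling: the function names, the 10% validation fraction, and the assumption that scene start times and durations come from a separate scene detector are all hypothetical. The first helper only builds an ffmpeg argument list (it does not run it), and ffmpeg is assumed to be installed.

```python
import random


def build_clip_cmd(src, dst, start_s, duration_s, width, height):
    """Return an ffmpeg argument list that cuts one clip and transcodes it to H.264."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),               # clip start, in seconds
        "-t", str(duration_s),             # clip length, in seconds
        "-i", str(src),
        "-vf", f"scale={width}:{height}",  # resize to the target resolution
        "-c:v", "libx264",                 # re-encode the video stream as H.264
        "-an",                             # drop audio; only frames are needed
        str(dst),
    ]


def split_clips(clips, val_fraction=0.1, seed=0):
    """Shuffle clip paths deterministically and return (train, val) lists."""
    clips = sorted(clips)
    random.Random(seed).shuffle(clips)
    n_val = max(1, int(len(clips) * val_fraction)) if clips else 0
    return clips[n_val:], clips[:n_val]
```

Each command can be executed with subprocess.run(cmd, check=True), and the two lists returned by split_clips can be written into separate training and validation directories.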

Step 2: Define the DALI Video Pipeline

Create a pipeline using @pipeline_def that uses fn.readers.video to read sequences of consecutive frames from video files. The video reader extracts fixed-length frame subsequences from each video and returns each one as a 4D tensor with layout FHWC (frames x height x width x channels).

Key considerations:

Configure sequence_length to match the number of input frames required by the model
Use device="gpu" for hardware-accelerated decoding via NVDEC
Set random_shuffle=True for training, and set shard_id and num_shards so that each GPU reads its own part of the dataset in multi-GPU training
The reader returns batches of frame sequences, not individual frames
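
A pipeline along these lines might look like the sketch below. The helper names, the reader name "Reader", the thread count, and the choice of device_id=shard_id (one process per GPU on a single node) are assumptions made for illustration. DALI is imported inside the builder only so that the argument helper can be used on machines without DALI installed.

```python
def reader_kwargs(filenames, sequence_length, training, shard_id=0, num_shards=1):
    """Collect keyword arguments for fn.readers.video for one GPU's shard."""
    return dict(
        device="gpu",                    # decode on the GPU via NVDEC
        filenames=list(filenames),
        sequence_length=sequence_length,  # frames per returned sequence
        random_shuffle=training,          # shuffle only when training
        shard_id=shard_id,
        num_shards=num_shards,
        name="Reader",                    # lets an iterator query the epoch size
    )


def build_video_pipeline(filenames, sequence_length, batch_size, training=True,
                         shard_id=0, num_shards=1, num_threads=2):
    """Return a DALI pipeline that yields batches of FHWC frame sequences."""
    # Lazy import (an assumption of this sketch) keeps the module importable
    # without DALI.
    from nvidia.dali import pipeline_def, fn

    @pipeline_def
    def video_pipe():
        # Each sample is one whole sequence, not a single frame.
        return fn.readers.video(**reader_kwargs(
            filenames, sequence_length, training, shard_id, num_shards))

    # Assumes one process per GPU, with the shard index equal to the device index.
    return video_pipe(batch_size=batch_size, num_threads=num_threads,
                      device_id=shard_id)
```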

Step 3: Apply Spatial Augmentations

Apply spatial transformations to the frame sequences. Use fn.crop with random anchor positions for spatial cropping. All frames in a sequence receive the same spatial transformation to maintain temporal consistency.

Key considerations:

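Generate normalized crop anchors (0.0 to 1.0) with fn.random.uniform and pass them to fn.crop as crop_pos_x and crop_pos_y; each value is drawn once per sequence, so every frame in that sequence shares the same crop window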
The crop size should match the model's expected spatial input dimensions
Ensure consistent augmentation across all frames in a temporal sequence
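
The random crop can be sketched as a small graph fragment. The function names are hypothetical. The plain-Python helper only illustrates how a normalized anchor maps to a pixel offset, following the documented relation crop_x = pos_x * (W - crop_W); the rounding it applies is an assumption of this sketch rather than a statement about DALI's internals.

```python
def crop_origin(pos_x, pos_y, in_h, in_w, crop_h, crop_w):
    """Top-left (y, x) corner of the crop window for normalized anchors in [0, 1]."""
    # Rounding to whole pixels is an assumption made for illustration.
    return round(pos_y * (in_h - crop_h)), round(pos_x * (in_w - crop_w))


def random_crop_sequence(frames, crop_h, crop_w):
    """DALI graph fragment; call it inside a @pipeline_def function."""
    from nvidia.dali import fn  # lazy import, an assumption of this sketch

    # fn.random.uniform yields one scalar per sample, and a sample here is a
    # whole sequence, so all of its frames receive the same crop window.
    pos_x = fn.random.uniform(range=(0.0, 1.0))
    pos_y = fn.random.uniform(range=(0.0, 1.0))
    return fn.crop(frames, crop=(crop_h, crop_w),
                   crop_pos_x=pos_x, crop_pos_y=pos_y)
```

For a 540 x 960 frame and a 256 x 256 crop, anchors of 0.5 place the window's corner at row 142 and column 352.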

Step 4: Transpose to Model Layout

Reorder tensor dimensions from the reader's default layout (frames x height x width x channels per sample, with the batch dimension in front once batches reach the framework) to the layout expected by the model. Many video models expect batches ordered as (batch x channels x frames x height x width).

Key considerations:

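Use fn.transpose to reorder dimensions; the perm argument is specified over the per-sample axes and does not include the batch dimension (for example, perm=[3, 0, 1, 2] turns FHWC into CFHW)
The target permutation depends on the model framework and architecture conventions
Expect fn.transpose to write a new contiguous tensor on the GPU rather than return a strided view, so apply it after cropping, when the sequences are smallest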
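
The permutation can be derived from layout strings rather than written by hand. The sketch below uses hypothetical helper names and assumes each axis letter appears once per layout; output axis i of the transposed tensor takes input axis perm[i].

```python
def layout_perm(src, dst):
    """Permutation that reorders axes from layout string src to layout dst."""
    if sorted(src) != sorted(dst):
        raise ValueError(f"{src!r} and {dst!r} must name the same axes")
    return [src.index(axis) for axis in dst]


def permuted_shape(shape, perm):
    """Shape after transposition: output axis i takes input axis perm[i]."""
    return tuple(shape[p] for p in perm)


def to_model_layout(frames, src="FHWC", dst="CFHW"):
    """DALI graph fragment; call it inside a @pipeline_def function."""
    from nvidia.dali import fn  # lazy import, an assumption of this sketch

    # The permutation covers per-sample axes only; the batch axis is implicit.
    return fn.transpose(frames, perm=layout_perm(src, dst))
```

With these helpers, a five-frame 256 x 256 RGB sequence of shape (5, 256, 256, 3) becomes (3, 5, 256, 256) per sample, which the framework sees as (batch, 3, 5, 256, 256).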

Step 5: Create Iterator and Train

Build the pipeline and wrap it in a DALIGenericIterator to expose batches as PyTorch tensors. Iterate through the data in the standard training loop.

Key considerations:

Use DALIGenericIterator with named outputs matching the pipeline return values
Pass reader_name together with last_batch_policy=LastBatchPolicy.PARTIAL so that the final, smaller batch is returned and each epoch ends at the correct sample
Data arrives on GPU ready for model consumption
Query the epoch size of the named reader (for example, with the pipeline's epoch_size method) to derive the number of steps per epoch for learning rate scheduling
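
A training loop built on these points might be sketched as below. The output name "data", the reader name "Reader", and the compute_loss callable are hypothetical placeholders; the model, optimizer, and loss are assumed to be ordinary PyTorch objects supplied by the caller. DALI is imported lazily so that the step-count helper can be used on its own.

```python
import math


def steps_per_epoch(num_sequences, batch_size, keep_partial=True):
    """Batches per epoch; a trailing short batch counts only if it is kept."""
    if keep_partial:
        return math.ceil(num_sequences / batch_size)
    return num_sequences // batch_size


def make_loader(pipe):
    """Wrap a DALI pipeline whose reader is named "Reader" (an assumption)."""
    from nvidia.dali.plugin.pytorch import DALIGenericIterator
    from nvidia.dali.plugin.base_iterator import LastBatchPolicy

    return DALIGenericIterator(
        pipe, ["data"],                 # one name per pipeline output
        reader_name="Reader",           # epoch size is taken from this reader
        last_batch_policy=LastBatchPolicy.PARTIAL,
        auto_reset=True,                # rewind automatically after each epoch
    )


def train(pipe, model, optimizer, compute_loss, epochs):
    """Plain training loop; compute_loss(model, clips) returns a scalar loss."""
    loader = make_loader(pipe)
    for _ in range(epochs):
        for batch in loader:
            clips = batch[0]["data"]    # torch.Tensor already on the GPU
            loss = compute_loss(model, clips)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

On a single GPU, steps_per_epoch(pipe.epoch_size("Reader"), batch_size) gives the step count that a learning rate scheduler would need.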
