Workflow:NVIDIA DALI Video Processing Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Video_Processing, Deep_Learning, PyTorch |
| Last Updated | 2026-02-08 17:00 GMT |
Overview
End-to-end process for GPU-accelerated video frame sequence loading and preprocessing using NVIDIA DALI pipelines for video-based deep learning models.
Description
This workflow defines the procedure for loading video data (for example, H.264 streams in MP4 containers) using DALI's hardware-accelerated video reader, extracting temporal frame sequences, and preprocessing them for models that consume multi-frame inputs such as video super-resolution networks, temporal shift modules, or action recognition models. The DALI video reader uses NVDEC hardware to decode video frames directly on the GPU, so decoded pixel data never has to be copied from host to device memory. Frame sequences are then cropped and transposed to match model input expectations.
Usage
Execute this workflow when you need to train or evaluate a video-based deep learning model and data loading is a bottleneck worth accelerating. It is appropriate when working with video files encoded with H.264, VP9, or HEVC (typically stored in containers such as MP4) and with models that require sequences of consecutive frames as input, such as video super-resolution, action recognition, or temporal segmentation models.
Execution Steps
Step 1: Prepare Video Data
Organize video files into the expected directory structure. For training, split source videos into individual scene clips and optionally transcode them to the target resolution and codec. Create separate directories for training and validation sets.
Key considerations:
- Videos should be encoded with a codec supported by DALI's video reader (H.264, VP9, HEVC)
- Scene splitting ensures clips contain continuous temporal content
- Transcoding to a target resolution reduces decode overhead if the source resolution is much larger than needed
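The directory preparation can be sketched as follows. This is a minimal sketch using only the Python standard library. The function name `split_clips`, the `train`/`val` directory names, the `*.mp4` pattern, and the default validation fraction are all assumptions made for illustration; the workflow itself does not prescribe them. Scene splitting and transcoding are assumed to have been done beforehand with an external tool.

```python
import random
import shutil
from pathlib import Path


def split_clips(src_dir, dst_dir, val_fraction=0.1, seed=0, pattern="*.mp4"):
    """Copy clips from src_dir into dst_dir/train and dst_dir/val.

    Hypothetical helper: names, layout, and defaults are illustrative only.
    Returns the number of clips placed in each split.
    """
    clips = sorted(Path(src_dir).glob(pattern))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(clips)

    num_val = int(len(clips) * val_fraction)
    splits = {"val": clips[:num_val], "train": clips[num_val:]}

    for split_name, files in splits.items():
        out_dir = Path(dst_dir) / split_name
        out_dir.mkdir(parents=True, exist_ok=True)
        for clip in files:
            shutil.copy2(clip, out_dir / clip.name)

    return {name: len(files) for name, files in splits.items()}
```

Splitting at the clip level rather than the frame level keeps every frame of a scene in the same set, which avoids near-duplicate frames leaking from training into validation.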
Step 2: Define the DALI Video Pipeline
Create a pipeline using @pipeline_def that uses fn.readers.video to read sequences of consecutive frames from video files. The video reader extracts fixed-length frame subsequences from each video. Each sample is a 4D tensor (frames x height x width x channels), so a batch is five-dimensional.
Key considerations:
- Configure sequence_length to match the number of input frames required by the model
- Set device="gpu"; fn.readers.video decodes through NVDEC and offers no CPU backend
- Set random_shuffle=True for training, and for multi-GPU runs set shard_id and num_shards so that each GPU reads a distinct shard of the data
- The reader returns batches of frame sequences, not individual frames
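A pipeline along these lines might look like the sketch below. The `VideoReaderConfig` class, the `build_video_pipeline` function, the reader name `"Reader"`, and all default values are assumptions introduced here for illustration. The DALI import is placed inside the function so the configuration part can be inspected on a machine without DALI or a GPU.

```python
from dataclasses import asdict, dataclass


@dataclass
class VideoReaderConfig:
    """Hypothetical container for the fn.readers.video arguments used below."""
    sequence_length: int = 5      # frames per sample, set to what the model needs
    random_shuffle: bool = True   # True for training, False for evaluation
    shard_id: int = 0             # index of this process's shard
    num_shards: int = 1           # total number of shards (usually GPUs)
    name: str = "Reader"          # lets an iterator look the reader up later

    def reader_kwargs(self):
        # The video reader runs on the GPU backend.
        return {"device": "gpu", **asdict(self)}


def build_video_pipeline(filenames, cfg, batch_size=4, num_threads=2, device_id=0):
    """Return a DALI pipeline that yields batches of frame sequences."""
    # Imported lazily: requires an NVIDIA DALI installation.
    from nvidia.dali import fn, pipeline_def

    @pipeline_def
    def video_pipe():
        # Each sample has shape (frames, height, width, channels).
        frames = fn.readers.video(filenames=filenames, **cfg.reader_kwargs())
        return frames

    return video_pipe(batch_size=batch_size, num_threads=num_threads,
                      device_id=device_id)
```

In a multi-GPU job, each process would typically pass its own rank as `shard_id` and `device_id`, with `num_shards` set to the number of processes.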
Step 3: Apply Spatial Augmentations
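Apply spatial transformations to the frame sequences. Use fn.crop with random anchor positions (its crop_pos_x and crop_pos_y arguments) for spatial cropping. Because the anchors are drawn once per sample and a sample is a whole sequence, all frames in a sequence receive the same crop window, which maintains temporal consistency.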
Key considerations:
- Generate normalized crop positions in the range 0 to 1 using fn.random.uniform and apply them uniformly across the sequence
- The crop size should match the model's expected spatial input dimensions
- Ensure consistent augmentation across all frames in a temporal sequence
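The random crop described above can be sketched as follows. The helper names `crop_origin` and `random_crop` are hypothetical. `crop_origin` only illustrates how a normalized anchor maps to a pixel offset, following the formula in the DALI documentation for `crop_pos_x` (offset = position x (frame size - crop size)); it ignores how DALI rounds the result. `random_crop` is meant to be called inside a pipeline definition.

```python
def crop_origin(pos_x, pos_y, frame_hw, crop_hw):
    """Map normalized anchors in [0, 1] to the crop window's (top, left) offset.

    Illustrative only: rounding to whole pixels is left out.
    """
    frame_h, frame_w = frame_hw
    crop_h, crop_w = crop_hw
    return pos_y * (frame_h - crop_h), pos_x * (frame_w - crop_w)


def random_crop(frames, crop_hw):
    """Crop every frame of each sequence with one random window per sequence.

    Must be called while a DALI pipeline is being defined.
    """
    # Imported lazily: requires an NVIDIA DALI installation.
    from nvidia.dali import fn

    # One scalar per sample, so all frames of a sequence share the anchor.
    pos_x = fn.random.uniform(range=(0.0, 1.0))
    pos_y = fn.random.uniform(range=(0.0, 1.0))
    # crop is given as (crop_height, crop_width).
    return fn.crop(frames, crop=crop_hw, crop_pos_x=pos_x, crop_pos_y=pos_y)
```

For a 720 x 1280 frame and a 256 x 256 crop, anchors of 0.5 place the window at the center: `crop_origin(0.5, 0.5, (720, 1280), (256, 256))` gives a top offset of 232 and a left offset of 512.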
Step 4: Transpose to Model Layout
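Reorder tensor dimensions from the default DALI layout (batch x frames x height x width x channels) to the layout expected by the model. Models built on 3D convolutions in PyTorch expect (batch x channels x frames x height x width) ordering, while some architectures instead expect (batch x frames x channels x height x width). The permutation is specified per sample, so it covers only the four non-batch dimensions.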
Key considerations:
- Use fn.transpose to reorder dimensions
- The target permutation depends on the model framework and architecture conventions
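- fn.transpose rearranges the data in memory rather than only relabeling dimensions, so the output is a dense tensor in the new layout; on the GPU the cost is usually small compared with decoding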
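The permutation can be derived from layout strings instead of being written by hand, as in the sketch below. The helper names `layout_permutation`, `permute_shape`, and `to_model_layout` are hypothetical, and the single-letter axis names (F, H, W, C) are assumed to follow the usual DALI layout convention. The permutation uses the same semantics as NumPy's transpose: output axis i is taken from input axis perm[i].

```python
def layout_permutation(src_layout, dst_layout):
    """Return perm such that output axis i comes from input axis perm[i]."""
    if sorted(src_layout) != sorted(dst_layout):
        raise ValueError(f"{src_layout!r} and {dst_layout!r} name different axes")
    return [src_layout.index(axis) for axis in dst_layout]


def permute_shape(shape, perm):
    """Shape of a tensor after its axes are reordered by perm."""
    return tuple(shape[i] for i in perm)


def to_model_layout(frames, src_layout="FHWC", dst_layout="CFHW"):
    """Transpose each sample; call while a DALI pipeline is being defined."""
    # Imported lazily: requires an NVIDIA DALI installation.
    from nvidia.dali import fn

    return fn.transpose(frames, perm=layout_permutation(src_layout, dst_layout))
```

With these helpers, `layout_permutation("FHWC", "CFHW")` yields `[3, 0, 1, 2]`, and a sample of shape (5, 256, 256, 3) becomes (3, 5, 256, 256).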
Step 5: Create Iterator and Train
Build the pipeline and wrap it in a DALIGenericIterator to expose batches as PyTorch tensors. Iterate through the data in the standard training loop.
Key considerations:
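- Use DALIGenericIterator with an output_map whose names correspond, in order, to the pipeline return values
- Pass last_batch_policy=LastBatchPolicy.PARTIAL together with the reader's name so that the final, smaller batch is returned as-is and the epoch ends at the correct sample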
- Data arrives on GPU ready for model consumption
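- Obtain the epoch size from the named reader (for example, through the pipeline's epoch_size method) and derive the number of steps per epoch from it for learning rate scheduling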
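The iterator setup and training loop can be sketched as follows. The function names `steps_per_epoch`, `make_iterator`, and `run_epoch`, the output name `"frames"`, the reader name `"Reader"`, and the `train_step` callback are assumptions introduced for illustration. The reader name must match the `name` given to the video reader in the pipeline. `steps_per_epoch` is plain arithmetic on the number of samples in one shard and does not query DALI.

```python
import math


def steps_per_epoch(shard_size, batch_size, drop_last=False):
    """Number of batches per epoch for a shard of shard_size samples.

    With a partial last batch the count is rounded up; with drop_last it is
    rounded down.
    """
    if drop_last:
        return shard_size // batch_size
    return math.ceil(shard_size / batch_size)


def make_iterator(pipe, reader_name="Reader"):
    """Wrap a DALI pipeline so that it yields PyTorch tensors."""
    # Imported lazily: requires NVIDIA DALI and PyTorch.
    from nvidia.dali.plugin.base_iterator import LastBatchPolicy
    from nvidia.dali.plugin.pytorch import DALIGenericIterator

    return DALIGenericIterator(
        [pipe],
        output_map=["frames"],        # one name per pipeline output
        reader_name=reader_name,      # lets the iterator find the epoch size
        last_batch_policy=LastBatchPolicy.PARTIAL,
        auto_reset=True,              # ready for the next epoch after exhaustion
    )


def run_epoch(iterator, train_step):
    """Feed every batch of one epoch to train_step (a user-supplied callable)."""
    num_batches = 0
    for batch in iterator:
        # One dict per pipeline; the tensor already lives on the GPU.
        frames = batch[0]["frames"]
        train_step(frames)
        num_batches += 1
    return num_batches
```

A scheduler that needs the total number of optimizer steps could use `steps_per_epoch(shard_size, batch_size) * num_epochs`, where `shard_size` is the per-GPU share of the epoch size reported by the reader.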