Principle:NVIDIA DALI Video Spatial Crop

Knowledge Sources	NVIDIA DALI Documentation
Domains	Video_Processing, GPU_Computing, Data_Augmentation
Last Updated	2026-02-08 00:00 GMT

Overview

Video spatial cropping is the process of extracting a fixed-size spatial region from all frames of a video sequence, applying the same random crop position uniformly across the temporal dimension to preserve spatial coherence.

Description

Video spatial cropping extracts a rectangular sub-region from video frames, reducing the spatial extent of each frame to a specified target size while retaining all frames in the temporal sequence. The critical distinction from image cropping is that the same spatial crop coordinates must be applied to every frame in the sequence. If a different random crop position were used for each frame, the resulting sequence would exhibit artificial spatial jitter, corrupting the temporal coherence that video models rely on.
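The difference between a shared crop offset and per-frame offsets can be illustrated without DALI. The following is a minimal pure-Python sketch, not the library's implementation; the helper names `crop_frame` and `crop_sequence` are hypothetical, and frames are modeled as nested lists for simplicity.

```python
import random

def crop_frame(frame, top, left, height, width):
    """Return the height x width window of a 2D frame starting at (top, left)."""
    return [row[left:left + width] for row in frame[top:top + height]]

def crop_sequence(frames, crop_h, crop_w, rng):
    """Crop every frame with ONE offset that is drawn once per sequence."""
    frame_h, frame_w = len(frames[0]), len(frames[0][0])
    top = rng.randint(0, frame_h - crop_h)
    left = rng.randint(0, frame_w - crop_w)
    cropped = [crop_frame(f, top, left, crop_h, crop_w) for f in frames]
    return cropped, (top, left)

# A static 6x8 "video": three identical frames whose pixel value encodes position.
frame = [[10 * y + x for x in range(8)] for y in range(6)]
video = [frame, frame, frame]

cropped, offset = crop_sequence(video, crop_h=4, crop_w=4, rng=random.Random(0))
```

Because the scene is static and the offset is shared, every cropped frame is identical. Drawing a new offset per frame would make the same static scene appear to jump between frames.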

In the DALI pipeline, the crop operation achieves two simultaneous objectives:

Spatial augmentation: By randomizing the crop position (via uniform random sampling along both the X and Y axes, normalized to the [0.0, 1.0] range), the pipeline exposes the model to different spatial regions of the same source video across training epochs. This reduces the risk of the model overfitting to specific spatial locations within the training videos.

Resolution standardization: Video files in the dataset may have slightly varying resolutions. The crop operation brings all sequences to a fixed spatial dimension (e.g., 256x256), ensuring that every sample in a batch has identical spatial dimensions, which is a requirement for batched tensor operations.
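Both objectives can be sketched together in plain Python. The sketch below assumes that a normalized position scales the slack between frame size and crop size, and that the result is rounded to an integer; the helper `sample_crop_window` and the listed resolutions are hypothetical.

```python
import random

def sample_crop_window(frame_h, frame_w, crop_h, crop_w, rng):
    """Draw normalized positions in [0, 1] and convert them to a pixel window."""
    pos_x = rng.uniform(0.0, 1.0)
    pos_y = rng.uniform(0.0, 1.0)
    # Assumption: the position scales the slack left over after the crop.
    left = round(pos_x * (frame_w - crop_w))
    top = round(pos_y * (frame_h - crop_h))
    return top, left, top + crop_h, left + crop_w

rng = random.Random(42)
resolutions = [(720, 1280), (540, 960), (480, 854)]  # hypothetical (H, W) sizes
windows = [sample_crop_window(h, w, 256, 256, rng) for h, w in resolutions]
```

Each window lands at a different random location, yet every window is exactly 256x256 and lies inside its source frame, so the resulting samples can be stacked into one batch.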

Additionally, the crop operation performs a type conversion: the input UINT8 pixel values (range [0, 255]) are cast to FLOAT (range [0.0, 255.0]) during the crop. The values themselves are not rescaled; the cast only prepares the data for neural network consumption, where floating-point arithmetic is required.
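The cast changes the data type but not the value range. A minimal sketch of that behavior, using a hypothetical helper `cast_to_float` on a nested-list frame:

```python
def cast_to_float(frame_u8):
    """Convert 8-bit pixel values to floats without rescaling them."""
    return [[float(v) for v in row] for row in frame_u8]

frame_u8 = [[0, 127, 255], [64, 128, 192]]
frame_float = cast_to_float(frame_u8)
```

After the cast, the maximum value is still 255.0. Any scaling to another range would be a separate step.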

Usage

Use spatial cropping on video sequences when:

The model requires fixed-size spatial input but the source videos have varying or larger spatial dimensions
Random spatial augmentation is desired to improve generalization
Type conversion from integer pixel values to floating-point is needed
Temporal coherence must be maintained (the same crop position across all frames)

This operation is typically placed immediately after the video reader and before any layout transposition in the DALI pipeline.
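A minimal sketch of this placement is shown below, assuming a GPU video reader that yields sequences in FHWC layout. The function name `build_video_crop_pipeline`, the default sizes, and the transposition to CFHW are illustrative assumptions rather than requirements of this principle. DALI is imported lazily so the sketch can be loaded on machines without it; building and running the pipeline requires a DALI installation and a GPU.

```python
def build_video_crop_pipeline(filenames, sequence_length=8, crop_size=(256, 256),
                              batch_size=4, num_threads=2, device_id=0):
    """Reader -> random crop (with float cast) -> layout transposition."""
    from nvidia.dali import pipeline_def, fn, types

    @pipeline_def
    def video_crop_pipeline():
        # Each sample is one sequence of frames (assumed layout: FHWC, uint8).
        frames = fn.readers.video(
            device="gpu",
            filenames=filenames,
            sequence_length=sequence_length,
            random_shuffle=True,
            name="Reader",
        )
        # One random position is drawn per sample, so all frames of a
        # sequence share the same crop window.
        cropped = fn.crop(
            frames,
            crop=crop_size,
            crop_pos_x=fn.random.uniform(range=(0.0, 1.0)),
            crop_pos_y=fn.random.uniform(range=(0.0, 1.0)),
            dtype=types.FLOAT,
        )
        # Assumed downstream step: FHWC -> CFHW for channel-first models.
        return fn.transpose(cropped, perm=[3, 0, 1, 2])

    return video_crop_pipeline(
        batch_size=batch_size, num_threads=num_threads, device_id=device_id
    )
```

A caller would typically invoke `build_video_crop_pipeline(["clip.mp4"])` with its own file list (the file name here is a placeholder), then call `build()` and `run()` on the returned pipeline.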

Theoretical Basis

Random spatial cropping is a form of data augmentation grounded in the assumption that visual features learned by the model should be translation-invariant. By presenting random spatial windows of the same video during different training iterations, the model is forced to learn features that generalize across spatial positions rather than memorizing position-specific patterns.

The requirement for temporally consistent cropping (same crop position for all frames in a sequence) is rooted in the physical consistency of video. In real-world video, objects do not teleport between frames; they move smoothly through space. Applying different random crops to each frame would violate this physical prior and generate training examples that do not resemble any real video, potentially degrading model performance.

The use of normalized crop coordinates (0.0 to 1.0 rather than absolute pixel positions) provides resolution-independent augmentation. A given crop position parameter selects the same relative location within the frame regardless of the input resolution (for example, 0.5 on both axes centers the crop window), which is important when training on multi-resolution data.
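A short worked example makes the resolution independence concrete. It assumes the normalized position is multiplied by the difference between frame size and crop size and then rounded; the helper name `crop_anchor` is hypothetical.

```python
def crop_anchor(pos, frame_size, crop_size):
    """Map a normalized position in [0, 1] to a pixel offset along one axis."""
    return round(pos * (frame_size - crop_size))

# The same normalized position applied to two frame widths.
anchor_1280 = crop_anchor(0.5, 1280, 256)
anchor_1920 = crop_anchor(0.5, 1920, 256)

# Window centers in pixels: anchor plus half the crop size.
center_1280 = anchor_1280 + 256 // 2
center_1920 = anchor_1920 + 256 // 2
```

With a position of 0.5, the window center falls on the middle column of each frame (640 of 1280, and 960 of 1920), even though the absolute pixel offsets differ. A position of 0.0 pins the window to the left edge and 1.0 to the right edge.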

Related Pages

Implemented By

Implementation:NVIDIA_DALI_Fn_Crop
