Heuristic: NVIDIA NeMo Curator Video Frame Sampling Strategy
| Knowledge Sources | |
|---|---|
| Domains | Video_Processing, Optimization, Computer_Vision |
| Last Updated | 2026-02-14 16:45 GMT |
Overview
Use 2 FPS sampling with 128-frame windows and a 28-pixel alignment factor to balance video processing quality against memory and compute costs.
Description
NeMo Curator defines a set of video and image processing constants that control frame extraction, window sizing, and pixel constraints for vision-language models. These constants are calibrated for Qwen VL and similar models that require specific input dimension alignment. The 28-pixel factor comes from patch-based vision transformers where patches are typically 14x14 or 28x28 pixels.
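To make the alignment concrete, here is a minimal sketch of snapping a dimension to the 28-pixel factor; the helper name and rounding strategy are illustrative assumptions, not NeMo Curator's actual resize code:

```python
IMAGE_FACTOR = 28  # patch-alignment factor from the constants below

def round_to_factor(value: int, factor: int = IMAGE_FACTOR) -> int:
    """Round a dimension to the nearest multiple of the alignment factor.

    Hypothetical helper for illustration; the library's own resize
    logic may clamp differently.
    """
    return max(factor, round(value / factor) * factor)

# A 1920x1080 frame snaps to 1932x1092, so both sides divide evenly
# into 28-pixel patches.
print(round_to_factor(1920), round_to_factor(1080))
```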
Usage
Apply these values when configuring video curation pipeline stages (frame extraction, embedding, captioning). The defaults are optimized for the models bundled with NeMo Curator. Only adjust if using a custom model with different input requirements.
The Insight (Rule of Thumb)
- Frame Sampling:
- FPS: 2.0 frames per second (default sampling rate for embeddings)
- Min frames: 4 (minimum after FPS sampling; videos shorter than this are skipped)
- Max frames: 768 (cap per clip to bound memory)
- Window Sizing:
- Window size: 128 frames per window
- Remainder threshold: 64 frames (if leftover frames >= 64, create a new window; otherwise merge with last window)
- Min window frames: 4 (windows smaller than this are discarded)
- Pixel Constraints (28-pixel alignment):
- Image: Min 3,136 pixels (4 x 28^2), Max 12,845,056 pixels (16384 x 28^2)
- Video frame: Min 100,352 pixels (128 x 28^2), Max 602,112 pixels (768 x 28^2)
- Video total: 24576 x 28^2 = 19,267,584 aggregate pixel budget across all frames
- Max aspect ratio: 200 (prevents extreme distortion)
- Trade-off: Higher FPS and larger windows improve quality but increase memory and compute. Lower FPS and smaller windows are faster but may miss visual details.
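The aggregate budget above implies that per-frame resolution shrinks as frame count grows. A hedged sketch of that relationship (the clamping pattern is an assumption for illustration, not necessarily NeMo Curator's exact formula):

```python
VIDEO_MIN_PIXELS = 128 * 28 * 28      # 100,352
VIDEO_MAX_PIXELS = 768 * 28 * 28      # 602,112
VIDEO_TOTAL_PIXELS = 24576 * 28 * 28  # 19,267,584

def per_frame_budget(num_frames: int) -> int:
    """Clamp each frame's pixel allowance so the whole clip stays
    within the aggregate budget (illustrative sketch)."""
    share = VIDEO_TOTAL_PIXELS // num_frames
    return max(VIDEO_MIN_PIXELS, min(VIDEO_MAX_PIXELS, share))

# At 32 frames or fewer, each frame can use the full 602,112-pixel cap;
# at 128 frames (one full window) the share drops to 150,528 pixels.
print(per_frame_budget(32), per_frame_budget(128))
```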
Reasoning
The constants are defined in `nemo_curator/utils/windowing_utils.py:175-189`:
```python
IMAGE_FACTOR = 28
MIN_PIXELS = 4 * 28 * 28              # 3,136
MAX_PIXELS = 16384 * 28 * 28          # 12,845,056
MAX_RATIO = 200
VIDEO_MIN_PIXELS = 128 * 28 * 28      # 100,352
VIDEO_MAX_PIXELS = 768 * 28 * 28      # 602,112
VIDEO_TOTAL_PIXELS = 24576 * 28 * 28  # 19,267,584
FRAME_FACTOR = 2
FPS = 2.0
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768
OPENAI_CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711]
```
The windowing algorithm from `nemo_curator/utils/windowing_utils.py:41-76` follows this logic:
- If `total_frames < 4`: return empty (insufficient data for a meaningful window)
- If `total_frames <= window_size`: return single window covering the entire video
- Otherwise: create full windows of `window_size=128`, and if remainder >= `remainder_threshold=64`, create a new window; otherwise merge remainder with the last window
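The three-case logic above can be sketched as follows; the function name, signature, and (start, end) return format are assumptions for illustration, not the library's actual API:

```python
def make_windows(total_frames: int,
                 window_size: int = 128,
                 remainder_threshold: int = 64,
                 min_window_frames: int = 4) -> list[tuple[int, int]]:
    """Partition a clip into frame windows per the described logic.

    Illustrative sketch of the algorithm, returning (start, end)
    frame ranges; not the actual windowing_utils implementation.
    """
    if total_frames < min_window_frames:
        return []  # too short for a meaningful window
    if total_frames <= window_size:
        return [(0, total_frames)]  # one window covers the whole clip
    # Full windows of window_size frames.
    windows = [(s, s + window_size)
               for s in range(0, total_frames - window_size + 1, window_size)]
    remainder = total_frames - windows[-1][1]
    if remainder >= remainder_threshold:
        windows.append((windows[-1][1], total_frames))  # new window
    elif remainder > 0:
        start, _ = windows.pop()
        windows.append((start, total_frames))  # merge into last window
    return windows

# A 300-frame clip: two full windows would leave 44 frames (< 64),
# so the remainder merges into the second window.
print(make_windows(300))
```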
The 2 FPS sampling rate was chosen as a balance: fast enough to capture scene changes, slow enough to keep frame counts manageable. OpenAI CLIP normalization values are used for image preprocessing before embedding generation.
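CLIP normalization is the standard scale-then-standardize step; a minimal sketch using the constants above (the function itself is illustrative, and the pipeline's exact preprocessing code may differ):

```python
import numpy as np

OPENAI_CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
OPENAI_CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def normalize_frame(frame_rgb: np.ndarray) -> np.ndarray:
    """Apply CLIP normalization to an HxWx3 uint8 RGB frame:
    scale to [0, 1], then standardize per channel."""
    scaled = frame_rgb.astype(np.float32) / 255.0
    return (scaled - OPENAI_CLIP_MEAN) / OPENAI_CLIP_STD

frame = np.zeros((2, 2, 3), dtype=np.uint8)  # an all-black frame
out = normalize_frame(frame)
```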