Principle:NVIDIA NeMo Curator Video Embedding
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, Video_Processing, Representation_Learning |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A technique for computing dense vector representations of video clips with the Cosmos-Embed1 model, for use in semantic similarity, retrieval, and deduplication.
Description
Video Embedding converts video clips into fixed-dimensional vector representations that capture semantic content. The Cosmos-Embed1 model processes extracted frames at configurable resolutions (224p, 336p, 448p) and produces embeddings suitable for semantic deduplication, nearest-neighbor search, and text-video alignment verification.
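As an illustration of nearest-neighbor search over such embeddings, the following is a minimal sketch that ranks a small corpus by cosine similarity to a query vector. The function names (`l2_normalize`, `top_k_neighbors`) and the toy 3-dimensional vectors are assumptions made for this example. They are not part of NeMo Curator or Cosmos-Embed1, and real embeddings have far more dimensions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper for this sketch)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)

def top_k_neighbors(query, corpus, k=2):
    """Return (index, cosine similarity) for the k corpus vectors closest to query."""
    q = l2_normalize(query)
    scored = []
    for idx, emb in enumerate(corpus):
        e = l2_normalize(emb)
        # The dot product of unit vectors equals their cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, e)), idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]

# Toy stand-ins for clip embeddings; a real pipeline would use model outputs.
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
neighbors = top_k_neighbors([1.0, 0.0, 0.0], corpus, k=2)
```

A brute-force scan like this is only practical for small collections; larger ones would typically use an approximate nearest-neighbor index instead.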
Usage
Apply this step after frame extraction to compute embeddings for semantic deduplication or retrieval. Choose the resolution variant based on GPU memory constraints: higher-resolution variants process larger inputs and generally need more memory.
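To make the deduplication use case concrete, here is a toy sketch that greedily drops any clip whose embedding is too similar to one already kept. This is a simplified stand-in and not the NeMo Curator deduplication implementation. The function names and the `threshold` value of 0.95 are assumptions chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def greedy_dedup(embeddings, threshold=0.95):
    """Return indices of clips to keep, skipping near-duplicates of kept clips."""
    kept = []
    for idx, emb in enumerate(embeddings):
        # Keep a clip only if it is below the threshold against every kept clip.
        if all(cosine_similarity(emb, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept

# Toy embeddings: the second clip is nearly identical to the first.
clips = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept_indices = greedy_dedup(clips)
```

The pairwise comparison is quadratic in the number of clips, so it only suits small batches; the threshold trades off how aggressively near-duplicates are removed.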
Theoretical Basis
- Extract frames at the target FPS and resize them to the model's input resolution
- Pass the frames through the Cosmos-Embed1 encoder to produce one embedding vector per clip
- Optionally compute text-video similarity scores to verify that a caption matches its clip
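The first and last steps above can be sketched in plain Python. The encoder call itself is omitted because its API is not described here. The helper names, the index-based sampling strategy, and the `min_similarity` default of 0.3 are all assumptions for illustration, not values taken from NeMo Curator or Cosmos-Embed1.

```python
import math

def sample_frame_indices(num_frames, source_fps, target_fps):
    """Pick source-frame indices that approximate target_fps (hypothetical helper)."""
    if num_frames <= 0 or source_fps <= 0 or target_fps <= 0:
        return []
    # Step is measured in source frames; never sample faster than the source.
    step = max(1.0, source_fps / target_fps)
    indices = []
    i = 0
    while int(i * step) < num_frames:
        indices.append(int(i * step))
        i += 1
    return indices

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def caption_matches_clip(video_emb, text_emb, min_similarity=0.3):
    """Flag whether a text embedding agrees with a clip embedding.

    Assumes both vectors come from the same joint embedding space;
    min_similarity is an illustrative cutoff, not a recommended value.
    """
    return cosine_similarity(video_emb, text_emb) >= min_similarity

# A 3-second clip at 30 FPS sampled at 2 FPS yields every 15th frame.
indices = sample_frame_indices(num_frames=90, source_fps=30.0, target_fps=2.0)
```

In practice the similarity cutoff would need to be tuned on labeled caption-clip pairs, since score distributions depend on the model variant.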