Principle:NVIDIA NeMo Curator Video Embedding

Knowledge Sources	NeMo Curator
Domains	Data_Curation, Video_Processing, Representation_Learning
Last Updated	2026-02-14 17:00 GMT

Overview

Technique for computing dense vector representations of video clips using the Cosmos-Embed1 model for semantic similarity, retrieval, and deduplication.

Description

Video Embedding converts each video clip into a fixed-dimensional vector that captures its semantic content. The Cosmos-Embed1 model processes frames extracted from the clip and is available in three input-resolution variants (224p, 336p, 448p). The resulting embeddings are suitable for semantic deduplication, nearest-neighbor search, and text-video alignment verification.
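
To illustrate how such embeddings can support deduplication, the following is a minimal sketch that drops clips whose cosine similarity to an already kept clip exceeds a threshold. It is not NeMo Curator's deduplication algorithm. The function names, the greedy strategy, the 0.95 threshold, and the toy 3-dimensional vectors are all assumptions made for illustration; real clip embeddings would come from the Cosmos-Embed1 model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def greedy_dedup(embeddings, threshold=0.95):
    """Return the indices of clips to keep.

    A clip is dropped when its similarity to any previously kept clip
    is at or above `threshold` (hypothetical value, tune per dataset).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors standing in for real clip embeddings (assumption).
clips = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of clip 0
    [0.0, 1.0, 0.0],
]
kept_indices = greedy_dedup(clips, threshold=0.95)
print(kept_indices)
```

The greedy pass compares every clip against all kept clips, so it scales quadratically and is only practical for small collections; larger collections would typically use clustering or an approximate nearest-neighbor index instead.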

Usage

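Apply this technique after frame extraction to compute embeddings for semantic deduplication or retrieval. Choose the resolution variant based on GPU memory constraints: the higher-resolution variants process larger frames and generally require more memory.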

Theoretical Basis

1. Extract frames at the target FPS and resize them to the model input resolution.
2. Process the frames through the Cosmos-Embed1 encoder to produce one embedding vector per clip.
3. Optionally compute text-video similarity scores for verification.
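
The first and last steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the NeMo Curator implementation: `sample_frame_indices` and `text_video_score` are hypothetical helper names, the uniform-stride sampling is one possible way to reach a target FPS, and the similarity score is assumed to be cosine similarity between a text embedding and a clip embedding. Frame resizing and the encoder itself are not shown.

```python
import math


def sample_frame_indices(num_frames, source_fps, target_fps):
    """Pick source frame indices that approximate `target_fps`.

    Uses a uniform stride of source_fps / target_fps (hypothetical
    strategy); the stride is clamped to 1 so frames are never repeated.
    """
    if num_frames <= 0 or source_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames, source_fps and target_fps must be positive")
    step = max(source_fps / target_fps, 1.0)
    indices = []
    position = 0.0
    while int(position) < num_frames:
        indices.append(int(position))
        position += step
    return indices


def text_video_score(text_emb, video_emb):
    """Cosine similarity between a text embedding and a clip embedding.

    Returns 0.0 when either vector has zero length.
    """
    dot = sum(t * v for t, v in zip(text_emb, video_emb))
    norm_t = math.sqrt(sum(t * t for t in text_emb))
    norm_v = math.sqrt(sum(v * v for v in video_emb))
    if norm_t == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_t * norm_v)


# A 3-second clip recorded at 30 FPS, sampled down to 2 FPS.
indices = sample_frame_indices(num_frames=90, source_fps=30.0, target_fps=2.0)
print(indices)

# Toy 2-dimensional vectors standing in for real embeddings (assumption).
aligned = text_video_score([1.0, 0.0], [2.0, 0.0])
unrelated = text_video_score([1.0, 0.0], [0.0, 1.0])
print(aligned, unrelated)
```

A verification step could then flag clip-caption pairs whose score falls below a chosen cutoff; any such cutoff would need to be calibrated on the data at hand.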

Related Pages

Implemented By

Implementation:NVIDIA_NeMo_Curator_CosmosEmbed1EmbeddingStage
