Implementation:NVIDIA NeMo Curator CosmosEmbed1Model

From Leeroopedia
Knowledge Sources
Domains: Video Processing, Embeddings, Deep Learning
Last Updated: 2026-02-14 00:00 GMT

Overview

The CosmosEmbed1 class wraps NVIDIA's Cosmos-Embed1 multimodal embedding model for generating video and text embeddings used in video curation pipelines.

Description

CosmosEmbed1 implements the ModelInterface base class and provides a complete interface for the NVIDIA Cosmos-Embed1 multimodal embedding model. The model is available in three resolution variants: 224p, 336p, and 448p, each corresponding to a different HuggingFace model checkpoint (e.g., nvidia/Cosmos-Embed1-336p).

On setup, the model is loaded via AutoModel.from_pretrained with trust_remote_code=True onto CUDA in bfloat16 precision, alongside an AutoProcessor for input preprocessing. The class supports a utils_only mode where only the processor is initialized without loading the full model weights, which is useful for frame preprocessing on workers that do not need to run inference.
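The loading step described above can be sketched roughly as follows. This is a minimal illustration and not the class's actual setup code: the function name load_cosmos_embed1 and its weights_path parameter are hypothetical, and it assumes torch and transformers are installed and a CUDA device is available when the full model is requested.

```python
from __future__ import annotations


def load_cosmos_embed1(weights_path: str, utils_only: bool = False):
    """Return (model, processor); model is None when utils_only is True.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    # Imported lazily so the module can be imported without transformers.
    from transformers import AutoModel, AutoProcessor

    # The processor is needed in both modes for input preprocessing.
    processor = AutoProcessor.from_pretrained(weights_path, trust_remote_code=True)
    if utils_only:
        # Preprocessing-only workers skip loading the model weights.
        return None, processor

    import torch

    # Load the checkpoint, then move it to the GPU in bfloat16 precision.
    model = AutoModel.from_pretrained(weights_path, trust_remote_code=True)
    model = model.to("cuda", dtype=torch.bfloat16)
    model.eval()
    return model, processor
```

Returning the processor alone in utils_only mode mirrors the split described above: preprocessing workers pay only for the processor configuration, while inference workers also hold the weights on the GPU.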

Key capabilities include:

  • Frame formulation: Uniformly samples the target number of frames from a video clip and preprocesses them with the processor.
  • Video encoding: Produces video embeddings via encode_video_frames, returning float16 tensors on CPU.
  • Text encoding: Encodes a text string via get_text_embedding, returning a float16 tensor on CPU.
  • Evaluation: Computes cosine similarity between a video embedding and a list of text embeddings using a scaled dot product followed by softmax, returning top-k probabilities and indices.
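To make the frame formulation step concrete, here is one way uniform sampling could be done with a fixed stride. This is a sketch under stated assumptions: the helper uniform_sample_indices is hypothetical, and the stride-based strategy is an assumption about how "uniform" is realized, not a copy of the library's code.

```python
from __future__ import annotations


def uniform_sample_indices(num_available: int, num_target: int) -> list[int] | None:
    """Pick num_target evenly strided frame indices, or None if too few frames."""
    if num_target <= 0:
        raise ValueError("num_target must be positive")
    if num_available < num_target:
        # Mirrors the documented contract: insufficient frames yield None.
        return None
    # Integer stride so that num_target strided samples fit in the clip.
    step = num_available // num_target
    return list(range(0, num_available, step))[:num_target]


# Example: a 20-frame clip sampled down to 8 frames.
indices = uniform_sample_indices(20, 8)
```

The selected frames would then be stacked and passed to the processor. Returning None for short clips lets the caller skip them without raising.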

Weights are downloaded by the class methods download_weights_on_node and download_processor_config_on_node, which fetch model artifacts from the HuggingFace Hub at a revision pinned for each variant.

Usage

Use CosmosEmbed1 when you need to generate multimodal embeddings for video-text alignment tasks in the NeMo Curator video curation pipeline. It is the core model for embedding-based filtering, text-video matching, and semantic analysis of video content at scale.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/models/cosmos_embed1.py
  • Lines: 1-218

Signature

class CosmosEmbed1(ModelInterface):
    def __init__(
        self,
        *,
        variant: Literal["224p", "336p", "448p"] = "336p",
        utils_only: bool = False,
        model_dir: str | None = None,
    ) -> None: ...

    def setup(self) -> None: ...
    def get_target_num_frames(self) -> int: ...
    def formulate_input_frames(self, frames: list[npt.NDArray[np.uint8]]) -> npt.NDArray[np.float32] | None: ...
    def encode_video_frames(self, frames: npt.NDArray[np.float32]) -> torch.Tensor: ...
    def get_text_embedding(self, text: str) -> torch.Tensor: ...
    def evaluate(self, video_embd: torch.Tensor, text_embds: list[torch.Tensor]) -> tuple[list[float], list[int]]: ...

    @classmethod
    def download_weights_on_node(cls, model_dir: str, variant: Literal["224p", "336p", "448p"] = "336p") -> None: ...

    @classmethod
    def download_processor_config_on_node(cls, model_dir: str, variant: Literal["224p", "336p", "448p"] = "336p") -> None: ...

Import

from nemo_curator.models.cosmos_embed1 import CosmosEmbed1

I/O Contract

Inputs

Name Type Required Description
variant Literal["224p", "336p", "448p"] No (default: "336p") Resolution variant of the Cosmos-Embed1 model
utils_only bool No (default: False) If True, only initialize the processor without loading model weights
model_dir str | None No (default: None) Directory containing model weights; used to construct the full weights path
frames (formulate_input_frames) list[npt.NDArray[np.uint8]] Yes List of video frames as uint8 NumPy arrays
frames (encode_video_frames) npt.NDArray[np.float32] Yes Preprocessed video frames as float32 NumPy array
text (get_text_embedding) str Yes Input text string to encode
video_embd (evaluate) torch.Tensor Yes Video embedding tensor
text_embds (evaluate) list[torch.Tensor] Yes List of text embedding tensors for comparison

Outputs

Name Type Description
formulate_input_frames npt.NDArray[np.float32] | None Preprocessed input frames as a float32 NumPy array, or None if the clip has too few frames
encode_video_frames torch.Tensor Video embedding tensor in float16 on CPU, shape (batch, embed_dim)
get_text_embedding torch.Tensor Text embedding tensor in float16 on CPU
evaluate tuple[list[float], list[int]] Tuple of (top-k probabilities, top-k indices) from softmax similarity
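The scoring performed by evaluate (scaled dot product, softmax, top-k) can be illustrated with a small library-free sketch. The function softmax_top_k is hypothetical, and the scale factor of 100.0 and the default k are assumptions chosen for illustration; plain Python lists stand in for tensors so the example runs without torch. The dot product equals cosine similarity only when the embeddings are L2-normalized.

```python
from __future__ import annotations

import math


def softmax_top_k(
    video_embd: list[float],
    text_embds: list[list[float]],
    k: int = 5,
    scale: float = 100.0,  # assumed temperature, not taken from the source
) -> tuple[list[float], list[int]]:
    """Return (top-k probabilities, top-k indices) over the text embeddings."""
    if not text_embds:
        raise ValueError("text_embds must not be empty")
    # Scaled dot product between the video embedding and each text embedding.
    logits = [scale * sum(v * t for v, t in zip(video_embd, text)) for text in text_embds]
    # Numerically stable softmax: subtract the max logit before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [probs[i] for i in order], order


# Example with unit-length 2-D embeddings.
top_probs, top_indices = softmax_top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], k=2)
```

Because the softmax is taken over all candidate texts, the probabilities are relative: comparing a video against a single text always yields a probability of 1.0.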

Usage Examples

Basic Usage

from nemo_curator.models.cosmos_embed1 import CosmosEmbed1

# Download weights first
CosmosEmbed1.download_weights_on_node(model_dir="/path/to/models", variant="336p")

# Initialize and set up the model
model = CosmosEmbed1(variant="336p", model_dir="/path/to/models")
model.setup()

# Preprocess raw frames; video_frames_list is a list of uint8 frame arrays decoded elsewhere
input_frames = model.formulate_input_frames(video_frames_list)
if input_frames is not None:
    # Get the video embedding from the preprocessed frames
    video_embedding = model.encode_video_frames(input_frames)

    # Get text embedding
    text_embedding = model.get_text_embedding("a person walking on a beach")

    # Evaluate similarity (only when a video embedding was produced)
    probs, indices = model.evaluate(video_embedding, [text_embedding])

Utils-Only Mode

from nemo_curator.models.cosmos_embed1 import CosmosEmbed1

# Initialize processor only (no model weights loaded)
CosmosEmbed1.download_processor_config_on_node(model_dir="/path/to/models", variant="336p")
processor = CosmosEmbed1(variant="336p", utils_only=True, model_dir="/path/to/models")
processor.setup()

# Use processor for frame preprocessing
target_frames = processor.get_target_num_frames()
# Returns None if the clip has too few frames
input_frames = processor.formulate_input_frames(video_frames_list)

Model Variants

Variant HuggingFace Model ID Revision
224p nvidia/Cosmos-Embed1-224p 85f5627
336p nvidia/Cosmos-Embed1-336p 5d8309d
448p nvidia/Cosmos-Embed1-448p 9f4ff4d
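As a rough illustration of revision pinning, the table above can be turned into arguments for huggingface_hub.snapshot_download. This is a sketch, not the class's download code: the helper names and the <model_dir>/<model ID> directory layout are hypothetical, and whether the abbreviated revision hashes listed here are accepted as-is or need the full commit hash should be verified.

```python
from __future__ import annotations

import os

# Model IDs and revisions copied from the variants table.
VARIANTS = {
    "224p": ("nvidia/Cosmos-Embed1-224p", "85f5627"),
    "336p": ("nvidia/Cosmos-Embed1-336p", "5d8309d"),
    "448p": ("nvidia/Cosmos-Embed1-448p", "9f4ff4d"),
}


def snapshot_kwargs(model_dir: str, variant: str = "336p") -> dict[str, str]:
    """Build pinned download arguments for one variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    repo_id, revision = VARIANTS[variant]
    # Hypothetical layout: weights live under <model_dir>/<model ID>.
    local_dir = os.path.join(model_dir, repo_id)
    return {"repo_id": repo_id, "revision": revision, "local_dir": local_dir}


def download_variant(model_dir: str, variant: str = "336p") -> str:
    """Download the pinned snapshot and return its local path (needs network)."""
    # Imported lazily so the table and helper work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(**snapshot_kwargs(model_dir, variant))
```

Pinning a revision keeps every node on the same artifacts even if the upstream repository is updated later.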
