Implementation:NVIDIA NeMo Curator CosmosEmbed1Model

Knowledge Sources	NVIDIA NeMo Curator
Domains	Video Processing, Embeddings, Deep Learning
Last Updated	2026-02-14 00:00 GMT

Overview

The CosmosEmbed1 class wraps NVIDIA's Cosmos-Embed1 multimodal embedding model for generating video and text embeddings used in video curation pipelines.

Description

CosmosEmbed1 implements the ModelInterface base class and provides a complete interface for the NVIDIA Cosmos-Embed1 multimodal embedding model. The model is available in three resolution variants: 224p, 336p, and 448p, each corresponding to a different HuggingFace model checkpoint (e.g., nvidia/Cosmos-Embed1-336p).

On setup, the model is loaded via AutoModel.from_pretrained with trust_remote_code=True onto CUDA in bfloat16 precision, alongside an AutoProcessor for input preprocessing. The class supports a utils_only mode where only the processor is initialized without loading the full model weights, which is useful for frame preprocessing on workers that do not need to run inference.
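The loading behavior described above can be sketched roughly as follows. This is a minimal illustration, not the NeMo Curator source: the function name `load_cosmos_embed1` and the `weights_path` parameter are hypothetical, and only the `from_pretrained` calls, `trust_remote_code=True`, the CUDA device, and the bfloat16 dtype come from the description.

```python
def load_cosmos_embed1(weights_path: str, utils_only: bool = False):
    """Return (model, processor); model is None when utils_only is True.

    Hypothetical helper for illustration; the real class keeps this logic in setup().
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModel, AutoProcessor

    # The processor is needed in both modes for input preprocessing.
    processor = AutoProcessor.from_pretrained(weights_path, trust_remote_code=True)
    if utils_only:
        # Preprocessing-only workers skip loading the model weights.
        return None, processor

    import torch

    # Load the full model, then move it to the GPU in bfloat16.
    model = AutoModel.from_pretrained(weights_path, trust_remote_code=True)
    model = model.to("cuda", dtype=torch.bfloat16)
    return model, processor
```

Returning early in the utils-only branch means a preprocessing worker never touches the GPU or the model checkpoint.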

Key capabilities include:

Frame formulation: formulate_input_frames uniformly samples the target number of frames from a video clip and preprocesses them through the processor.
Video encoding: encode_video_frames produces video embeddings through the model's get_video_embeddings call, returning float16 tensors on CPU.
Text encoding: get_text_embedding encodes a text string through the model's get_text_embeddings call, returning float16 tensors on CPU.
Evaluation: Computes cosine similarity between video and text embeddings using a scaled dot product followed by softmax, returning top-k probabilities and indices.
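As an illustration of the uniform sampling step in frame formulation, here is a minimal NumPy sketch. The helper name `sample_frames_uniformly` is hypothetical, and the use of evenly spaced, rounded indices is an assumption; the page does not show how the class actually selects indices. Returning None for a clip that is too short mirrors the documented behavior of formulate_input_frames.

```python
import numpy as np


def sample_frames_uniformly(frames, target_num_frames):
    """Pick target_num_frames frames at evenly spaced positions.

    Returns None when the clip has fewer frames than requested.
    Hypothetical helper; index selection here is an assumption.
    """
    if len(frames) < target_num_frames:
        return None
    # Evenly spaced positions from the first to the last frame, inclusive.
    indices = np.linspace(0, len(frames) - 1, target_num_frames).round().astype(int)
    return [frames[i] for i in indices]
```

With a 10-frame clip and a target of 4, this selects frames 0, 3, 6, and 9.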

Weight downloading is handled by class methods download_weights_on_node and download_processor_config_on_node, which fetch model artifacts from HuggingFace Hub with specific revision pinning per variant.

Usage

Use CosmosEmbed1 when you need to generate multimodal embeddings for video-text alignment tasks in the NeMo Curator video curation pipeline. It is the core model for embedding-based filtering, text-video matching, and semantic analysis of video content at scale.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/models/cosmos_embed1.py
Lines: 1-218

Signature

class CosmosEmbed1(ModelInterface):
    def __init__(
        self,
        *,
        variant: Literal["224p", "336p", "448p"] = "336p",
        utils_only: bool = False,
        model_dir: str | None = None,
    ) -> None: ...

    def setup(self) -> None: ...
    def get_target_num_frames(self) -> int: ...
    def formulate_input_frames(self, frames: list[npt.NDArray[np.uint8]]) -> npt.NDArray[np.float32] | None: ...
    def encode_video_frames(self, frames: npt.NDArray[np.float32]) -> torch.Tensor: ...
    def get_text_embedding(self, text: str) -> torch.Tensor: ...
    def evaluate(self, video_embd: torch.Tensor, text_embds: list[torch.Tensor]) -> tuple[list[float], list[int]]: ...

    @classmethod
    def download_weights_on_node(cls, model_dir: str, variant: Literal["224p", "336p", "448p"] = "336p") -> None: ...

    @classmethod
    def download_processor_config_on_node(cls, model_dir: str, variant: Literal["224p", "336p", "448p"] = "336p") -> None: ...

Import

from nemo_curator.models.cosmos_embed1 import CosmosEmbed1

I/O Contract

Inputs

Name	Type	Required	Description
variant	`Literal["224p", "336p", "448p"]`	No (default: "336p")	Resolution variant of the Cosmos-Embed1 model
utils_only	`bool`	No (default: False)	If True, only initialize the processor without loading model weights
model_dir	`str | None`	No (default: None)	Directory containing model weights; used to construct the full weights path
frames (formulate_input_frames)	`list[npt.NDArray[np.uint8]]`	Yes	List of video frames as uint8 NumPy arrays
frames (encode_video_frames)	`npt.NDArray[np.float32]`	Yes	Preprocessed video frames as float32 NumPy array
text (get_text_embedding)	`str`	Yes	Input text string to encode
video_embd (evaluate)	`torch.Tensor`	Yes	Video embedding tensor
text_embds (evaluate)	`list[torch.Tensor]`	Yes	List of text embedding tensors for comparison

Outputs

Name	Type	Description
formulate_input_frames	`npt.NDArray[np.float32] | None`	Preprocessed input frames array, or None if frame count is insufficient
encode_video_frames	`torch.Tensor`	Video embedding tensor in float16 on CPU, shape `(batch, embed_dim)`
get_text_embedding	`torch.Tensor`	Text embedding tensor in float16 on CPU
evaluate	`tuple[list[float], list[int]]`	Tuple of (top-k probabilities, top-k indices) from softmax similarity
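The evaluate output can be pictured with a small NumPy sketch of scaled cosine similarity followed by softmax and top-k selection. This is not the library's implementation: the function name `rank_texts`, the `scale` value of 100.0, and the `top_k` default of 5 are assumptions chosen for illustration, and NumPy stands in for the torch tensors the class actually uses.

```python
import numpy as np


def rank_texts(video_embd, text_embds, scale=100.0, top_k=5):
    """Return (top-k probabilities, top-k indices) over the candidate texts.

    Hypothetical helper; scale and top_k are assumed values.
    """
    video = np.asarray(video_embd, dtype=np.float32).reshape(-1)
    texts = np.stack([np.asarray(t, dtype=np.float32).reshape(-1) for t in text_embds])
    # L2-normalize so the dot product equals cosine similarity.
    video = video / np.linalg.norm(video)
    texts = texts / np.linalg.norm(texts, axis=1, keepdims=True)
    # Scaled similarities act as softmax logits.
    logits = scale * (texts @ video)
    logits -= logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum()
    # Indices of the k most probable texts, highest first.
    k = min(top_k, len(text_embds))
    top = np.argsort(-probs)[:k]
    return [float(probs[i]) for i in top], [int(i) for i in top]
```

The returned indices refer to positions in `text_embds`, so the best-matching caption is `text_embds[indices[0]]`.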

Usage Examples

Basic Usage

from nemo_curator.models.cosmos_embed1 import CosmosEmbed1

# Download weights first
CosmosEmbed1.download_weights_on_node(model_dir="/path/to/models", variant="336p")

# Initialize and set up the model
model = CosmosEmbed1(variant="336p", model_dir="/path/to/models")
model.setup()

# Get video embeddings from preprocessed frames
# video_frames_list: list of uint8 frame arrays decoded from a clip (assumed to exist)
input_frames = model.formulate_input_frames(video_frames_list)
if input_frames is not None:
    video_embedding = model.encode_video_frames(input_frames)

    # Get text embedding
    text_embedding = model.get_text_embedding("a person walking on a beach")

    # Evaluate similarity (only when a video embedding was produced)
    probs, indices = model.evaluate(video_embedding, [text_embedding])

Utils-Only Mode

# Download only the processor config (no model weights)
CosmosEmbed1.download_processor_config_on_node(model_dir="/path/to/models", variant="336p")

# Initialize processor only (no model weights loaded)
processor = CosmosEmbed1(variant="336p", utils_only=True, model_dir="/path/to/models")
processor.setup()

# Use processor for frame preprocessing
target_frames = processor.get_target_num_frames()
input_frames = processor.formulate_input_frames(video_frames_list)

Model Variants

Variant	HuggingFace Model ID	Revision
224p	`nvidia/Cosmos-Embed1-224p`	85f5627
336p	`nvidia/Cosmos-Embed1-336p`	5d8309d
448p	`nvidia/Cosmos-Embed1-448p`	9f4ff4d
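The table above can be expressed as a small lookup, which is roughly what per-variant revision pinning amounts to. The names `VariantInfo`, `COSMOS_EMBED1_VARIANTS`, and `resolve_variant` are hypothetical; only the model IDs and revision strings are taken from the table.

```python
from typing import NamedTuple


class VariantInfo(NamedTuple):
    model_id: str
    revision: str


# Values copied from the Model Variants table; the structure itself is illustrative.
COSMOS_EMBED1_VARIANTS = {
    "224p": VariantInfo("nvidia/Cosmos-Embed1-224p", "85f5627"),
    "336p": VariantInfo("nvidia/Cosmos-Embed1-336p", "5d8309d"),
    "448p": VariantInfo("nvidia/Cosmos-Embed1-448p", "9f4ff4d"),
}


def resolve_variant(variant: str = "336p") -> VariantInfo:
    """Map a variant name to its HuggingFace model ID and pinned revision."""
    try:
        return COSMOS_EMBED1_VARIANTS[variant]
    except KeyError:
        raise ValueError(f"Unknown Cosmos-Embed1 variant: {variant!r}") from None
```

Keeping the revision next to the model ID means a download helper can request one fixed snapshot per variant rather than whatever the repository's default branch currently holds.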

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
