Implementation:NVIDIA NeMo Aligner Image Text RMS

Knowledge Sources	NVIDIA_NeMo_Aligner
Domains	Multimodal, Image Generation, Reward Modeling, CLIP
Last Updated	2026-02-08 00:00 GMT

Overview

image_text_rms.py provides CLIP-based image-text reward models (PickscoreRewardModel and MegatronCLIPRewardModel) used to score image-text alignment quality for DRaFT+ training of diffusion models.

Description

This module contains the reward models used by DRaFT+ to provide differentiable reward signals for diffusion model alignment. It defines two main classes and a factory function:

PickscoreRewardModel (extends MegatronModule):

A low-level CLIP-based reward model that computes image-text similarity scores.
Contains a CLIPVisionTransformer for image encoding and a CLIPTextTransformer for text encoding.
Maintains a learnable logit_scale parameter (initialized to log(1/0.07)).
The get_reward method computes rewards as the scaled cosine similarity between L2-normalized image and text features: $R = e^{\text{logit\_scale}} \cdot \operatorname{diag}(\hat{I} \hat{T}^{\top})$.
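
The reward formula above can be illustrated without any deep learning framework. The following is a minimal sketch, not the module's code: the function names are hypothetical, plain Python lists stand in for feature tensors, and it assumes one text feature vector per image so that only the diagonal of the similarity matrix is needed.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_style_rewards(image_feats, text_feats, logit_scale):
    """Return exp(logit_scale) * cosine(image_i, text_i) for each pair i.

    Hypothetical helper: pairs row i of image_feats with row i of
    text_feats, which corresponds to the diagonal of I_hat @ T_hat^T.
    """
    scale = math.exp(logit_scale)
    rewards = []
    for img, txt in zip(image_feats, text_feats):
        img_n = l2_normalize(img)
        txt_n = l2_normalize(txt)
        cosine = sum(a * b for a, b in zip(img_n, txt_n))
        rewards.append(scale * cosine)
    return rewards


# logit_scale initialised as described above: log(1 / 0.07)
logit_scale = math.log(1 / 0.07)
image_feats = [[1.0, 0.0], [0.0, 2.0]]
text_feats = [[3.0, 0.0], [1.0, 0.0]]
rewards = clip_style_rewards(image_feats, text_feats, logit_scale)
```

In this toy input the first pair is perfectly aligned, so its reward equals the scale factor 1/0.07, while the second pair is orthogonal and receives a reward of zero.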

MegatronCLIPRewardModel (extends MegatronCLIPModel):

A higher-level wrapper that adds differentiable image preprocessing for end-to-end gradient flow from the reward signal back through the diffusion model.
Preprocessing pipeline: Resize to 224x224 (bicubic), center crop, rescale by 1/255, normalize with OpenAI dataset mean/std.
get_reward first preprocesses images and tokenizes captions, then delegates to the inner model's get_reward.
model_provider_func creates a PickscoreRewardModel instance for use with Megatron's pipeline parallelism.
loss_func computes a categorical KL divergence loss for reward model training, comparing predicted reward rankings against human preference labels, along with an accuracy metric.
build_train_valid_test_datasets builds Pickscore preference datasets for reward model training.
dl_collate_fn provides a custom collate function that handles multi-crop image pairs (img_0, img_1) and prompts.
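
To make the loss_func description more concrete, here is a framework-free sketch of a categorical KL divergence over an image pair together with an accuracy metric. It is an illustration under stated assumptions rather than the module's implementation: the function names are hypothetical, the predicted distribution is assumed to be a softmax over the two rewards, and tied human labels are skipped when computing accuracy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def preference_kl(rewards, label):
    """KL(label || softmax(rewards)) for one image pair.

    Terms with a zero label probability contribute nothing (0 * log 0 = 0).
    """
    pred = softmax(rewards)
    return sum(y * (math.log(y) - math.log(p)) for y, p in zip(label, pred) if y > 0)


def preference_accuracy(batch_rewards, batch_labels):
    """Fraction of untied pairs where the higher reward matches the preferred image."""
    correct = 0
    counted = 0
    for rewards, label in zip(batch_rewards, batch_labels):
        if label[0] == label[1]:
            continue  # assumption of this sketch: ties are ignored
        counted += 1
        predicted = 0 if rewards[0] > rewards[1] else 1
        preferred = 0 if label[0] > label[1] else 1
        correct += int(predicted == preferred)
    return correct / counted if counted else 0.0


# Rewards for (img_0, img_1) and human preference labels for three prompts
batch_rewards = [[2.0, 0.0], [0.0, 1.0], [3.0, 1.0]]
batch_labels = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]

loss_first = preference_kl(batch_rewards[0], batch_labels[0])
loss_tied = preference_kl([1.0, 1.0], [0.5, 0.5])
accuracy = preference_accuracy(batch_rewards, batch_labels)
```

The loss shrinks as the reward gap grows in the direction the annotator preferred, and it is exactly zero when equal rewards meet a tied label.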

get_reward_model (factory function):

Loads a pretrained MegatronCLIPRewardModel from a NeMo checkpoint using setup_trainer_and_model_for_inference.
Applies configuration modifications (precision, disabling sequence parallelism and activation checkpointing).
Accepts mbs and gbs parameters to configure batch sizes.
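
The kind of configuration modification described above might look like the sketch below. It uses a plain dictionary in place of a DictConfig, and every key name is an assumption chosen for illustration; the actual keys set in image_text_rms.py may differ.

```python
import copy


def modify_reward_model_cfg(model_cfg, precision, mbs, gbs):
    """Return a copy of model_cfg adjusted for frozen reward-model inference.

    Hypothetical helper; the key names below are illustrative assumptions.
    """
    cfg = copy.deepcopy(model_cfg)  # leave the caller's config untouched
    cfg["precision"] = precision
    cfg["sequence_parallel"] = False  # disable sequence parallelism
    cfg["activations_checkpoint_granularity"] = None  # disable activation checkpointing
    cfg["activations_checkpoint_method"] = None
    cfg["micro_batch_size"] = mbs
    cfg["global_batch_size"] = gbs
    return cfg


original = {
    "precision": 32,
    "sequence_parallel": True,
    "activations_checkpoint_granularity": "full",
    "activations_checkpoint_method": "uniform",
}
modified = modify_reward_model_cfg(original, precision="bf16", mbs=4, gbs=32)
```

Working on a copy keeps the checkpoint's stored configuration intact, so the same overrides can be reapplied with different batch sizes.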

Usage

Use get_reward_model to load a pretrained CLIP reward model for DRaFT+ training, then attach it to the diffusion model via ptl_model.reward_model = reward_model. During the DRaFT+ forward pass, the reward model's get_reward method is called to compute differentiable reward scores.

Code Reference

Source Location

Repository: NVIDIA_NeMo_Aligner
File: nemo_aligner/models/mm/stable_diffusion/image_text_rms.py
Lines: 1-284

Signature

class PickscoreRewardModel(MegatronModule):
    def __init__(self, model_cfg, model_parallel_config, padded_vocab_size, pre_process=True, post_process=True):
    def get_reward(self, images, captions):
    def forward(self, images, captions):

class MegatronCLIPRewardModel(MegatronCLIPModel):
    def __init__(self, cfg, trainer):
    def diff_preprocess(self):
    def preprocess(self, images, captions):
    def get_reward(self, images, captions):
    def model_provider_func(self, pre_process, post_process):
    def loss_func(self, output_tensor):
    def get_forward_output_and_loss_func(self):

def get_reward_model(cfg, mbs, gbs):

Import

from nemo_aligner.models.mm.stable_diffusion.image_text_rms import (
    MegatronCLIPRewardModel,
    PickscoreRewardModel,
    get_reward_model,
)

I/O Contract

Inputs (get_reward)

Name	Type	Required	Description
images	Tensor	Yes	Image tensor of shape [B, H, W, C] with values in [0, 255] (MegatronCLIPRewardModel) or preprocessed [B, C, H, W] (PickscoreRewardModel)
captions	list[str] or Tensor	Yes	Text captions as strings (MegatronCLIPRewardModel) or tokenized tensors (PickscoreRewardModel)
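
The difference between the two image layouts can be sketched as follows. This is a simplified, non-differentiable illustration with a hypothetical function name: it covers only the rescale and normalize steps plus the [H, W, C] to [C, H, W] layout change for a single image, and it omits the bicubic resize and center crop. The mean and standard deviation values are the ones commonly published for OpenAI CLIP preprocessing.

```python
# Per-channel statistics commonly used for OpenAI CLIP preprocessing
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def rescale_and_normalize(image_hwc):
    """Convert one [H, W, C] image in [0, 255] to a normalized [C, H, W] image.

    Hypothetical helper; resizing and cropping are intentionally left out.
    """
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(CLIP_MEAN)
    return [
        [
            [
                (image_hwc[y][x][c] / 255.0 - CLIP_MEAN[c]) / CLIP_STD[c]
                for x in range(width)
            ]
            for y in range(height)
        ]
        for c in range(channels)
    ]


# A tiny 1x2 RGB image: H=1, W=2, C=3
image = [[[255, 0, 128], [0, 255, 128]]]
chw = rescale_and_normalize(image)
```

The result is indexed as chw[channel][row][column], matching the channel-first layout that PickscoreRewardModel expects.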

Outputs (get_reward)

Name	Type	Description
rewards	Tensor	Scalar reward scores of shape [B] representing image-text alignment quality

Inputs (get_reward_model)

Name	Type	Required	Description
cfg	DictConfig	Yes	Configuration containing model checkpoint path and trainer settings
mbs	int	Yes	Micro batch size for the reward model
gbs	int	Yes	Global batch size for the reward model

Outputs (get_reward_model)

Name	Type	Description
model	MegatronCLIPRewardModel	Loaded and configured CLIP reward model ready for inference

Usage Examples

import torch

from nemo_aligner.models.mm.stable_diffusion.image_text_rms import get_reward_model

# Load reward model from checkpoint (cfg is the already-loaded training config)
reward_model = get_reward_model(cfg.rm, mbs=4, gbs=32)
reward_model = reward_model.to(torch.cuda.current_device())

# Compute rewards for generated images
# images: [B, H, W, C] tensor with values in [0, 255]
# captions: list of strings
rewards = reward_model.get_reward(images, captions)  # returns [B] tensor

# Attach to DRaFT+ model for training
ptl_model.reward_model = reward_model
