Implementation:NVIDIA NeMo Aligner Image Text RMS

From Leeroopedia


Knowledge Sources
Domains: Multimodal, Image Generation, Reward Modeling, CLIP
Last Updated: 2026-02-08 00:00 GMT

Overview

image_text_rms.py provides CLIP-based image-text reward models (PickscoreRewardModel and MegatronCLIPRewardModel) used to score image-text alignment quality for DRaFT+ training of diffusion models.

Description

This module contains the reward models used by DRaFT+ to provide differentiable reward signals for diffusion model alignment. It defines two main classes and a factory function:

PickscoreRewardModel (extends MegatronModule):

  • A low-level CLIP-based reward model that computes image-text similarity scores.
  • Contains a CLIPVisionTransformer for image encoding and a CLIPTextTransformer for text encoding.
  • Maintains a learnable logit_scale parameter (initialized to log(1/0.07)).
  • The get_reward method computes rewards as the scaled cosine similarity between L2-normalized image and text features: $R = e^{\text{logit\_scale}} \cdot \operatorname{diag}(\hat{I}\,\hat{T}^{\top})$, where $\hat{I}$ and $\hat{T}$ are the L2-normalized image and text feature matrices.
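
The reward formula can be illustrated with a small numerical sketch. This is not the module's code: it uses NumPy instead of torch to stay dependency-light, and the function name clip_reward and the toy feature values are hypothetical. It assumes the features have already been produced by the two encoders.

```python
import numpy as np

def clip_reward(image_features, text_features, logit_scale):
    """Scaled cosine similarity between paired image and text features.

    image_features, text_features: arrays of shape [B, D].
    logit_scale: scalar stored in log space, as described above.
    """
    # L2-normalize each row so the dot product becomes a cosine similarity.
    img = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    # The row-wise dot product equals diag(img @ txt.T) without
    # materializing the full B x B similarity matrix.
    cosine = np.sum(img * txt, axis=-1)
    return np.exp(logit_scale) * cosine

# Initial value of logit_scale stated above: log(1 / 0.07).
logit_scale = np.log(1 / 0.07)

# Toy features (hypothetical): pair 0 is perfectly aligned, pair 1 is orthogonal.
image_features = np.array([[1.0, 0.0], [0.0, 2.0]])
text_features = np.array([[3.0, 0.0], [1.0, 0.0]])
rewards = clip_reward(image_features, text_features, logit_scale)
```

With the initial logit_scale, a perfectly aligned pair receives a reward of 1/0.07 and an orthogonal pair receives 0, so the scale sets the range of the reward.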

MegatronCLIPRewardModel (extends MegatronCLIPModel):

  • A higher-level wrapper that adds differentiable image preprocessing for end-to-end gradient flow from the reward signal back through the diffusion model.
  • Preprocessing pipeline: Resize to 224x224 (bicubic), center crop, rescale by 1/255, normalize with OpenAI dataset mean/std.
  • get_reward first preprocesses images and tokenizes captions, then delegates to the inner model's get_reward.
  • model_provider_func creates a PickscoreRewardModel instance for use with Megatron's pipeline parallelism.
  • loss_func computes a categorical KL divergence loss for reward model training, comparing predicted reward rankings against human preference labels, along with an accuracy metric.
  • build_train_valid_test_datasets builds Pickscore preference datasets for reward model training.
  • dl_collate_fn provides a custom collate function that handles multi-crop image pairs (img_0, img_1) and prompts.
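
The loss_func description above can be sketched as follows. This is an assumption-laden illustration, not the module's implementation: it assumes each training example carries two rewards (one per image in the pair) and a two-element preference label that sums to 1, and it uses NumPy rather than torch. The function name preference_kl_loss and the toy values are hypothetical.

```python
import numpy as np

def preference_kl_loss(rewards, labels, eps=1e-12):
    """Mean KL(labels || softmax(rewards)) and ranking accuracy.

    rewards: [B, 2] reward for img_0 and img_1 under the same prompt.
    labels:  [B, 2] human preference distribution, rows summing to 1.
    """
    # Numerically stable log-softmax over the two candidates.
    shifted = rewards - rewards.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # KL divergence per example, using the convention 0 * log(0) = 0.
    log_labels = np.log(np.maximum(labels, eps))
    kl = np.sum(np.where(labels > 0, labels * (log_labels - log_probs), 0.0), axis=-1)
    # Accuracy: fraction of pairs where the higher reward matches the preferred image.
    # Tied labels such as [0.5, 0.5] are not treated specially in this sketch.
    accuracy = np.mean(rewards.argmax(axis=-1) == labels.argmax(axis=-1))
    return kl.mean(), accuracy

# Toy batch (hypothetical): annotators preferred img_0 in both pairs.
rewards = np.array([[2.0, 0.0], [0.0, 1.0]])
labels = np.array([[1.0, 0.0], [1.0, 0.0]])
loss, accuracy = preference_kl_loss(rewards, labels)
```

In this toy batch the model ranks the first pair correctly and the second incorrectly, so the accuracy is 0.5 and the second pair contributes most of the loss.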

get_reward_model (factory function):

  • Loads a pretrained MegatronCLIPRewardModel from a NeMo checkpoint using setup_trainer_and_model_for_inference.
  • Applies configuration modifications (precision, disabling sequence parallelism and activation checkpointing).
  • Accepts mbs and gbs parameters to configure batch sizes.

Usage

Use get_reward_model to load a pretrained CLIP reward model for DRaFT+ training. The reward model is attached to the diffusion model via ptl_model.reward_model = reward_model. The reward model's get_reward method is called during the DRaFT+ forward pass to compute differentiable reward scores.

Code Reference

Source Location

  • Repository: NVIDIA_NeMo_Aligner
  • File: nemo_aligner/models/mm/stable_diffusion/image_text_rms.py
  • Lines: 1-284

Signature

class PickscoreRewardModel(MegatronModule):
    def __init__(self, model_cfg, model_parallel_config, padded_vocab_size, pre_process=True, post_process=True):
    def get_reward(self, images, captions):
    def forward(self, images, captions):

class MegatronCLIPRewardModel(MegatronCLIPModel):
    def __init__(self, cfg, trainer):
    def diff_preprocess(self):
    def preprocess(self, images, captions):
    def get_reward(self, images, captions):
    def model_provider_func(self, pre_process, post_process):
    def loss_func(self, output_tensor):
    def get_forward_output_and_loss_func(self):

def get_reward_model(cfg, mbs, gbs):

Import

from nemo_aligner.models.mm.stable_diffusion.image_text_rms import (
    MegatronCLIPRewardModel,
    PickscoreRewardModel,
    get_reward_model,
)

I/O Contract

Inputs (get_reward)

Name | Type | Required | Description
images | Tensor | Yes | Image tensor of shape [B, H, W, C] with values in [0, 255] (MegatronCLIPRewardModel) or preprocessed [B, C, H, W] (PickscoreRewardModel)
captions | list[str] or Tensor | Yes | Text captions as strings (MegatronCLIPRewardModel) or tokenized tensors (PickscoreRewardModel)

Outputs (get_reward)

Name | Type | Description
rewards | Tensor | Scalar reward scores of shape [B] representing image-text alignment quality
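
The two input conventions differ in value range and layout. The sketch below shows only the rescale, normalize, and channels-first steps on images that are assumed to be already at the target resolution; the bicubic resize and center crop are omitted. NumPy is used for brevity, so this version is not differentiable, unlike the torch-based preprocessing the module describes. The function name rescale_and_normalize is hypothetical; the mean and std values are the ones published with OpenAI CLIP.

```python
import numpy as np

# Per-channel RGB statistics published with OpenAI CLIP.
OPENAI_CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
OPENAI_CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def rescale_and_normalize(images_bhwc):
    """Map [B, H, W, C] images in [0, 255] to normalized [B, C, H, W]."""
    # Rescale to [0, 1].
    x = images_bhwc.astype(np.float64) / 255.0
    # Normalize per channel; the statistics broadcast over the last axis.
    x = (x - OPENAI_CLIP_MEAN) / OPENAI_CLIP_STD
    # Move channels in front of the spatial dimensions.
    return np.transpose(x, (0, 3, 1, 2))

# Toy input (hypothetical): two all-white 8x8 RGB images.
images = np.full((2, 8, 8, 3), 255.0)
pixel_values = rescale_and_normalize(images)
```

Every step is an elementwise affine operation or a permutation, which is why the same pipeline stays differentiable when written with torch tensors.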

Inputs (get_reward_model)

Name | Type | Required | Description
cfg | DictConfig | Yes | Configuration containing model checkpoint path and trainer settings
mbs | int | Yes | Micro batch size for the reward model
gbs | int | Yes | Global batch size for the reward model

Outputs (get_reward_model)

Name | Type | Description
model | MegatronCLIPRewardModel | Loaded and configured CLIP reward model ready for inference

Usage Examples

import torch

from nemo_aligner.models.mm.stable_diffusion.image_text_rms import get_reward_model

# cfg, images, captions, and ptl_model are assumed to be defined by the
# surrounding training script; they are not created in this snippet.

# Load reward model from checkpoint
reward_model = get_reward_model(cfg.rm, mbs=4, gbs=32)
reward_model = reward_model.to(torch.cuda.current_device())

# Compute rewards for generated images
# images: [B, H, W, C] tensor with values in [0, 255]
# captions: list of strings
rewards = reward_model.get_reward(images, captions)  # returns [B] tensor

# Attach to DRaFT+ model for training
ptl_model.reward_model = reward_model
