Implementation:NVIDIA NeMo Aligner Image Text RMS
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Image Generation, Reward Modeling, CLIP |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
image_text_rms.py provides CLIP-based image-text reward models (PickscoreRewardModel and MegatronCLIPRewardModel) used to score image-text alignment quality for DRaFT+ training of diffusion models.
Description
This module contains the reward models used by DRaFT+ to provide differentiable reward signals for diffusion model alignment. It defines two main classes and a factory function:
PickscoreRewardModel (extends MegatronModule):
- A low-level CLIP-based reward model that computes image-text similarity scores.
- Contains a CLIPVisionTransformer for image encoding and a CLIPTextTransformer for text encoding.
- Maintains a learnable logit_scale parameter (initialized to log(1/0.07)).
- The get_reward method computes rewards as the scaled cosine similarity between L2-normalized image and text features: $R = e^{\text{logit\_scale}} \cdot \operatorname{diag}(\hat{I} \hat{T}^{\top})$, where $\hat{I}$ and $\hat{T}$ are the L2-normalized image and text feature matrices.
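The reward described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the module's code: the function names `l2_normalize` and `pairwise_rewards` are hypothetical, nested lists stand in for tensors, and the feature vectors are made up. Taking the diagonal of the similarity matrix means each image is scored only against its own caption.

```python
import math


def l2_normalize(vec):
    """Return vec scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def pairwise_rewards(image_feats, text_feats, logit_scale):
    """Scaled cosine similarity between the i-th image and the i-th caption.

    Equivalent to exp(logit_scale) * diag(I_hat @ T_hat^T), computed row by row.
    """
    scale = math.exp(logit_scale)
    rewards = []
    for img, txt in zip(image_feats, text_feats):
        img_hat = l2_normalize(img)
        txt_hat = l2_normalize(txt)
        cosine = sum(a * b for a, b in zip(img_hat, txt_hat))
        rewards.append(scale * cosine)
    return rewards


# Initial value of the learnable logit_scale, as stated in the description.
LOGIT_SCALE_INIT = math.log(1 / 0.07)

# Made-up 2-D features: the first pair is parallel, the second is orthogonal.
image_feats = [[3.0, 4.0], [1.0, 0.0]]
text_feats = [[6.0, 8.0], [0.0, 2.0]]
rewards = pairwise_rewards(image_feats, text_feats, LOGIT_SCALE_INIT)
```

With the initial logit_scale, a perfectly aligned pair scores 1/0.07 and an orthogonal pair scores 0, so the scale sets the range of the reward rather than its ranking.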
MegatronCLIPRewardModel (extends MegatronCLIPModel):
- A higher-level wrapper that adds differentiable image preprocessing for end-to-end gradient flow from the reward signal back through the diffusion model.
- Preprocessing pipeline: Resize to 224x224 (bicubic), center crop, rescale by 1/255, normalize with OpenAI dataset mean/std.
- get_reward first preprocesses images and tokenizes captions, then delegates to the get_reward method of the inner PickscoreRewardModel.
- model_provider_func creates a PickscoreRewardModel instance for use with Megatron's pipeline parallelism.
- loss_func computes a categorical KL divergence loss for reward model training, comparing predicted reward rankings against human preference labels, along with an accuracy metric.
- build_train_valid_test_datasets builds Pickscore preference datasets for reward model training.
- dl_collate_fn provides a custom collate function that handles multi-crop image pairs (img_0, img_1) and prompts.
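The preference loss mentioned in the list above can be illustrated with a small stdlib-only sketch. It assumes the usual pairwise setup: two candidate images per prompt, a softmax over their two rewards, and a label given as a probability distribution over the two images. The helper names, the mean reduction, and the label encoding are assumptions for illustration and may differ from what loss_func actually does.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def preference_kl(rewards, label):
    """KL(label || softmax(rewards)) for one prompt with two candidate images."""
    pred = softmax(rewards)
    # Terms with label probability 0 contribute nothing to the divergence.
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(label, pred) if p > 0)


def batch_loss_and_accuracy(reward_pairs, labels):
    """Mean KL loss and the fraction of pairs where the predicted winner matches the label."""
    losses = [preference_kl(r, l) for r, l in zip(reward_pairs, labels)]
    correct = sum(
        1 for r, l in zip(reward_pairs, labels) if r.index(max(r)) == l.index(max(l))
    )
    return sum(losses) / len(losses), correct / len(labels)


# Made-up rewards for (img_0, img_1); both labels say img_0 was preferred.
reward_pairs = [[2.0, 0.0], [0.5, 1.5]]
labels = [[1.0, 0.0], [1.0, 0.0]]
loss, accuracy = batch_loss_and_accuracy(reward_pairs, labels)
```

In this toy batch the first pair is ranked correctly and the second is not, so the second pair dominates the loss and the accuracy is one half.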
get_reward_model (factory function):
- Loads a pretrained MegatronCLIPRewardModel from a NeMo checkpoint using setup_trainer_and_model_for_inference.
- Applies configuration modifications (precision, disabling sequence parallelism and activation checkpointing).
- Accepts mbs and gbs parameters to configure batch sizes.
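The configuration changes listed above can be pictured as a small override step applied to the model config before the checkpoint is restored. The sketch below uses a plain dict in place of a DictConfig, and the helper name `apply_inference_overrides` is hypothetical. The key names are assumptions modeled on common NeMo Megatron config fields, so check them against the actual source before relying on them.

```python
import copy


def apply_inference_overrides(model_cfg, precision, mbs, gbs):
    """Return a copy of model_cfg adjusted for reward-model inference.

    All key names here are assumed, not taken from image_text_rms.py.
    """
    cfg = copy.deepcopy(model_cfg)
    cfg["precision"] = precision  # match the trainer's precision
    cfg["sequence_parallel"] = False  # disable sequence parallelism
    cfg["activations_checkpoint_granularity"] = None  # disable activation checkpointing
    cfg["activations_checkpoint_method"] = None
    cfg["micro_batch_size"] = mbs
    cfg["global_batch_size"] = gbs
    return cfg


# Made-up checkpoint config that was saved with training-time settings.
saved_cfg = {
    "precision": 32,
    "sequence_parallel": True,
    "activations_checkpoint_granularity": "full",
    "activations_checkpoint_method": "uniform",
    "micro_batch_size": 1,
    "global_batch_size": 8,
}
inference_cfg = apply_inference_overrides(saved_cfg, precision="bf16", mbs=4, gbs=32)
```

Working on a copy keeps the saved config intact, which makes it easy to compare the training-time and inference-time settings.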
Usage
Use get_reward_model to load a pretrained CLIP reward model for DRaFT+ training, then attach it to the diffusion model via ptl_model.reward_model = reward_model. During the DRaFT+ forward pass, the attached model's get_reward method is called to compute differentiable reward scores.
Code Reference
Source Location
- Repository: NVIDIA_NeMo_Aligner
- File: nemo_aligner/models/mm/stable_diffusion/image_text_rms.py
- Lines: 1-284
Signature
class PickscoreRewardModel(MegatronModule):
    def __init__(self, model_cfg, model_parallel_config, padded_vocab_size, pre_process=True, post_process=True):
    def get_reward(self, images, captions):
    def forward(self, images, captions):

class MegatronCLIPRewardModel(MegatronCLIPModel):
    def __init__(self, cfg, trainer):
    def diff_preprocess(self):
    def preprocess(self, images, captions):
    def get_reward(self, images, captions):
    def model_provider_func(self, pre_process, post_process):
    def loss_func(self, output_tensor):
    def get_forward_output_and_loss_func(self):

def get_reward_model(cfg, mbs, gbs):
Import
from nemo_aligner.models.mm.stable_diffusion.image_text_rms import (
    MegatronCLIPRewardModel,
    PickscoreRewardModel,
    get_reward_model,
)
I/O Contract
Inputs (get_reward)
| Name | Type | Required | Description |
|---|---|---|---|
| images | Tensor | Yes | Image tensor of shape [B, H, W, C] with values in [0, 255] (MegatronCLIPRewardModel) or preprocessed [B, C, H, W] (PickscoreRewardModel) |
| captions | list[str] or Tensor | Yes | Text captions as strings (MegatronCLIPRewardModel) or tokenized tensors (PickscoreRewardModel) |
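For raw inputs in [0, 255], the rescale and normalize steps of the preprocessing pipeline reduce to simple per-channel arithmetic. The sketch below works on a single RGB pixel in plain Python. The function name is hypothetical, the constants are the widely published OpenAI CLIP mean and standard deviation (assumed to be what "OpenAI dataset mean/std" refers to), and the resize, crop, and [B, H, W, C] to [B, C, H, W] layout change are not shown.

```python
# Widely published OpenAI CLIP normalization constants, in RGB order.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def rescale_and_normalize(pixel):
    """Map one RGB pixel from [0, 255] to CLIP-normalized values.

    Step 1: rescale by 1/255 so each channel lies in [0, 1].
    Step 2: subtract the channel mean and divide by the channel std.
    """
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(pixel, CLIP_MEAN, CLIP_STD)
    )


white = rescale_and_normalize((255, 255, 255))
black = rescale_and_normalize((0, 0, 0))
```

Both steps are affine in the pixel values, which is why they can sit inside the gradient path from the reward back to the generated image.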
Outputs (get_reward)
| Name | Type | Description |
|---|---|---|
| rewards | Tensor | One reward score per image-caption pair, shape [B]; higher values indicate better image-text alignment |
Inputs (get_reward_model)
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictConfig | Yes | Configuration containing model checkpoint path and trainer settings |
| mbs | int | Yes | Micro batch size for the reward model |
| gbs | int | Yes | Global batch size for the reward model |
Outputs (get_reward_model)
| Name | Type | Description |
|---|---|---|
| model | MegatronCLIPRewardModel | Loaded and configured CLIP reward model ready for inference |
Usage Examples
import torch

from nemo_aligner.models.mm.stable_diffusion.image_text_rms import get_reward_model

# cfg, images, captions, and ptl_model are assumed to be defined by the surrounding training script.

# Load reward model from checkpoint
reward_model = get_reward_model(cfg.rm, mbs=4, gbs=32)
reward_model = reward_model.to(torch.cuda.current_device())

# Compute rewards for generated images
# images: [B, H, W, C] tensor with values in [0, 255]
# captions: list of B strings, one per image
rewards = reward_model.get_reward(images, captions)  # returns [B] tensor

# Attach to DRaFT+ model for training
ptl_model.reward_model = reward_model