Principle:Alibaba ROLL Video Generation and Reward
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Reinforcement_Learning |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A combined generation-and-reward principle where video denoising and reward scoring are performed in a single differentiable forward pass.
Description
Unlike LLM RL, where generation and reward are separate steps, reward flow combines them into a single differentiable forward pass:
- Denoising (frozen steps 0 to mid-1): Run Euler ODE steps without gradients to partially denoise the latent
- Denoising (grad steps mid to final): Continue denoising with gradients enabled, allowing reward signal backpropagation
- VAE decoding: Decode latent to pixel space
- Reward scoring: Compute face identity similarity between generated and reference faces
- KL regularization: Compute KL divergence between LoRA-on and LoRA-off predictions for stability
The reward signal flows back through the differentiable denoising steps to update the LoRA parameters.
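The five steps above can be sketched in PyTorch. This is a minimal illustration with toy stand-ins for the denoising model, VAE decoder, and face-identity reward (all function names here are hypothetical, not the ROLL API); the point is the split between gradient-free and gradient-enabled Euler steps:

```python
import torch

def euler_step(model, x, t, dt):
    # One Euler ODE step on the flow: x <- x + v(x, t) * dt
    return x + model(x, t) * dt

def reward_flow_forward(model, vae_decode, reward_fn, x_T, timesteps, mid):
    """Partially denoise without gradients, finish with gradients, then score.

    timesteps: 1-D tensor of ODE times; mid: index where gradients turn on.
    """
    x = x_T
    dts = timesteps[1:] - timesteps[:-1]
    # Frozen steps 0 .. mid-1: no gradient tracking, cheap partial denoising
    with torch.no_grad():
        for i in range(mid):
            x = euler_step(model, x, timesteps[i], dts[i])
    # Gradient-enabled steps mid .. final: reward can backpropagate from here
    for i in range(mid, len(dts)):
        x = euler_step(model, x, timesteps[i], dts[i])
    pixels = vae_decode(x)        # VAE decoding: latent -> pixel space
    return reward_fn(pixels), x   # scalar reward score and final latent
```

Calling `(-reward).backward()` on the returned score then pushes the reward gradient through only the last `len(dts) - mid` Euler steps, which is what keeps memory bounded.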
Usage
Use as the core forward pass in reward flow diffusion training.
Theoretical Basis
Reward flow optimization:
$$ L = -\text{reward\_score} \cdot w_{\text{reward}} + \mathrm{KL}\!\left(\text{LoRA}_{\text{on}} \,\|\, \text{LoRA}_{\text{off}}\right) $$
The gradient of $L$ flows back through the differentiable Euler ODE steps into the LoRA parameters.
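A minimal sketch of this objective, assuming the KL term is taken between the LoRA-on and LoRA-off model predictions and that both are treated as means of unit-variance Gaussians (under which the KL reduces to half the squared difference; this simplification is an assumption of the sketch, not necessarily the ROLL implementation):

```python
import torch

def reward_flow_loss(reward_score, pred_lora_on, pred_lora_off, w_reward=1.0):
    """L = -reward_score * w_reward + KL(LoRA_on || LoRA_off)."""
    # Assumption: predictions are means of unit-variance Gaussians, so the
    # KL term reduces to 0.5 * ||on - off||^2. The LoRA-off prediction is
    # detached because it serves as the frozen regularization reference.
    kl = 0.5 * (pred_lora_on - pred_lora_off.detach()).pow(2).mean()
    return -reward_score * w_reward + kl
```

When the LoRA-on and LoRA-off predictions coincide the KL term vanishes and the loss is just the negated, weighted reward, so minimizing $L$ raises the reward while penalizing drift from the base model.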
Related Pages
Implemented By
Related Heuristics
The following heuristics inform this principle: