Principle:Alibaba ROLL Video Generation and Reward
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Reinforcement_Learning |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A combined generation-and-reward principle where video denoising and reward scoring are performed in a single differentiable forward pass.
Description
Unlike LLM RL, where generation and reward are separate steps, reward flow combines them into a single differentiable forward pass:
- Denoising (frozen steps 0 to mid-1): Run Euler ODE steps without gradients to partially denoise the latent
- Denoising (grad steps mid to final): Continue denoising with gradients enabled, allowing reward signal backpropagation
- VAE decoding: Decode latent to pixel space
- Reward scoring: Compute face identity similarity between generated and reference faces
- KL regularization: Compute KL divergence between LoRA-on and LoRA-off predictions for stability
The reward signal flows back through the differentiable denoising steps to update the LoRA parameters.
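The five steps above can be sketched in PyTorch. This is a minimal illustration with toy stand-ins for the denoising model, VAE decoder, and face-identity reward (all function names here are hypothetical, not the ROLL API); the point is the split between gradient-free and gradient-enabled Euler steps:

```python
import torch

def euler_step(model, x, t, dt):
    # One Euler ODE step on the flow: x <- x + v(x, t) * dt
    return x + model(x, t) * dt

def reward_flow_forward(model, vae_decode, reward_fn, x_T, timesteps, mid):
    """Partially denoise without gradients, finish with gradients, then score.

    timesteps: 1-D tensor of ODE times; mid: index where gradients turn on.
    """
    x = x_T
    dts = timesteps[1:] - timesteps[:-1]
    # Frozen steps 0 .. mid-1: no gradient tracking, cheap partial denoising
    with torch.no_grad():
        for i in range(mid):
            x = euler_step(model, x, timesteps[i], dts[i])
    # Gradient-enabled steps mid .. final: reward can backpropagate from here
    for i in range(mid, len(dts)):
        x = euler_step(model, x, timesteps[i], dts[i])
    pixels = vae_decode(x)        # VAE decoding: latent -> pixel space
    return reward_fn(pixels), x   # scalar reward score and final latent
```

Calling `(-reward).backward()` on the returned score then pushes the reward gradient through only the last `len(dts) - mid` Euler steps, which is what keeps memory bounded.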
Usage
Use as the core forward pass in reward flow diffusion training.
Theoretical Basis
Reward flow optimization:
$$ L = -\text{reward\_score} \cdot w_{\text{reward}} + \mathrm{KL}\!\left(\text{LoRA}_{\text{on}} \,\|\, \text{LoRA}_{\text{off}}\right) $$
The gradient of $L$ flows back through the differentiable Euler ODE steps into the LoRA parameters.
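A minimal sketch of this objective, assuming the KL term is taken between the LoRA-on and LoRA-off model predictions and that both are treated as means of unit-variance Gaussians (under which the KL reduces to half the squared difference; this simplification is an assumption of the sketch, not necessarily the ROLL implementation):

```python
import torch

def reward_flow_loss(reward_score, pred_lora_on, pred_lora_off, w_reward=1.0):
    """L = -reward_score * w_reward + KL(LoRA_on || LoRA_off)."""
    # Assumption: predictions are means of unit-variance Gaussians, so the
    # KL term reduces to 0.5 * ||on - off||^2. The LoRA-off prediction is
    # detached because it serves as the frozen regularization reference.
    kl = 0.5 * (pred_lora_on - pred_lora_off.detach()).pow(2).mean()
    return -reward_score * w_reward + kl
```

When the LoRA-on and LoRA-off predictions coincide the KL term vanishes and the loss is just the negated, weighted reward, so minimizing $L$ raises the reward while penalizing drift from the base model.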
Related Pages
Implemented By
Related Heuristics
The following heuristics inform this principle: