
Principle:Alibaba ROLL Video Generation and Reward

From Leeroopedia


Knowledge Sources
Domains Diffusion_Models, Reinforcement_Learning
Last Updated 2026-02-07 20:00 GMT

Overview

A combined generation-and-reward principle where video denoising and reward scoring are performed in a single differentiable forward pass.

Description

Unlike LLM RL, where generation and reward scoring are separate steps, reward flow combines them into a single differentiable forward pass:

  1. Denoising (frozen steps 0 to mid-1): Run Euler ODE steps without gradients to partially denoise the latent
  2. Denoising (grad steps mid to final): Continue denoising with gradients enabled, allowing reward signal backpropagation
  3. VAE decoding: Decode latent to pixel space
  4. Reward scoring: Compute face identity similarity between generated and reference faces
  5. KL regularization: Compute KL divergence between LoRA-on and LoRA-off predictions for stability

The reward signal flows back through the differentiable denoising steps to update the LoRA parameters.
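The five steps above can be sketched as a single forward pass in PyTorch. This is a minimal illustration, not the actual ROLL API: the function names (`velocity_model`, `vae_decode`, `reward_fn`) and the Euler update form are placeholders assumed for this sketch.

```python
import torch

def reward_flow_forward(latent, velocity_model, vae_decode, reward_fn,
                        sigmas, grad_start):
    """Combined generation-and-reward pass (illustrative sketch).

    latent:         initial noisy latent tensor
    velocity_model: callable (x, sigma) -> velocity (carries LoRA params)
    vae_decode:     callable latent -> pixel-space frames
    reward_fn:      callable frames -> scalar reward (e.g. face identity)
    sigmas:         descending noise schedule, length = num_steps + 1
    grad_start:     index of the first gradient-enabled denoising step
    """
    x = latent
    # Steps 0 .. grad_start-1: Euler ODE updates without gradients.
    with torch.no_grad():
        for i in range(grad_start):
            v = velocity_model(x, sigmas[i])
            x = x + (sigmas[i + 1] - sigmas[i]) * v
    # Steps grad_start .. final: gradients enabled, so the reward can
    # backpropagate through the remaining denoising steps into the
    # LoRA parameters inside velocity_model.
    for i in range(grad_start, len(sigmas) - 1):
        v = velocity_model(x, sigmas[i])
        x = x + (sigmas[i + 1] - sigmas[i]) * v
    frames = vae_decode(x)      # latent -> pixel space
    reward = reward_fn(frames)  # scalar reward score
    return reward, x
```

Calling `reward.backward()` on the returned scalar then populates gradients only for parameters touched by the gradient-enabled steps, which is what makes partially frozen denoising memory-efficient.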

Usage

Use as the core forward pass in reward flow diffusion training.

Theoretical Basis

Reward flow optimization: {\displaystyle L = -\text{reward\_score} \cdot w_{\text{reward}} + \mathrm{KL}(\text{LoRA\_on} \,\|\, \text{LoRA\_off})}

The gradient flows back through the differentiable Euler ODE steps to the LoRA parameters: {\displaystyle \partial L / \partial \theta_{\text{LoRA}}}.
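The loss above can be written out directly. This is a hedged sketch: the paper-level formula fixes only the structure (negated weighted reward plus a KL term between LoRA-on and LoRA-off predictions), so the choice here of a softmax-based KL over the two predictions is an assumption, not the confirmed ROLL implementation.

```python
import torch
import torch.nn.functional as F

def reward_flow_loss(reward_score, pred_lora_on, pred_lora_off, w_reward=1.0):
    """L = -reward_score * w_reward + KL(LoRA_on || LoRA_off).

    The KL term is computed between softmax distributions over the two
    model predictions; the exact distributional form used by ROLL is an
    assumption of this sketch.
    """
    kl = F.kl_div(
        F.log_softmax(pred_lora_off, dim=-1),  # input: log q (LoRA off)
        F.log_softmax(pred_lora_on, dim=-1),   # target: log p (LoRA on)
        log_target=True,
        reduction="batchmean",
    )
    return -w_reward * reward_score + kl
```

When the LoRA-on and LoRA-off predictions coincide, the KL term vanishes and the loss reduces to the negated weighted reward, so minimizing `L` pushes the reward up while the KL term penalizes drift away from the base model.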

Related Pages

Implemented By

Related Heuristics

The following heuristics inform this principle:
