Workflow:Alibaba ROLL Reward Flow Diffusion Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Reinforcement_Learning, Video_Generation, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
End-to-end process for optimizing video diffusion models against reward scorers, using reward-flow (Reward FL) reinforcement-learning training and LoRA parameter-efficient fine-tuning.
Description
This workflow implements the Reward Flow (Reward FL) pipeline in the ROLL framework, designed for training diffusion models (specifically the Wan2.2 video generation model) using RL-based reward optimization. Unlike the LLM-focused RLVR pipeline, it operates on a continuous diffusion process, using reward signals from visual quality scorers to guide the denoising trajectory. Training applies LoRA adapters to the diffusion transformer (DiT) component for parameter-efficient fine-tuning, with DeepSpeed ZeRO and CPU offloading to manage the large model's memory requirements.
Usage
Execute this workflow when you have a pre-trained video diffusion model (e.g., Wan2.2-14B) and a reward scorer (e.g., face identity preservation scorer), and you want to fine-tune the model to generate videos that score higher on the reward metric while maintaining overall generation quality.
Execution Steps
Step 1: Environment Setup and Configuration
Prepare the compute environment with the diffusion model dependencies (DiffSynth-Studio) and define the Hydra YAML configuration specifying the diffusion model paths, reward scorer paths, LoRA configuration, and training parameters. Configure the diffusion-specific parameters including inference steps, timestep boundaries, and gradient checkpointing offload settings.
Key considerations:
- The Wan2.2 model requires four separate component paths (T5 encoder, VAE, DiT transformer, CLIP)
- LoRA is applied only to the DiT component with configurable rank and target modules
- Gradient checkpointing with CPU offload is essential for fitting the large model in GPU memory
- Configure num_inference_steps and mid/final timestep for the reward flow computation
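The considerations above can be captured in a Hydra-style YAML fragment like the following sketch. All key names, paths, and values here are illustrative assumptions, not the exact ROLL configuration schema; consult the example configs shipped with the framework for the real keys.

```yaml
# Illustrative sketch only -- key names and values are assumptions.
model:
  dit_path: /models/wan2.2/dit            # diffusion transformer (LoRA target)
  t5_path: /models/wan2.2/t5_encoder      # T5 text encoder
  vae_path: /models/wan2.2/vae            # VAE decoder
  clip_path: /models/wan2.2/clip          # CLIP encoder
lora:
  rank: 32                                # LoRA rank; applied to the DiT only
  target_modules: [q, k, v, o]
training:
  num_inference_steps: 40
  mid_timestep: 35                        # where reward-flow gradients begin
  gradient_checkpointing_offload: true    # offload activations to CPU
```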
Step 2: Model and Data Preparation
Prepare the base diffusion model components and the reward scorer model. Load the video generation prompt dataset (CSV format with text descriptions). The pipeline generates videos from text prompts and evaluates them against the reward scorer.
What happens:
- Diffusion model components (T5 text encoder, VAE decoder, DiT transformer, CLIP encoder) are loaded from specified paths
- LoRA adapter matrices are initialized on the DiT component
- Reward scorer model is loaded for evaluating generated video quality
- Text prompts are loaded and preprocessed for video generation
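The prompt-loading step can be sketched as below, assuming a CSV file with a `prompt` column. The column name and helper name are assumptions for illustration; ROLL's dataset format may use different column headers.

```python
import csv

def load_prompts(path: str, column: str = "prompt") -> list[str]:
    """Read non-empty text prompts from a CSV file.

    The `column` name is an assumption; adjust to the dataset's header.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [row[column].strip() for row in reader if row.get(column)]
```

Each returned prompt is then encoded by the T5 text encoder before the denoising process begins.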
Step 3: Distributed Worker Initialization
Launch the Ray cluster and initialize the diffusion actor training cluster with the DeepSpeed ZeRO strategy configured for CPU offloading. The training workers load the full diffusion pipeline including the LoRA adapters and prepare for the reward flow training loop.
Key considerations:
- DeepSpeed ZeRO-2 with CPU offloading manages the large model memory footprint
- Only LoRA adapter parameters are trained; the base model weights remain frozen
- Workers handle both the generative denoising pass and the reward computation
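A DeepSpeed configuration enabling ZeRO stage 2 with optimizer-state CPU offload might look like the sketch below. The field names follow DeepSpeed's documented JSON schema; the batch sizes and precision settings are illustrative, not ROLL's shipped defaults.

```python
# Sketch of a DeepSpeed config for ZeRO-2 with CPU offload.
# Field names follow DeepSpeed's JSON schema; values are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                      # shard optimizer states and gradients
        "offload_optimizer": {
            "device": "cpu",             # keep optimizer states in host RAM
            "pin_memory": True,
        },
    },
}
```

Because only the LoRA adapter parameters are trainable, the optimizer state being offloaded is small relative to the frozen base model, which is what makes ZeRO-2 plus offload sufficient here.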
Step 4: Video Generation and Reward Scoring
For each training batch, generate video samples from text prompts using the diffusion model's denoising process. Evaluate the generated videos against the reward scorer to obtain reward signals. The reward flow technique propagates reward gradients through the truncated diffusion trajectory.
What happens:
- Text prompts are encoded and the diffusion process generates video frames through iterative denoising
- The reward scorer evaluates generated videos (e.g., face identity preservation quality)
- Reward signals are computed as scalar scores per generated video
- The reward flow technique computes gradients through a subset of diffusion timesteps for efficiency
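The efficiency of the reward flow technique comes from tracking gradients only over a suffix of the denoising trajectory. A toy sketch of that truncation, assuming steps are indexed from noisiest to cleanest and `grad_from` marks where gradient tracking switches on (both names are hypothetical):

```python
def split_trajectory(num_steps: int, grad_from: int):
    """Partition denoising step indices into a no-grad prefix and a
    gradient-tracked suffix (the reward-flow truncation).

    Names and signature are illustrative, not ROLL's actual API.
    """
    if not 0 <= grad_from < num_steps:
        raise ValueError("grad_from must lie inside the trajectory")
    no_grad = list(range(grad_from))               # run without autograd
    with_grad = list(range(grad_from, num_steps))  # backprop reward through these
    return no_grad, with_grad
```

With, say, 40 inference steps and gradients starting at step 35, only the final 5 denoising steps need activations retained for backpropagation, which bounds memory cost independently of trajectory length.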
Step 5: LoRA Parameter Optimization
Compute the reward flow loss based on the reward signals and update the LoRA adapter parameters. The loss encourages the diffusion model to generate videos that score higher on the reward metric while staying close to the original model's generation distribution through implicit KL regularization from the timestep truncation.
Key considerations:
- Only LoRA parameters receive gradients; base model weights are frozen
- Gradient checkpointing offloads intermediate activations to CPU during backpropagation
- Learning rate scheduling (constant or with warmup) controls the optimization dynamics
- The reward is clipped and normalized for stable training
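The clip-and-normalize step can be sketched as below. This is a generic batch-normalization-style stabilization, not necessarily ROLL's exact formula; the clip threshold is an assumption.

```python
import statistics

def normalize_rewards(rewards: list[float], clip: float = 5.0) -> list[float]:
    """Normalize a batch of scalar rewards to zero mean / unit std, then clip.

    A generic stabilization sketch; the clip value of 5.0 is an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [max(-clip, min(clip, (r - mean) / std)) for r in rewards]
```

Normalizing per batch keeps the loss scale stable across prompts whose absolute reward magnitudes differ, and clipping bounds the influence of outlier samples on the LoRA update.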
Step 6: Checkpointing and LoRA Merging
Save LoRA adapter checkpoints at configured intervals. After training, merge the LoRA adapter weights back into the base model to produce a single deployable model checkpoint. Optionally merge multiple sharded checkpoint files into a consolidated file.
Key considerations:
- LoRA checkpoints are small and can be saved frequently
- The provided merge_lora.py utility merges adapters into the base model's safetensors checkpoint
- The merge_model.py utility consolidates sharded safetensors files into a single file
- The merged model can be used directly for video generation without the LoRA runtime overhead
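Per target matrix, the merge computes W' = W + (alpha / rank) * B @ A, folding the low-rank update into the frozen base weight. A pure-Python sketch of that arithmetic (the real merge_lora.py operates on safetensors tensors, not nested lists):

```python
def merge_lora(W, A, B, alpha: float, rank: int):
    """Fold a LoRA update into a base weight matrix: W' = W + (alpha/rank) * B @ A.

    W is (m x n), B is (m x r), A is (r x n). Pure-Python illustration of the
    merge arithmetic; real checkpoints store these as safetensors tensors.
    """
    scale = alpha / rank
    rows, cols, inner = len(B), len(A[0]), len(A)
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]
```

After merging, the LoRA branches disappear entirely, so inference runs the plain DiT forward pass with no adapter overhead.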