Workflow:Alibaba ROLL Reward Flow Diffusion Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Reinforcement_Learning, Video_Generation, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
End-to-end process for optimizing video diffusion models against reward scorers, using reward-flow (Reward FL) reinforcement-learning training and LoRA parameter-efficient fine-tuning.
Description
This workflow implements the Reward Flow (Reward FL) pipeline in the ROLL framework, designed for training diffusion models (specifically the Wan2.2 video generation model) using RL-based reward optimization. Unlike the LLM-focused RLVR pipeline, it operates on a continuous diffusion process, using reward signals from visual quality scorers to guide the denoising trajectory. Training applies LoRA adapters to the diffusion transformer (DiT) component for parameter-efficient fine-tuning, with DeepSpeed ZeRO and CPU offloading to manage the large model's memory requirements.
Usage
Execute this workflow when you have a pre-trained video diffusion model (e.g., Wan2.2-14B) and a reward scorer (e.g., face identity preservation scorer), and you want to fine-tune the model to generate videos that score higher on the reward metric while maintaining overall generation quality.
Execution Steps
Step 1: Environment Setup and Configuration
Prepare the compute environment with the diffusion model dependencies (DiffSynth-Studio) and define the Hydra YAML configuration specifying the diffusion model paths, reward scorer paths, LoRA configuration, and training parameters. Configure the diffusion-specific parameters including inference steps, timestep boundaries, and gradient checkpointing offload settings.
Key considerations:
- The Wan2.2 model requires four separate component paths (T5 encoder, VAE, DiT transformer, CLIP)
- LoRA is applied only to the DiT component with configurable rank and target modules
- Gradient checkpointing with CPU offload is essential for fitting the large model in GPU memory
- Configure num_inference_steps and mid/final timestep for the reward flow computation
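The considerations above can be captured in a Hydra-style YAML fragment like the following sketch. All key names, paths, and values here are illustrative assumptions, not the exact ROLL configuration schema; consult the example configs shipped with the framework for the real keys.

```yaml
# Illustrative sketch only -- key names and values are assumptions.
model:
  dit_path: /models/wan2.2/dit            # diffusion transformer (LoRA target)
  t5_path: /models/wan2.2/t5_encoder      # T5 text encoder
  vae_path: /models/wan2.2/vae            # VAE decoder
  clip_path: /models/wan2.2/clip          # CLIP encoder
lora:
  rank: 32                                # LoRA rank; applied to the DiT only
  target_modules: [q, k, v, o]
training:
  num_inference_steps: 40
  mid_timestep: 35                        # where reward-flow gradients begin
  gradient_checkpointing_offload: true    # offload activations to CPU
```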
Step 2: Model and Data Preparation
Prepare the base diffusion model components and the reward scorer model. Load the video generation prompt dataset (CSV format with text descriptions). The pipeline generates videos from text prompts and evaluates them against the reward scorer.
What happens:
- Diffusion model components (T5 text encoder, VAE decoder, DiT transformer, CLIP encoder) are loaded from specified paths
- LoRA adapter matrices are initialized on the DiT component
- Reward scorer model is loaded for evaluating generated video quality
- Text prompts are loaded and preprocessed for video generation
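The prompt-loading step can be sketched as below, assuming a CSV file with a `prompt` column. The column name and helper name are assumptions for illustration; ROLL's dataset format may use different column headers.

```python
import csv

def load_prompts(path: str, column: str = "prompt") -> list[str]:
    """Read non-empty text prompts from a CSV file.

    The `column` name is an assumption; adjust to the dataset's header.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [row[column].strip() for row in reader if row.get(column)]
```

Each returned prompt is then encoded by the T5 text encoder before the denoising process begins.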
Step 3: Distributed Worker Initialization
Launch the Ray cluster and initialize the diffusion actor training cluster with the DeepSpeed ZeRO strategy configured for CPU offloading. The training workers load the full diffusion pipeline including the LoRA adapters and prepare for the reward flow training loop.
Key considerations:
- DeepSpeed ZeRO-2 with CPU offloading manages the large model memory footprint
- Only LoRA adapter parameters are trained; the base model weights remain frozen
- Workers handle both the generative denoising pass and the reward computation
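A DeepSpeed configuration enabling ZeRO stage 2 with optimizer-state CPU offload might look like the sketch below. The field names follow DeepSpeed's documented JSON schema; the batch sizes and precision settings are illustrative, not ROLL's shipped defaults.

```python
# Sketch of a DeepSpeed config for ZeRO-2 with CPU offload.
# Field names follow DeepSpeed's JSON schema; values are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                      # shard optimizer states and gradients
        "offload_optimizer": {
            "device": "cpu",             # keep optimizer states in host RAM
            "pin_memory": True,
        },
    },
}
```

Because only the LoRA adapter parameters are trainable, the optimizer state being offloaded is small relative to the frozen base model, which is what makes ZeRO-2 plus offload sufficient here.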
Step 4: Video Generation and Reward Scoring
For each training batch, generate video samples from text prompts using the diffusion model's denoising process. Evaluate the generated videos against the reward scorer to obtain reward signals. The reward flow technique propagates reward gradients through the truncated diffusion trajectory.
What happens:
- Text prompts are encoded and the diffusion process generates video frames through iterative denoising
- The reward scorer evaluates generated videos (e.g., face identity preservation quality)
- Reward signals are computed as scalar scores per generated video
- The reward flow technique computes gradients through a subset of diffusion timesteps for efficiency
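The efficiency of the reward flow technique comes from tracking gradients only over a suffix of the denoising trajectory. A toy sketch of that truncation, assuming steps are indexed from noisiest to cleanest and `grad_from` marks where gradient tracking switches on (both names are hypothetical):

```python
def split_trajectory(num_steps: int, grad_from: int):
    """Partition denoising step indices into a no-grad prefix and a
    gradient-tracked suffix (the reward-flow truncation).

    Names and signature are illustrative, not ROLL's actual API.
    """
    if not 0 <= grad_from < num_steps:
        raise ValueError("grad_from must lie inside the trajectory")
    no_grad = list(range(grad_from))               # run without autograd
    with_grad = list(range(grad_from, num_steps))  # backprop reward through these
    return no_grad, with_grad
```

With, say, 40 inference steps and gradients starting at step 35, only the final 5 denoising steps need activations retained for backpropagation, which bounds memory cost independently of trajectory length.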
Step 5: LoRA Parameter Optimization
Compute the reward flow loss based on the reward signals and update the LoRA adapter parameters. The loss encourages the diffusion model to generate videos that score higher on the reward metric while staying close to the original model's generation distribution through implicit KL regularization from the timestep truncation.
Key considerations:
- Only LoRA parameters receive gradients; base model weights are frozen
- Gradient checkpointing offloads intermediate activations to CPU during backpropagation
- Learning rate scheduling (constant or with warmup) controls the optimization dynamics
- The reward is clipped and normalized for stable training
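The clip-and-normalize step can be sketched as below. This is a generic batch-normalization-style stabilization, not necessarily ROLL's exact formula; the clip threshold is an assumption.

```python
import statistics

def normalize_rewards(rewards: list[float], clip: float = 5.0) -> list[float]:
    """Normalize a batch of scalar rewards to zero mean / unit std, then clip.

    A generic stabilization sketch; the clip value of 5.0 is an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [max(-clip, min(clip, (r - mean) / std)) for r in rewards]
```

Normalizing per batch keeps the loss scale stable across prompts whose absolute reward magnitudes differ, and clipping bounds the influence of outlier samples on the LoRA update.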
Step 6: Checkpointing and LoRA Merging
Save LoRA adapter checkpoints at configured intervals. After training, merge the LoRA adapter weights back into the base model to produce a single deployable model checkpoint. Optionally merge multiple sharded checkpoint files into a consolidated file.
Key considerations:
- LoRA checkpoints are small and can be saved frequently
- The provided merge_lora.py utility merges adapters into the base model's safetensors checkpoint
- The merge_model.py utility consolidates sharded safetensors files into a single file
- The merged model can be used directly for video generation without the LoRA runtime overhead
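Per target matrix, the merge computes W' = W + (alpha / rank) * B @ A, folding the low-rank update into the frozen base weight. A pure-Python sketch of that arithmetic (the real merge_lora.py operates on safetensors tensors, not nested lists):

```python
def merge_lora(W, A, B, alpha: float, rank: int):
    """Fold a LoRA update into a base weight matrix: W' = W + (alpha/rank) * B @ A.

    W is (m x n), B is (m x r), A is (r x n). Pure-Python illustration of the
    merge arithmetic; real checkpoints store these as safetensors tensors.
    """
    scale = alpha / rank
    rows, cols, inner = len(B), len(A[0]), len(A)
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]
```

After merging, the LoRA branches disappear entirely, so inference runs the plain DiT forward pass with no adapter overhead.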