Principle:Alibaba ROLL Diffusion Model Preparation

Knowledge Sources	LoRA RewardFLow Alibaba ROLL
Domains	Diffusion_Models, Model_Architecture
Last Updated	2026-02-07 20:00 GMT

Overview

A model preparation principle for initializing video diffusion model components (DiT, VAE, text encoder) with LoRA injection and reward scorer setup.

Diffusion Model Preparation handles the complex initialization of a multi-component video diffusion pipeline:

DiT (Diffusion Transformer): The main denoising model, injected with LoRA adapters for parameter-efficient training
VAE: Video autoencoder that encodes pixel-space video into latents and decodes latents back to pixels (frozen)
Text Encoder: Converts text prompts to conditioning embeddings (frozen)
Reward Scorer: FaceAnalysis model for computing face identity similarity rewards using ONNX-based SCRFD detection and ArcFace embedding
Euler Scheduler: ODE solver for the denoising trajectory with configurable timestep boundaries
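The Euler scheduler item above can be illustrated with a minimal sketch. It assumes a linear noise schedule from 1.0 to 0.0 and treats the timestep boundaries as fractions of the trajectory that select a sub-range of steps. The function names (`linear_sigmas`, `euler_step`, `steps_in_boundary`) and this reading of "boundaries" are hypothetical and are not taken from the ROLL code.

```python
def linear_sigmas(num_steps):
    """Noise levels from 1.0 (pure noise) down to 0.0 (clean sample)."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def euler_step(x, velocity, sigma, sigma_next):
    """One explicit Euler update of the ODE dx/dsigma = velocity."""
    return [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, velocity)]


def steps_in_boundary(sigmas, min_boundary, max_boundary):
    """Step indices inside the [min, max) fraction of the trajectory (assumed semantics)."""
    n = len(sigmas) - 1
    return list(range(int(min_boundary * n), int(max_boundary * n)))


sigmas = linear_sigmas(4)
x = [1.0, -2.0]          # start from "noise"
velocity = [1.0, -2.0]   # toy constant velocity (noise minus a zero target)

# Integrate the full trajectory from sigma = 1.0 down to sigma = 0.0.
for i in range(len(sigmas) - 1):
    x = euler_step(x, velocity, sigmas[i], sigmas[i + 1])

# Steps that fall in the first half of the trajectory.
early_steps = steps_in_boundary(sigmas, 0.0, 0.5)
```

In a real pipeline the velocity would come from the DiT at each step rather than being constant. The toy value only makes the update rule easy to follow.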

Only LoRA parameters on the DiT are trainable; all other components remain frozen.
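The trainable/frozen split can be sketched in PyTorch as follows. This is an illustration under assumptions, not the ROLL implementation: `ToyLoRALinear`, `freeze_except_lora`, and the `lora_` name marker are hypothetical, and small `nn.Linear` layers stand in for the VAE and text encoder.

```python
import torch.nn as nn


class ToyLoRALinear(nn.Module):
    """Hypothetical stand-in for one LoRA-injected DiT projection."""

    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)     # pretrained weight W
        self.lora_A = nn.Linear(dim, rank, bias=False)  # low-rank factor A
        self.lora_B = nn.Linear(rank, dim, bias=False)  # low-rank factor B

    def forward(self, x):
        return self.base(x) + self.lora_B(self.lora_A(x))


def freeze_except_lora(model: nn.Module, marker: str = "lora_") -> int:
    """Leave only parameters whose name contains `marker` trainable.

    Returns the number of trainable scalar parameters.
    """
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = marker in name
        if param.requires_grad:
            trainable += param.numel()
    return trainable


dit = ToyLoRALinear(dim=8, rank=2)
vae = nn.Linear(8, 8)           # stand-in for the frozen VAE
text_encoder = nn.Linear(8, 8)  # stand-in for the frozen text encoder

# Freeze the auxiliary components entirely.
for module in (vae, text_encoder):
    module.requires_grad_(False)

# On the DiT, keep only the LoRA factors trainable.
trainable_params = freeze_except_lora(dit)
```

An optimizer would then be built from `[p for p in dit.parameters() if p.requires_grad]`, so that frozen weights receive no updates.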

Use when initializing a reward flow training pipeline for video diffusion models.
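The reward in such a pipeline is a face identity similarity computed from ArcFace embeddings. The source does not give the exact formula, so the sketch below assumes the common choice of cosine similarity between a reference embedding and an embedding from a generated frame. The function name `identity_reward` is hypothetical, and the embeddings are assumed to have been extracted already by the face model.

```python
import numpy as np


def identity_reward(ref_embedding, gen_embedding, eps=1e-8):
    """Cosine similarity between two face embeddings (assumed reward form)."""
    ref = np.asarray(ref_embedding, dtype=np.float64)
    gen = np.asarray(gen_embedding, dtype=np.float64)
    denom = np.linalg.norm(ref) * np.linalg.norm(gen) + eps  # eps avoids 0/0
    return float(ref @ gen / denom)


same_identity = identity_reward([1.0, 0.0], [2.0, 0.0])   # parallel vectors
different = identity_reward([1.0, 0.0], [0.0, 3.0])       # orthogonal vectors
```

Parallel embeddings score close to 1 and orthogonal embeddings score 0, so a higher reward corresponds to better identity preservation.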

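LoRA adds a low-rank update to the weights of the DiT's attention layers: $W' = W + \alpha \cdot B A$

where $W$ is the frozen pretrained weight, $\alpha$ is a scaling factor, $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$ are the trainable factors, and $r \ll d$.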
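The update rule can be checked numerically with a small NumPy sketch. The dimensions, the scaling value, and the zero initialization of $B$ (a common LoRA convention that makes the adapted weight equal the pretrained one at the start) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 0.5  # illustrative sizes with r < d

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                    # trainable, zero init (assumed convention)


def lora_weight(W, A, B, alpha):
    """Return W' = W + alpha * B @ A."""
    return W + alpha * (B @ A)


W_prime = lora_weight(W, A, B, alpha)

# Parameter counts: the low-rank factors versus the full matrix.
lora_param_count = A.size + B.size  # 2 * d * r
full_param_count = W.size           # d * d
```

Because $B A$ has rank at most $r$, training only $A$ and $B$ updates $2dr$ parameters instead of $d^2$.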
