Principle:Alibaba ROLL Diffusion Model Preparation

From Leeroopedia


Knowledge Sources
Domains: Diffusion_Models, Model_Architecture
Last Updated: 2026-02-07 20:00 GMT

Overview

A principle for preparing a video diffusion pipeline: initializing its components (DiT, VAE, text encoder), injecting LoRA adapters into the DiT, and setting up a reward scorer.

Description

Diffusion Model Preparation handles the complex initialization of a multi-component video diffusion pipeline:

  • DiT (Diffusion Transformer): The main denoising model, injected with LoRA adapters for parameter-efficient training
  • VAE: Video autoencoder that encodes pixel-space video into latents and decodes latents back to pixels (frozen)
  • Text Encoder: Converts text prompts to conditioning embeddings (frozen)
  • Reward Scorer: FaceAnalysis model that computes face identity similarity rewards, using ONNX-based SCRFD for face detection and ArcFace for identity embeddings
  • Euler Scheduler: ODE solver for the denoising trajectory with configurable timestep boundaries
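To make the Euler scheduler item concrete, here is a minimal sketch of Euler integration along a denoising trajectory. It is not the ROLL implementation: the function names `euler_sample` and `make_timesteps`, the toy constant velocity field, and the reading of "timestep boundaries" as a configurable start and end time are all assumptions made for illustration.

```python
def make_timesteps(num_steps, t_start=1.0, t_end=0.0):
    """Evenly spaced timesteps from t_start down to t_end.

    t_start and t_end stand in for the configurable boundaries (assumed meaning).
    """
    step = (t_end - t_start) / num_steps
    return [t_start + i * step for i in range(num_steps + 1)]


def euler_sample(x, velocity, timesteps):
    """Integrate dx/dt = velocity(x, t) with explicit Euler steps."""
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        x = x + (t_next - t_cur) * velocity(x, t_cur)
    return x


# Toy setup (hypothetical): a straight-line path x_t = (1 - t) * clean + t * noise,
# whose velocity dx/dt = noise - clean is constant. A real DiT would predict it.
clean, noise = 2.0, 5.0


def toy_velocity(x, t):
    return noise - clean


# Full trajectory from t = 1 (pure noise) to t = 0 recovers the clean value.
x_full = euler_sample(noise, toy_velocity, make_timesteps(4))

# Restricting the boundaries to [1.0, 0.5] stops halfway along the path.
x_partial = euler_sample(noise, toy_velocity, make_timesteps(2, 1.0, 0.5))
```

In this toy case Euler integration is exact because the velocity is constant; with a learned velocity field, the number of steps trades accuracy against compute.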

Only LoRA parameters on the DiT are trainable; all other components remain frozen.
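The freezing rule can be sketched as a filter over named parameters. The `Param` class, the parameter names, and the `lora_` naming marker below are hypothetical stand-ins, not identifiers from ROLL; in a PyTorch codebase the same pattern is typically applied by toggling `requires_grad` while iterating over `named_parameters()`.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Stand-in for a framework parameter (hypothetical, for illustration)."""
    name: str
    numel: int
    requires_grad: bool = True


def freeze_all_but_lora(params, marker="lora_"):
    """Mark only parameters whose name contains `marker` as trainable."""
    for p in params:
        p.requires_grad = marker in p.name
    return [p for p in params if p.requires_grad]


# Hypothetical parameter names covering the DiT, VAE, and text encoder.
params = [
    Param("dit.blocks.0.attn.q.weight", 1024 * 1024),
    Param("dit.blocks.0.attn.q.lora_A.weight", 16 * 1024),
    Param("dit.blocks.0.attn.q.lora_B.weight", 1024 * 16),
    Param("vae.encoder.conv.weight", 4096),
    Param("text_encoder.embed.weight", 8192),
]

trainable = freeze_all_but_lora(params)
trainable_count = sum(p.numel for p in trainable)
total_count = sum(p.numel for p in params)
```

Only the two LoRA matrices end up in `trainable`; an optimizer would be built from that list alone, so the base DiT weights, the VAE, and the text encoder receive no updates.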

Usage

Use when initializing a reward flow training pipeline for video diffusion models.

Theoretical Basis

LoRA adds low-rank adaptations to the DiT's attention layers: $W' = W + \alpha B A$

Where $A \in \mathbb{R}^{r \times d}$, $B \in \mathbb{R}^{d \times r}$, and $r \ll d$.
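The update rule can be checked numerically with a small pure-Python sketch. The helper names `matmul` and `lora_merge`, the toy sizes, and the initialization (zero B, small random A, a common LoRA convention that this page does not confirm for ROLL) are assumptions for illustration.

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_merge(W, A, B, alpha):
    """Return W' = W + alpha * (B @ A) without modifying W."""
    BA = matmul(B, A)  # d x d update of rank at most r
    return [[w + alpha * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]


d, r, alpha = 8, 2, 0.5  # toy sizes; in practice r is much smaller than d
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen base weight, d x d
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, zero init

W_merged = lora_merge(W, A, B, alpha)

full_params = d * d          # parameters in W
lora_params = r * d + d * r  # parameters in A and B
```

With B initialized to zero, the merged weight equals the base weight, so training starts from the pretrained model's behavior. The adapter holds 2rd parameters against d² for the full matrix, which is where the savings come from when r is much smaller than d.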

