Principle:Alibaba ROLL Diffusion Model Preparation
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Model_Architecture |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A model preparation principle for initializing video diffusion model components (DiT, VAE, text encoder) with LoRA injection and reward scorer setup.
Description
Diffusion Model Preparation handles the complex initialization of a multi-component video diffusion pipeline:
- DiT (Diffusion Transformer): The main denoising model, injected with LoRA adapters for parameter-efficient training
- VAE: Video autoencoder that encodes pixel-space video into latents and decodes latents back to pixels (frozen)
- Text Encoder: Converts text prompts to conditioning embeddings (frozen)
- Reward Scorer: FaceAnalysis model for computing face identity similarity rewards using ONNX-based SCRFD detection and ArcFace embedding
- Euler Scheduler: ODE solver for the denoising trajectory with configurable timestep boundaries
Only LoRA parameters on the DiT are trainable; all other components remain frozen.
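The trainable/frozen split can be illustrated with a minimal sketch. It assumes LoRA weights can be recognized by a `lora_` substring in the parameter name, which is a common convention but not something this page confirms; the parameter names and the `split_trainable` helper are hypothetical.

```python
# Sketch only: the parameter names below are hypothetical, and matching on
# "lora_" is an assumed naming convention, not the confirmed ROLL behavior.

def split_trainable(param_names, marker="lora_"):
    """Partition parameter names into (trainable, frozen).

    A name is treated as trainable only if it contains `marker`.
    """
    trainable = [name for name in param_names if marker in name]
    frozen = [name for name in param_names if marker not in name]
    return trainable, frozen


names = [
    "dit.blocks.0.attn.q.weight",         # base DiT weight: frozen
    "dit.blocks.0.attn.q.lora_A.weight",  # LoRA adapter: trainable
    "dit.blocks.0.attn.q.lora_B.weight",  # LoRA adapter: trainable
    "vae.encoder.conv_in.weight",         # VAE: frozen
    "text_encoder.layers.0.weight",       # text encoder: frozen
]

trainable, frozen = split_trainable(names)
```

In a PyTorch training setup the same partition would typically be applied by setting `requires_grad` on each parameter and passing only the trainable ones to the optimizer.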
Usage
Use when initializing a reward flow training pipeline for video diffusion models.
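For the reward side of such a pipeline, a face identity reward is commonly computed as the cosine similarity between two face embeddings. The sketch below assumes that form and uses random vectors as stand-ins for ArcFace embeddings; the `identity_reward` name and the 512-dimensional size are illustrative assumptions, and no detection or embedding model is invoked.

```python
import numpy as np


def identity_reward(ref_emb, gen_emb, eps=1e-8):
    """Cosine similarity between two embedding vectors, in [-1, 1].

    Assumed reward form; the actual scorer may normalize or rescale differently.
    """
    ref = np.asarray(ref_emb, dtype=np.float64)
    gen = np.asarray(gen_emb, dtype=np.float64)
    denom = np.linalg.norm(ref) * np.linalg.norm(gen)
    # Guard against division by zero for degenerate (all-zero) embeddings.
    return float(ref @ gen / max(denom, eps))


rng = np.random.default_rng(0)
ref = rng.normal(size=512)  # stand-in for a reference face embedding (size assumed)

same = identity_reward(ref, ref)       # identical embeddings
opposite = identity_reward(ref, -ref)  # opposite direction
```

Identical embeddings score at the top of the range and opposite embeddings at the bottom, so a higher reward means the generated face is closer to the reference identity.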
Theoretical Basis
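LoRA adds a trainable low-rank update to the weights of the DiT's attention layers while the pretrained weight $W$ stays frozen:
$$W' = W + BA$$
where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.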
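A minimal NumPy sketch of the standard LoRA form $W' = W + BA$, with $W$ of shape $d \times k$, $B$ of shape $d \times r$, and $A$ of shape $r \times k$. The dimensions, the zero initialization of $B$, and the omission of the usual scaling factor are assumptions made for illustration.

```python
import numpy as np

d, k, r = 64, 48, 4  # illustrative sizes with r much smaller than min(d, k)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))         # frozen pretrained weight
B = np.zeros((d, r))                # zero init: the adapted layer starts equal to W
A = rng.normal(size=(r, k)) * 0.01  # small random init


def lora_forward(x, W, B, A):
    """Apply (W + B A) to x without materializing the d x k update."""
    return W @ x + B @ (A @ x)


x = rng.normal(size=k)
y_initial = lora_forward(x, W, B, A)  # equals W @ x because B is zero

# After B receives a nonzero value, the output matches the merged weight.
B_updated = rng.normal(size=(d, r))
y_adapted = lora_forward(x, W, B_updated, A)
y_merged = (W + B_updated @ A) @ x

full_params = d * k        # parameters in the dense weight
lora_params = r * (d + k)  # parameters in B and A combined
```

With these sizes the adapter holds 448 parameters against 3072 in the dense weight, which is where the parameter efficiency comes from.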
Related Pages
Implemented By
Related Heuristics
The following heuristics inform this principle: