# Principle:Alibaba ROLL LoRA Parameter Optimization
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Optimization |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
A parameter-efficient optimization principle for updating LoRA adapters on diffusion models using reward flow gradients.
## Description
LoRA Parameter Optimization updates only the low-rank adapter parameters on the DiT model, keeping all other components frozen. The loss combines normalized face identity reward with KL regularization:
```
loss = -(face_score - 0.54) / 0.16 * 0.1 + kl_loss
```

The reward normalization (subtracting the 0.54 baseline, then dividing by the 0.16 scale) keeps gradient magnitudes stable.
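As a sketch, the loss above can be written as a small function. The 0.54 baseline, 0.16 scale, and 0.1 reward weight come from the formula; the function and argument names are illustrative, not the project's actual API:

```python
# Sketch of the reward-flow training loss described above.
FACE_BASELINE = 0.54   # baseline subtracted from the face-identity reward
FACE_SCALE = 0.16      # scale used to normalize the reward
REWARD_WEIGHT = 0.1    # weight on the (negated) normalized reward term

def roll_lora_loss(face_score: float, kl_loss: float) -> float:
    """loss = -(face_score - 0.54) / 0.16 * 0.1 + kl_loss"""
    normalized = (face_score - FACE_BASELINE) / FACE_SCALE
    return -normalized * REWARD_WEIGHT + kl_loss

# A face score above the baseline drives the loss down (maximizing reward),
# while the KL term penalizes drift away from the frozen base model.
print(roll_lora_loss(0.70, kl_loss=0.02))
```

Because the reward term is negated, minimizing this loss maximizes the normalized face-identity reward subject to the KL regularizer.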
## Usage
Use this loss as the training objective when fine-tuning a diffusion model with reward flow gradients.
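A minimal NumPy sketch of the "update only the LoRA adapters" setup from the Description: the base weight stays frozen and only the low-rank factors would be handed to the optimizer. The single linear layer, rank, shapes, and initialization here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2  # illustrative sizes, not the DiT's real dimensions

W = rng.standard_normal((d_out, d_in))        # frozen base weight, never updated
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable LoRA down-projection
B = np.zeros((d_out, rank))                   # trainable LoRA up-projection (zero init)

def lora_forward(x):
    # The base path W @ x is frozen; only the low-rank residual B @ (A @ x)
    # changes during reward-flow fine-tuning.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
y = lora_forward(x)

# With B initialized to zero, the adapted model starts identical to the base
# model; the optimizer would receive only [A, B], leaving W untouched.
trainable_params = [A, B]
```

Zero-initializing one of the two factors is the standard LoRA choice: training begins from the base model's behavior, and the adapter only gradually steers it toward higher reward.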
## Theoretical Basis
The normalized reward guides LoRA updates:

$$\mathcal{L} = -\alpha \, \frac{r_{\text{face}} - b}{s} + \mathcal{L}_{\text{KL}}$$

Where $b = 0.54$ is the baseline, $s = 0.16$ is the scale, and $\alpha = 0.1$ is the reward weight.
## Related Pages
### Implemented By
### Related Heuristics
The following heuristics inform this principle: