Implementation:Alibaba ROLL RewardFL ActorWorker Train Step
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Optimization |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete implementation of the reward FL (reward flow) actor worker training step for diffusion-model LoRA optimization, provided by the Alibaba ROLL library.
Description
The `ActorWorker.train_step` and `loss_func` methods compute the reward FL loss, which combines a face identity score with a KL regularization term, and then dispatch the gradient update through the diffusion DeepSpeed strategy.
Usage
Called by the reward flow pipeline for each training batch.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/diffusion/reward_fl/actor_worker.py
- Lines: L15-60
Signature
```python
class ActorWorker(BaseActorWorker):
    @register(dispatch_mode=Dispatch.DP_MP_DISPATCH_FIRST, clear_cache=False)
    def train_step(self, data: DataProto) -> DataProto:
        """
        Training step for reward FL.

        Args:
            data: DataProto with video tensors and prompts

        Returns:
            DataProto with metrics (actor/loss, actor/face_score, actor/kl_loss)
        """

    def loss_func(self, data, loss, face_score, kl_loss) -> Tuple[torch.Tensor, dict]:
        """
        Compute reward FL loss.

        Loss formula: -(face_score - 0.54) / 0.16 * 0.1 + kl_loss
        """
```
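The loss formula from the docstring can be sketched as a standalone function. This is an illustrative reconstruction with plain floats, not the library's actual `loss_func` (which operates on tensors inside the training step); the helper name `reward_fl_loss` is hypothetical, while the constants 0.54 and 0.16 come straight from the docstring:

```python
def reward_fl_loss(face_score: float, kl_loss: float) -> float:
    """Sketch of the reward FL loss from the docstring:

        -(face_score - 0.54) / 0.16 * 0.1 + kl_loss

    The face score is shifted by 0.54 and scaled by 0.16 (constants
    taken from the docstring), weighted by 0.1, and negated so that a
    higher face identity similarity lowers the loss; the KL term
    penalizes drift from the reference diffusion model.
    """
    return -(face_score - 0.54) / 0.16 * 0.1 + kl_loss


# A face score at the 0.54 baseline contributes nothing; only the KL term remains.
print(reward_fl_loss(0.54, 0.02))  # 0.02
# A higher face score pulls the loss below the KL term (here ≈ -0.08).
print(reward_fl_loss(0.70, 0.02))
```

Because the face-score term is negated, gradient descent on this loss pushes the face identity score up while the KL term keeps the LoRA-adapted model close to its reference.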
Import
```python
from roll.pipeline.diffusion.reward_fl.actor_worker import ActorWorker
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | DataProto | Yes | Batch with video tensors and prompt strings |
Outputs
| Name | Type | Description |
|---|---|---|
| metrics | Dict | actor/loss, actor/face_score, actor/kl_loss |
Usage Examples
```python
results = actor_train.execute_all_sync("train_step", batch)
```
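Assuming `execute_all_sync` returns one metrics dict per data-parallel worker with the keys listed in the I/O contract above, driver-side averaging might look like the following sketch (the helper name and the return shape are assumptions, not the ROLL API):

```python
from collections import defaultdict


def mean_metrics(per_worker_metrics: list[dict]) -> dict:
    """Average scalar metrics (actor/loss, actor/face_score, actor/kl_loss)
    across data-parallel workers. Assumes every worker reports the same keys."""
    sums = defaultdict(float)
    for metrics in per_worker_metrics:
        for key, value in metrics.items():
            sums[key] += value
    return {key: total / len(per_worker_metrics) for key, total in sums.items()}


# Two hypothetical per-worker metric reports:
reports = [
    {"actor/loss": -0.06, "actor/face_score": 0.70, "actor/kl_loss": 0.02},
    {"actor/loss": -0.02, "actor/face_score": 0.62, "actor/kl_loss": 0.04},
]
print(mean_metrics(reports))
```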
Environment Dependencies
This implementation requires the following environment constraints:
- Environment:Alibaba_ROLL_CUDA_GPU_Environment
- Environment:Alibaba_ROLL_DeepSpeed_Training_Environment