Implementation:Alibaba ROLL RewardFL ActorWorker Train Step
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Optimization |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete implementation of the reward FL (reward flow) actor worker training step for diffusion-model LoRA optimization, provided by the Alibaba ROLL library.
Description
The `ActorWorker.train_step` and `loss_func` methods compute the reward FL loss, which combines a face identity score with a KL regularization term, and then dispatch the gradient update through the diffusion DeepSpeed strategy.
Usage
Called by the reward flow pipeline for each training batch.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/diffusion/reward_fl/actor_worker.py
- Lines: L15-60
Signature
```python
class ActorWorker(BaseActorWorker):
    @register(dispatch_mode=Dispatch.DP_MP_DISPATCH_FIRST, clear_cache=False)
    def train_step(self, data: DataProto) -> DataProto:
        """
        Training step for reward FL.

        Args:
            data: DataProto with video tensors and prompts

        Returns:
            DataProto with metrics (actor/loss, actor/face_score, actor/kl_loss)
        """

    def loss_func(self, data, loss, face_score, kl_loss) -> Tuple[torch.Tensor, dict]:
        """
        Compute reward FL loss.

        Loss formula: -(face_score - 0.54) / 0.16 * 0.1 + kl_loss
        """
```
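The loss formula from the docstring can be sketched as a standalone function. This is an illustrative reconstruction with plain floats, not the library's actual `loss_func` (which operates on tensors inside the training step); the helper name `reward_fl_loss` is hypothetical, while the constants 0.54 and 0.16 come straight from the docstring:

```python
def reward_fl_loss(face_score: float, kl_loss: float) -> float:
    """Sketch of the reward FL loss from the docstring:

        -(face_score - 0.54) / 0.16 * 0.1 + kl_loss

    The face score is shifted by 0.54 and scaled by 0.16 (constants
    taken from the docstring), weighted by 0.1, and negated so that a
    higher face identity similarity lowers the loss; the KL term
    penalizes drift from the reference diffusion model.
    """
    return -(face_score - 0.54) / 0.16 * 0.1 + kl_loss


# A face score at the 0.54 baseline contributes nothing; only the KL term remains.
print(reward_fl_loss(0.54, 0.02))  # 0.02
# A higher face score pulls the loss below the KL term (here ≈ -0.08).
print(reward_fl_loss(0.70, 0.02))
```

Because the face-score term is negated, gradient descent on this loss pushes the face identity score up while the KL term keeps the LoRA-adapted model close to its reference.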
Import
```python
from roll.pipeline.diffusion.reward_fl.actor_worker import ActorWorker
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | DataProto | Yes | Batch with video tensors and prompt strings |
Outputs
| Name | Type | Description |
|---|---|---|
| metrics | Dict | actor/loss, actor/face_score, actor/kl_loss |
Usage Examples
```python
results = actor_train.execute_all_sync("train_step", batch)
```
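Assuming `execute_all_sync` returns one metrics dict per data-parallel worker with the keys listed in the I/O contract above, driver-side averaging might look like the following sketch (the helper name and the return shape are assumptions, not the ROLL API):

```python
from collections import defaultdict


def mean_metrics(per_worker_metrics: list[dict]) -> dict:
    """Average scalar metrics (actor/loss, actor/face_score, actor/kl_loss)
    across data-parallel workers. Assumes every worker reports the same keys."""
    sums = defaultdict(float)
    for metrics in per_worker_metrics:
        for key, value in metrics.items():
            sums[key] += value
    return {key: total / len(per_worker_metrics) for key, total in sums.items()}


# Two hypothetical per-worker metric reports:
reports = [
    {"actor/loss": -0.06, "actor/face_score": 0.70, "actor/kl_loss": 0.02},
    {"actor/loss": -0.02, "actor/face_score": 0.62, "actor/kl_loss": 0.04},
]
print(mean_metrics(reports))
```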
Environment Dependencies
This implementation requires the following environment constraints:
- Environment:Alibaba_ROLL_CUDA_GPU_Environment
- Environment:Alibaba_ROLL_DeepSpeed_Training_Environment