
Implementation:Alibaba ROLL Agentic ActorWorker Loss Func

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Agentic_AI
Last Updated 2026-02-07 20:00 GMT

Overview

The concrete loss function of the agentic actor worker in the Alibaba ROLL library, with support for both token-level and segment-level PPO ratios.

Description

The ActorWorker.loss_func method in the agentic pipeline computes the PPO loss with support for both token-level and segment-level ratio computation. It handles asymmetric clipping, KL penalty with reference model, entropy regularization, optional dual clipping, and train/infer log-probability correction.
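The core of such a loss can be sketched as follows. This is a minimal illustration, not the ROLL implementation: the function name `ppo_clip_loss`, the parameter names `clip_low`/`clip_high`/`kl_coef`, and the use of the k3 KL estimator are assumptions for the sketch; only the metric keys mirror the ones documented below.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, ref_log_probs, advantages, response_mask,
                  clip_low=0.2, clip_high=0.28, kl_coef=0.01):
    """Masked PPO surrogate with asymmetric clipping and a KL penalty (sketch)."""
    # Per-token importance ratio between current and behavior policy.
    ratio = torch.exp(log_probs - old_log_probs)
    # Asymmetric clipping: independent lower and upper bounds around 1.
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    pg = -torch.min(ratio * advantages, clipped_ratio * advantages)
    # k3 estimator of KL(pi || pi_ref): non-negative, zero when the policies agree.
    log_r = ref_log_probs - log_probs
    kl = torch.exp(log_r) - log_r - 1.0
    mask = response_mask.float()
    denom = mask.sum().clamp(min=1.0)
    pg_loss = (pg * mask).sum() / denom
    kl_loss = (kl * mask).sum() / denom
    # Fraction of tokens where the clipped term was the active (smaller) one.
    clipfrac = (((ratio * advantages) < (clipped_ratio * advantages)).float() * mask).sum() / denom
    total = pg_loss + kl_coef * kl_loss
    metrics = {"actor/pg_loss": pg_loss.item(),
               "actor/kl_loss": kl_loss.item(),
               "actor/ppo_ratio_clipfrac": clipfrac.item()}
    return total, metrics
```

Entropy regularization and dual clipping would add further terms to `total`; the real method also aggregates these metrics across data-parallel ranks.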

Usage

Called by the training strategy (Megatron/DeepSpeed) during each forward-backward pass of policy optimization.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/agentic/agentic_actor_worker.py
  • Lines: L10-148

Signature

class ActorWorker(BaseActorWorker):
    def loss_func(
        self,
        data: DataProto,
        output_tensor: torch.Tensor
    ) -> Tuple[torch.Tensor, Dict[str, float]]:
        """
        Compute PPO loss for agentic policy optimization.

        Args:
            data: DataProto with response_mask, ref_log_probs, advantages,
                  input_ids, attention_mask, optionally infer_logprobs
            output_tensor: Model logits output

        Returns:
            (total_loss, metrics_dict) where metrics include:
            - actor/pg_loss, actor/kl_loss, actor/ppo_ratio_clipfrac
            - actor/ratio_mean, actor/ratio_max, actor/ratio_min
            - actor/approxkl, actor/policykl
        """

Import

from roll.pipeline.agentic.agentic_actor_worker import ActorWorker

I/O Contract

Inputs

  • data (DataProto, required): Training batch with advantages, old_log_probs, ref_log_probs, response_mask
  • output_tensor (torch.Tensor, required): Model logits from the forward pass

Outputs

  • total_loss (torch.Tensor): Scalar loss for gradient computation
  • metrics (Dict[str, float]): Training metrics (pg_loss, kl_loss, clipfrac, ratio stats, approxkl)
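When the optional infer_logprobs field is present, the train/infer log-probability correction compensates for numerical differences between the inference engine that sampled the tokens and the training engine that recomputes their log-probs. One common form is a truncated importance weight; the function name and the `cap` parameter below are illustrative assumptions, not ROLL's API:

```python
import torch

def train_infer_correction(old_log_probs, infer_log_probs, cap=2.0):
    # Ratio between the training engine's and the inference engine's
    # log-probabilities for the same sampled tokens. Truncating at `cap`
    # keeps the variance of the correction bounded.
    weight = torch.exp(old_log_probs - infer_log_probs)
    return torch.clamp(weight, max=cap)
```

The resulting per-token weight would multiply the policy-gradient term before masking and reduction.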

Usage Examples

# Called internally by the training strategy:
loss, metrics = actor_worker.loss_func(
    data=training_batch,
    output_tensor=model_logits
)

# metrics example:
# {"actor/pg_loss@sum": 0.05, "actor/kl_loss@sum": 0.01, "actor/ppo_ratio_clipfrac@sum": 0.12}
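The segment-level ratio mode mentioned in the description can be sketched as follows: sum the per-token log-ratio within each segment, exponentiate once, and broadcast the single ratio back to every token of the segment. The function name and the flat `segment_ids` layout are assumptions for illustration:

```python
import torch

def segment_level_ratio(log_probs, old_log_probs, segment_ids):
    """One importance ratio per segment, broadcast to its tokens (sketch)."""
    # Per-token log-ratio between current and behavior policy.
    delta = log_probs - old_log_probs
    # Sum log-ratios within each segment (segment_ids: int64, one id per token).
    num_segments = int(segment_ids.max().item()) + 1
    seg_sums = torch.zeros(num_segments, dtype=delta.dtype)
    seg_sums.scatter_add_(0, segment_ids, delta)
    # Exponentiate once per segment, then gather back to token positions.
    return torch.exp(seg_sums)[segment_ids]
```

Every token in a segment then shares one ratio, so clipping acts on whole segments rather than individual tokens.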
