
Implementation:Alibaba ROLL Agentic ActorWorker Loss Func

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Agentic_AI
Last Updated 2026-02-07 20:00 GMT

Overview

The concrete loss function of the agentic actor worker in the Alibaba ROLL library, with support for both token-level and segment-level PPO ratios.

Description

The ActorWorker.loss_func method in the agentic pipeline computes the PPO loss with support for both token-level and segment-level ratio computation. It handles asymmetric clipping, KL penalty with reference model, entropy regularization, optional dual clipping, and train/infer log-probability correction.
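The core of such a loss can be sketched as follows. This is a minimal illustration, not the ROLL implementation: the function name `ppo_clip_loss`, the parameter names `clip_low`/`clip_high`/`kl_coef`, and the use of the k3 KL estimator are assumptions for the sketch; only the metric keys mirror the ones documented below.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, ref_log_probs, advantages, response_mask,
                  clip_low=0.2, clip_high=0.28, kl_coef=0.01):
    """Masked PPO surrogate with asymmetric clipping and a KL penalty (sketch)."""
    # Per-token importance ratio between current and behavior policy.
    ratio = torch.exp(log_probs - old_log_probs)
    # Asymmetric clipping: independent lower and upper bounds around 1.
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    pg = -torch.min(ratio * advantages, clipped_ratio * advantages)
    # k3 estimator of KL(pi || pi_ref): non-negative, zero when the policies agree.
    log_r = ref_log_probs - log_probs
    kl = torch.exp(log_r) - log_r - 1.0
    mask = response_mask.float()
    denom = mask.sum().clamp(min=1.0)
    pg_loss = (pg * mask).sum() / denom
    kl_loss = (kl * mask).sum() / denom
    # Fraction of tokens where the clipped term was the active (smaller) one.
    clipfrac = (((ratio * advantages) < (clipped_ratio * advantages)).float() * mask).sum() / denom
    total = pg_loss + kl_coef * kl_loss
    metrics = {"actor/pg_loss": pg_loss.item(),
               "actor/kl_loss": kl_loss.item(),
               "actor/ppo_ratio_clipfrac": clipfrac.item()}
    return total, metrics
```

Entropy regularization and dual clipping would add further terms to `total`; the real method also aggregates these metrics across data-parallel ranks.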

Usage

Called by the training strategy (Megatron/DeepSpeed) during each forward-backward pass of policy optimization.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/agentic/agentic_actor_worker.py
  • Lines: L10-148

Signature

class ActorWorker(BaseActorWorker):
    def loss_func(
        self,
        data: DataProto,
        output_tensor: torch.Tensor
    ) -> Tuple[torch.Tensor, Dict[str, float]]:
        """
        Compute PPO loss for agentic policy optimization.

        Args:
            data: DataProto with response_mask, ref_log_probs, advantages,
                  input_ids, attention_mask, optionally infer_logprobs
            output_tensor: Model logits output

        Returns:
            (total_loss, metrics_dict) where metrics include:
            - actor/pg_loss, actor/kl_loss, actor/ppo_ratio_clipfrac
            - actor/ratio_mean, actor/ratio_max, actor/ratio_min
            - actor/approxkl, actor/policykl
        """

Import

from roll.pipeline.agentic.agentic_actor_worker import ActorWorker

I/O Contract

Inputs

  • data (DataProto, required): Training batch with advantages, old_log_probs, ref_log_probs, response_mask
  • output_tensor (torch.Tensor, required): Model logits from the forward pass

Outputs

  • total_loss (torch.Tensor): Scalar loss for gradient computation
  • metrics (Dict[str, float]): Training metrics (pg_loss, kl_loss, clipfrac, ratio stats, approxkl)
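When the optional infer_logprobs field is present, the train/infer log-probability correction compensates for numerical differences between the inference engine that sampled the tokens and the training engine that recomputes their log-probs. One common form is a truncated importance weight; the function name and the `cap` parameter below are illustrative assumptions, not ROLL's API:

```python
import torch

def train_infer_correction(old_log_probs, infer_log_probs, cap=2.0):
    # Ratio between the training engine's and the inference engine's
    # log-probabilities for the same sampled tokens. Truncating at `cap`
    # keeps the variance of the correction bounded.
    weight = torch.exp(old_log_probs - infer_log_probs)
    return torch.clamp(weight, max=cap)
```

The resulting per-token weight would multiply the policy-gradient term before masking and reduction.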

Usage Examples

# Called internally by the training strategy:
loss, metrics = actor_worker.loss_func(
    data=training_batch,
    output_tensor=model_logits
)

# metrics example:
# {"actor/pg_loss@sum": 0.05, "actor/kl_loss@sum": 0.01, "actor/ppo_ratio_clipfrac@sum": 0.12}
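The segment-level ratio mode mentioned in the description can be sketched as follows: sum the per-token log-ratio within each segment, exponentiate once, and broadcast the single ratio back to every token of the segment. The function name and the flat `segment_ids` layout are assumptions for illustration:

```python
import torch

def segment_level_ratio(log_probs, old_log_probs, segment_ids):
    """One importance ratio per segment, broadcast to its tokens (sketch)."""
    # Per-token log-ratio between current and behavior policy.
    delta = log_probs - old_log_probs
    # Sum log-ratios within each segment (segment_ids: int64, one id per token).
    num_segments = int(segment_ids.max().item()) + 1
    seg_sums = torch.zeros(num_segments, dtype=delta.dtype)
    seg_sums.scatter_add_(0, segment_ids, delta)
    # Exponentiate once per segment, then gather back to token positions.
    return torch.exp(seg_sums)[segment_ids]
```

Every token in a segment then shares one ratio, so clipping acts on whole segments rather than individual tokens.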
