Implementation:Hpcaitech ColossalAI Ray Performance Evaluator
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning, Performance Profiling, Distributed Training |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Performance evaluation callbacks for the Ray-based distributed RLHF pipeline. They measure throughput, TFLOPS, and timing breakdowns for both the experience-making and training phases.
Description
This module provides two callback classes: ExperienceMakerPerformanceEvaluator (a subclass of MakerCallback) and TrainerPerformanceEvaluator (a subclass of TrainerCallback). The experience maker evaluator tracks the make-experience and send durations, and computes FLOP counts for actor generation and for the forward passes of the actor, critic, initial, and reward models. The trainer evaluator tracks the training and update durations, and computes FLOP counts for the actor and critic forward-backward passes, optionally including the overhead of gradient checkpointing.
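The FLOP accounting described above can be approximated with the common rule of thumb of 2 FLOPs per parameter per token for a forward pass, a backward pass costing roughly twice the forward pass, and gradient checkpointing adding one recomputed forward pass. The sketch below works under those assumptions; the function names are hypothetical, and the exact formulas used in performance_evaluator.py may differ.

```python
def forward_flop(num_params: int, num_tokens: int) -> int:
    """Approximate FLOPs of one forward pass: 2 per parameter per token."""
    return 2 * num_params * num_tokens


def forward_backward_flop(
    num_params: int,
    num_tokens: int,
    enable_grad_checkpoint: bool = False,
) -> int:
    """Approximate FLOPs of one training step (forward + backward).

    Assumption: forward counts as 1x, backward as ~2x, and gradient
    checkpointing recomputes the forward pass once more (+1x).
    """
    passes = 3 + int(enable_grad_checkpoint)
    return passes * forward_flop(num_params, num_tokens)


if __name__ == "__main__":
    tokens = 8 * 512  # hypothetical batch: 8 sequences of 512 tokens
    print(forward_flop(7_000_000_000, tokens))
    print(forward_backward_flop(7_000_000_000, tokens, enable_grad_checkpoint=True))
```

Under this approximation, enabling gradient checkpointing raises the counted training FLOPs by a factor of 4/3 relative to a plain forward-backward pass.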
Both evaluators aggregate metrics across distributed workers using all_reduce_mean and print a formatted performance summary at the end of their respective lifecycles. The module also provides utility functions get_world_size, print_rank_0, all_reduce_mean, and a Timer helper class.
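As an illustration of the helpers named above, here is a minimal sketch of an accumulating timer and a mean reduction over workers. It is not the module's actual code: the bodies are assumptions based only on the signatures listed in this page, and the choice of `perf_counter` and the device handling are illustrative.

```python
from time import perf_counter
from typing import Optional


class Timer:
    """Accumulates wall-clock time across repeated start()/end() pairs."""

    def __init__(self) -> None:
        self.start_time: Optional[float] = None
        self.duration: float = 0.0

    def start(self) -> None:
        self.start_time = perf_counter()

    def end(self) -> None:
        if self.start_time is None:
            raise RuntimeError("Timer.end() called before Timer.start()")
        # Add the elapsed interval to the running total.
        self.duration += perf_counter() - self.start_time
        self.start_time = None

    def reset(self) -> None:
        self.duration = 0.0
        self.start_time = None


def all_reduce_mean(x: float, world_size: int) -> float:
    """Average a scalar across workers; returns x unchanged for one process."""
    if world_size == 1:
        return x
    # Imported lazily so the single-process path works without PyTorch.
    import torch
    import torch.distributed as dist

    tensor = torch.tensor([x], dtype=torch.float64)
    if dist.get_backend() == "nccl":
        # NCCL only reduces CUDA tensors (assumes a device is already selected).
        tensor = tensor.cuda()
    dist.all_reduce(tensor)  # default reduction op is SUM
    return (tensor / world_size).item()
```

Summing and then dividing by the world size yields the mean because every worker contributes exactly one scalar.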
Usage
Use ExperienceMakerPerformanceEvaluator as a callback for ExperienceMakerHolder when profiling inference performance during experience generation. Use TrainerPerformanceEvaluator as a callback for DetachedTrainer when profiling training performance. Both are typically enabled via an eval_performance flag during initialization.
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/coati/ray/callbacks/performance_evaluator.py
- Lines: 1-214
Signature
class ExperienceMakerPerformanceEvaluator(MakerCallback):
    def __init__(
        self,
        actor_num_params: int,
        critic_num_params: int,
        initial_model_num_params: int,
        reward_model_num_params: int,
    ) -> None: ...

class TrainerPerformanceEvaluator(TrainerCallback):
    def __init__(
        self,
        actor_num_params: int,
        critic_num_params: int,
        enable_grad_checkpoint: bool = False,
        ignore_first_episodes: int = 1,
    ) -> None: ...

def get_world_size() -> int: ...
def print_rank_0(*args, **kwargs) -> None: ...
def all_reduce_mean(x: float, world_size: int) -> float: ...

class Timer:
    def start(self) -> None: ...
    def end(self) -> None: ...
    def reset(self) -> None: ...
Import
from coati.ray.callbacks.performance_evaluator import (
    ExperienceMakerPerformanceEvaluator,
    TrainerPerformanceEvaluator,
    Timer,
)
I/O Contract
Inputs (ExperienceMakerPerformanceEvaluator)
| Name | Type | Required | Description |
|---|---|---|---|
| actor_num_params | int | Yes | Number of parameters in the actor model |
| critic_num_params | int | Yes | Number of parameters in the critic model |
| initial_model_num_params | int | Yes | Number of parameters in the initial (reference) model |
| reward_model_num_params | int | Yes | Number of parameters in the reward model |
Inputs (TrainerPerformanceEvaluator)
| Name | Type | Required | Description |
|---|---|---|---|
| actor_num_params | int | Yes | Number of parameters in the actor model |
| critic_num_params | int | Yes | Number of parameters in the critic model |
| enable_grad_checkpoint | bool | No | Whether gradient checkpointing is enabled (default False) |
| ignore_first_episodes | int | No | Number of initial episodes to skip for warmup (default 1) |
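The warmup behavior of ignore_first_episodes can be pictured as a simple gate on the episode index. The sketch below is hypothetical: the `WarmupGate` class is not part of the module, and it assumes zero-based episode numbering.

```python
class WarmupGate:
    """Hypothetical helper: decides whether an episode should be measured."""

    def __init__(self, ignore_first_episodes: int = 1) -> None:
        if ignore_first_episodes < 0:
            raise ValueError("ignore_first_episodes must be non-negative")
        self.ignore_first_episodes = ignore_first_episodes

    def should_measure(self, episode: int) -> bool:
        # Assumption: episodes are numbered from 0, so the first
        # `ignore_first_episodes` indices are treated as warmup.
        return episode >= self.ignore_first_episodes


if __name__ == "__main__":
    gate = WarmupGate(ignore_first_episodes=1)
    for episode in range(3):
        print(episode, gate.should_measure(episode))
```

Skipping the first episodes keeps one-time costs such as memory allocation and kernel compilation out of the reported averages.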
Outputs
| Name | Type | Description |
|---|---|---|
| return | None | Performance summary is printed to stdout on rank 0 |
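To show how the quantities in the printed summary relate to each other, the following sketch derives throughput and TFLOPS from a sample count, a FLOP total, and a duration. The `summarize` function and its dictionary keys are hypothetical; the real summary format and any per-phase breakdown are defined in performance_evaluator.py.

```python
from typing import Dict


def summarize(num_samples: int, total_flop: float, duration_s: float) -> Dict[str, float]:
    """Derive throughput (samples/s) and TFLOPS from raw counters.

    Assumption: all inputs describe the same measured interval on one
    worker, for example after averaging across workers.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return {
        "throughput": num_samples / duration_s,
        "tflops": total_flop / 1e12 / duration_s,
    }


if __name__ == "__main__":
    stats = summarize(num_samples=100, total_flop=2e12, duration_s=2.0)
    print(f"throughput: {stats['throughput']:.2f} samples/s")
    print(f"tflops: {stats['tflops']:.2f}")
```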
Usage Examples
from coati.ray.callbacks.performance_evaluator import (
    ExperienceMakerPerformanceEvaluator,
    TrainerPerformanceEvaluator,
)

# For experience maker profiling
maker_evaluator = ExperienceMakerPerformanceEvaluator(
    actor_num_params=7_000_000_000,
    critic_num_params=7_000_000_000,
    initial_model_num_params=7_000_000_000,
    reward_model_num_params=7_000_000_000,
)

# For trainer profiling
trainer_evaluator = TrainerPerformanceEvaluator(
    actor_num_params=7_000_000_000,
    critic_num_params=7_000_000_000,
    enable_grad_checkpoint=True,
    ignore_first_episodes=1,
)