Implementation:Hpcaitech ColossalAI Evaluator
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Evaluator is the main evaluation orchestrator class in ColossalEval. It coordinates GPT-3.5/GPT-4-based evaluation of a single model's outputs and battle comparison between the outputs of two models.
Description
The class manages two evaluation modes. battle() performs pairwise comparison between two models, using GPT-4 as the reviewer. evaluate() performs comprehensive single-model evaluation, using the GPT-based metrics configured for each category. For each category, evaluate() collects the matching answers, looks up the category's prompt (falling back to the "general" prompt if no category-specific prompt is found), and delegates the scoring to the gpt_evaluate module. save() persists evaluation results, statistics, and analysis charts for both modes.
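The prompt fallback described above can be sketched as follows. This is a minimal illustration, not the code in evaluator.py: the helper name select_prompt and the sample prompt contents are hypothetical, and only the "general" fallback key comes from the description.

```python
from typing import Any, Dict


def select_prompt(category: str, prompts: Dict[str, Any]) -> Any:
    """Return the prompt for `category`, falling back to the "general" entry."""
    prompt = prompts.get(category)
    if prompt is None:
        # No category-specific prompt: use the shared "general" prompt instead.
        prompt = prompts["general"]
    return prompt


# Hypothetical prompt templates, keyed by category name.
prompts = {
    "general": {"prompt": "Rate the answer overall."},
    "brainstorming": {"prompt": "Rate the creativity of the answer."},
}

print(select_prompt("brainstorming", prompts)["prompt"])  # category-specific prompt
print(select_prompt("roleplay", prompts)["prompt"])       # falls back to "general"
```

Under this sketch, a prompt dictionary without a "general" entry raises a KeyError for any category that has no prompt of its own, so it is worth including that entry.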
Usage
Use this class when you need to run GPT-based evaluation of model outputs in the ColossalEval framework. It supports both single-model evaluation with multiple metrics and head-to-head model comparison (battle mode).
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalEval/colossal_eval/evaluate/evaluator.py
- Lines: 1-111
Signature
class Evaluator(object):
    def __init__(
        self,
        params: Dict[str, Any],
        battle_prompt: Dict[str, Any],
        gpt_evaluation_prompt: Dict[str, Any],
        gpt_model: str,
        language: str,
        gpt_with_reference: bool,
    ) -> None:

    def battle(self, answers1: List[Dict], answers2: List[Dict]) -> None:

    def evaluate(self, answers: List[Dict], targets: List[Dict], save_path: str, model_name: str) -> None:

    def save(self, path: str, model_name_list: List[str]) -> None:
Import
from colossal_eval.evaluate.evaluator import Evaluator
I/O Contract
Inputs (__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| params | Dict[str, Any] | Yes | Configuration dictionary mapping categories to their evaluation metrics (e.g., GPT metrics per category) |
| battle_prompt | Dict[str, Any] | Yes | Prompt template for pairwise model battle comparison |
| gpt_evaluation_prompt | Dict[str, Any] | Yes | Dictionary mapping categories to GPT evaluation prompt templates |
| gpt_model | str | Yes | Name of the GPT model to use for evaluation (e.g., "gpt-4") |
| language | str | Yes | Language for evaluation (e.g., "English", "Chinese") |
| gpt_with_reference | bool | Yes | Whether to include reference answers in GPT evaluation |
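As a rough illustration of the params shape, the sketch below picks out the categories that have GPT metrics configured. It assumes the {"category": {"GPT": [...]}} layout used in the usage example on this page. The helper name gpt_metrics_by_category and the non-GPT "Metrics" entry are hypothetical and are not part of the Evaluator API.

```python
from typing import Any, Dict, List


def gpt_metrics_by_category(params: Dict[str, Any]) -> Dict[str, List[str]]:
    """Collect the GPT metrics configured per category, skipping categories that have none."""
    selected: Dict[str, List[str]] = {}
    for category, metrics in params.items():
        gpt_metrics = metrics.get("GPT")
        if gpt_metrics:
            selected[category] = list(gpt_metrics)
    return selected


params = {
    "general": {"GPT": ["relevance", "coherence"]},
    # Hypothetical category with no GPT metrics; it is skipped below.
    "classification": {"Metrics": ["Precision"]},
}

print(gpt_metrics_by_category(params))
```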
Outputs
| Name | Type | Description |
|---|---|---|
| self.gpt_evaluation_results | Dict | Dictionary of per-category GPT evaluation results populated after calling evaluate() |
| self.battle_results | List | List of battle comparison results populated after calling battle() |
Usage Examples
from colossal_eval.evaluate.evaluator import Evaluator

# battle_prompt_dict, eval_prompt_dict, answers, targets, answers1, and answers2
# are not defined here; they are assumed to be loaded beforehand.
evaluator = Evaluator(
    params={"general": {"GPT": ["relevance", "coherence"]}},
    battle_prompt=battle_prompt_dict,
    gpt_evaluation_prompt=eval_prompt_dict,
    gpt_model="gpt-4",
    language="English",
    gpt_with_reference=True,
)

# Single model evaluation
evaluator.evaluate(answers, targets, save_path="/path/to/save", model_name="my_model")
evaluator.save(path="/path/to/output", model_name_list=["my_model"])

# Battle mode (two models)
evaluator.battle(answers1, answers2)
evaluator.save(path="/path/to/output", model_name_list=["model_a", "model_b"])