Implementation:Hpcaitech ColossalAI Evaluator

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Evaluation, Benchmarking
Last Updated	2026-02-09 00:00 GMT

Overview

Evaluator is the main orchestrator class in ColossalEval. It coordinates GPT-based (GPT-3.5/GPT-4) evaluation of a single model's outputs and pairwise "battle" comparison between the outputs of two models.

Description

The class supports two evaluation modes. battle() runs a pairwise comparison between two models, using GPT-4 as the reviewer. evaluate() runs a single-model evaluation with configurable GPT-based metrics per category: it processes the answers category by category, looks up the prompt for each category (falling back to the "general" prompt when no category-specific prompt exists), and delegates the scoring to the gpt_evaluate module. save() persists evaluation results, statistics, and analysis charts for both modes.
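
The prompt lookup with its "general" fallback can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from evaluator.py: select_prompt and the sample prompt dictionary are hypothetical names introduced here.

```python
from typing import Any, Dict


def select_prompt(gpt_evaluation_prompt: Dict[str, Any], category: str) -> Any:
    """Return the prompt for `category`, falling back to the "general" prompt.

    Hypothetical helper for illustration; not part of ColossalEval.
    """
    prompt = gpt_evaluation_prompt.get(category)
    if prompt is None:
        # No category-specific prompt: use the shared "general" prompt instead.
        prompt = gpt_evaluation_prompt["general"]
    return prompt


# Assumed prompt layout: one entry per category plus a "general" entry.
prompts = {
    "general": {"prompt": "Rate the answer."},
    "brainstorming": {"prompt": "Rate the creativity of the answer."},
}

chosen = select_prompt(prompts, "brainstorming")  # category-specific prompt
fallback = select_prompt(prompts, "roleplay")     # no entry, so "general" is used
```

With this layout, a category without its own entry still gets evaluated, provided the dictionary contains a "general" prompt.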

Usage

Use this class when you need to run GPT-based evaluation of model outputs in the ColossalEval framework. It supports both single-model evaluation with multiple metrics and head-to-head model comparison (battle mode).

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalEval/colossal_eval/evaluate/evaluator.py
Lines: 1-111

Signature

from typing import Any, Dict, List

class Evaluator(object):
    def __init__(
        self,
        params: Dict[str, Any],
        battle_prompt: Dict[str, Any],
        gpt_evaluation_prompt: Dict[str, Any],
        gpt_model: str,
        language: str,
        gpt_with_reference: bool,
    ) -> None: ...

    def battle(self, answers1: List[Dict], answers2: List[Dict]) -> None: ...
    def evaluate(self, answers: List[Dict], targets: List[Dict], save_path: str, model_name: str) -> None: ...
    def save(self, path: str, model_name_list: List[str]) -> None: ...

Import

from colossal_eval.evaluate.evaluator import Evaluator

I/O Contract

Inputs (init)

Name	Type	Required	Description
params	Dict[str, Any]	Yes	Configuration dictionary mapping each category to its evaluation metrics (e.g., {"general": {"GPT": ["relevance", "coherence"]}})
battle_prompt	Dict[str, Any]	Yes	Prompt template for pairwise model battle comparison
gpt_evaluation_prompt	Dict[str, Any]	Yes	Dictionary mapping categories to GPT evaluation prompt templates
gpt_model	str	Yes	Name of the GPT model to use for evaluation (e.g., "gpt-4")
language	str	Yes	Language for evaluation (e.g., "English", "Chinese")
gpt_with_reference	bool	Yes	Whether to include reference answers in GPT evaluation
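
As a rough sketch of how a params dictionary of this kind might be consumed, the snippet below groups answers by category and collects the GPT metrics configured for each category. The "category" key on each answer and both helper names are assumptions made for illustration; only the params shape follows the usage example on this page.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_category(answers: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket answers by their "category" field (an assumed key)."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for answer in answers:
        grouped[answer["category"]].append(answer)
    return dict(grouped)


def gpt_metrics_per_category(params: Dict[str, Any]) -> Dict[str, List[str]]:
    """Keep only the categories that configure a non-empty list of GPT metrics."""
    return {
        category: config["GPT"]
        for category, config in params.items()
        if config.get("GPT")
    }


# Hypothetical sample data.
params = {
    "general": {"GPT": ["relevance", "coherence"]},
    "math": {},  # no GPT metrics configured, so this category is skipped
}
answers = [
    {"category": "general", "output": "first answer"},
    {"category": "general", "output": "second answer"},
    {"category": "math", "output": "third answer"},
]

grouped = group_by_category(answers)
metrics = gpt_metrics_per_category(params)
```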

Outputs

Name	Type	Description
self.gpt_evaluation_results	Dict	Dictionary of per-category GPT evaluation results populated after calling evaluate()
self.battle_results	List	List of battle comparison results populated after calling battle()

Usage Examples

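from colossal_eval.evaluate.evaluator import Evaluator

# battle_prompt_dict and eval_prompt_dict are placeholders for prompt
# dictionaries loaded elsewhere. answers, targets, answers1, and answers2
# are placeholders for lists of answer/target dicts.
evaluator = Evaluator(
    params={"general": {"GPT": ["relevance", "coherence"]}},
    battle_prompt=battle_prompt_dict,
    gpt_evaluation_prompt=eval_prompt_dict,
    gpt_model="gpt-4",
    language="English",
    gpt_with_reference=True,
)

# Single model evaluation: results are stored in evaluator.gpt_evaluation_results
evaluator.evaluate(answers, targets, save_path="/path/to/save", model_name="my_model")
evaluator.save(path="/path/to/output", model_name_list=["my_model"])

# Battle mode (two models): results are stored in evaluator.battle_results
evaluator.battle(answers1, answers2)
evaluator.save(path="/path/to/output", model_name_list=["model_a", "model_b"])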
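
In the example above, save() receives one model name after a single-model evaluation and two names after a battle. One plausible way for a single entry point to tell the two modes apart is the length of model_name_list. The sketch below illustrates that idea only; select_save_mode is a hypothetical helper, and the dispatch rule and error handling are assumptions rather than confirmed behavior of evaluator.py.

```python
from typing import List


def select_save_mode(model_name_list: List[str]) -> str:
    """Hypothetical helper: choose which results to persist from the number of model names."""
    if len(model_name_list) == 2:
        # Two names: treat the call as a head-to-head comparison.
        return "battle"
    if len(model_name_list) == 1:
        # One name: treat the call as a single-model evaluation.
        return "single"
    # Rejecting other lengths is a choice made for this sketch.
    raise ValueError("expected one or two model names")


single_mode = select_save_mode(["my_model"])
battle_mode = select_save_mode(["model_a", "model_b"])
```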
