Implementation:Hpcaitech ColossalAI Evaluator

From Leeroopedia


Knowledge Sources
Domains: Evaluation, Benchmarking
Last Updated: 2026-02-09 00:00 GMT

Overview

Evaluator is the main orchestrator class for evaluation in ColossalEval. It coordinates GPT-3.5/GPT-4-based scoring of a single model's outputs and pairwise "battle" comparison between the outputs of two models.

Description

The class manages two evaluation modes. battle() performs a pairwise comparison between two models, using GPT-4 as the reviewer. evaluate() performs a single-model evaluation using the GPT-based metrics configured for each category: it processes answers category by category, looks up the prompt for each category (falling back to the "general" prompt when no category-specific prompt is found), and delegates the scoring to the gpt_evaluate module. save() persists evaluation results, statistics, and analysis charts for both modes.
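The per-category prompt lookup described above can be sketched as follows. This is a minimal standalone illustration, not the ColossalEval implementation: the helper names group_by_category and select_prompt, the "category" key on each answer record, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_category(answers: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket answer records by their (assumed) "category" field."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for answer in answers:
        grouped[answer["category"]].append(answer)
    return dict(grouped)


def select_prompt(category: str, prompts: Dict[str, Any]) -> Any:
    """Return the category-specific prompt, or the "general" one if absent."""
    if category in prompts:
        return prompts[category]
    return prompts["general"]


# Hypothetical data, for illustration only.
prompts = {
    "general": "Rate the overall quality of the answer.",
    "brainstorming": "Rate the creativity and diversity of the ideas.",
}
answers = [
    {"category": "brainstorming", "output": "..."},
    {"category": "roleplay", "output": "..."},
]

for category, items in group_by_category(answers).items():
    prompt = select_prompt(category, prompts)
    print(f"{category}: {len(items)} answer(s), prompt = {prompt!r}")
```

In this sketch the "roleplay" answers have no dedicated prompt, so they are paired with the "general" one, while "brainstorming" uses its own entry.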

Usage

Use this class when you need to run GPT-based evaluation of model outputs in the ColossalEval framework. It supports both single-model evaluation with multiple metrics and head-to-head model comparison (battle mode).

Code Reference

Source Location

Signature

from typing import Any, Dict, List

class Evaluator(object):
    def __init__(
        self,
        params: Dict[str, Any],
        battle_prompt: Dict[str, Any],
        gpt_evaluation_prompt: Dict[str, Any],
        gpt_model: str,
        language: str,
        gpt_with_reference: bool,
    ) -> None: ...

    # Method bodies are elided; only the signatures are shown.
    def battle(self, answers1: List[Dict], answers2: List[Dict]) -> None: ...
    def evaluate(self, answers: List[Dict], targets: List[Dict], save_path: str, model_name: str) -> None: ...
    def save(self, path: str, model_name_list: List[str]) -> None: ...

Import

from colossal_eval.evaluate.evaluator import Evaluator

I/O Contract

Inputs (__init__)

Name | Type | Required | Description
params | Dict[str, Any] | Yes | Configuration dictionary mapping categories to their evaluation metrics (e.g., GPT metrics per category)
battle_prompt | Dict[str, Any] | Yes | Prompt template for pairwise model battle comparison
gpt_evaluation_prompt | Dict[str, Any] | Yes | Dictionary mapping categories to GPT evaluation prompt templates
gpt_model | str | Yes | Name of the GPT model to use for evaluation (e.g., "gpt-4")
language | str | Yes | Language for evaluation (e.g., "English", "Chinese")
gpt_with_reference | bool | Yes | Whether to include reference answers in GPT evaluation

Outputs

Name | Type | Description
self.gpt_evaluation_results | Dict | Per-category GPT evaluation results, populated after calling evaluate()
self.battle_results | List | Battle comparison results, populated after calling battle()

Usage Examples

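from colossal_eval.evaluate.evaluator import Evaluator

# battle_prompt_dict, eval_prompt_dict, answers, targets, answers1 and answers2
# are placeholders for data loaded elsewhere; they are not defined in this snippet.
evaluator = Evaluator(
    params={"general": {"GPT": ["relevance", "coherence"]}},  # GPT metrics per category
    battle_prompt=battle_prompt_dict,
    gpt_evaluation_prompt=eval_prompt_dict,
    gpt_model="gpt-4",
    language="English",
    gpt_with_reference=True,  # include reference answers in the GPT evaluation
)

# Single-model evaluation: populates evaluator.gpt_evaluation_results
evaluator.evaluate(answers, targets, save_path="/path/to/save", model_name="my_model")
evaluator.save(path="/path/to/output", model_name_list=["my_model"])

# Battle mode (two models): populates evaluator.battle_results
evaluator.battle(answers1, answers2)
evaluator.save(path="/path/to/output", model_name_list=["model_a", "model_b"])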

Related Pages
