
Implementation:Hpcaitech ColossalAI GPT Evaluate

From Leeroopedia


Knowledge Sources
Domains Evaluation, GPT, NLG Metrics, Benchmarking
Last Updated 2026-02-09 00:00 GMT

Overview

gpt_evaluate.py is an evaluation module that uses GPT models (GPT-3.5-turbo, GPT-4, text-davinci-003) to assess the quality of language model outputs through battle comparisons and metric-based scoring.

Description

This module provides two primary evaluation paradigms: battle mode, a pairwise comparison of two models' answers scored by GPT-4, and metric-based evaluation, which scores individual answers on dimensions such as correctness, relevance, and fluency. It supports both chat completion models (GPT-3.5-turbo, GPT-4), which return plain-text scores, and completion models (text-davinci-003), which return log probabilities for probabilistic scoring. The module also handles multi-turn conversations for reference-based evaluation, concurrent API calls via ThreadPoolExecutor, result persistence in JSON format, and visualization of statistics as seaborn bar charts.

Usage

Use this module as part of the ColossalEval framework to evaluate generated text quality. It is called from the evaluation pipeline to score model outputs using GPT models as judges, supporting both head-to-head model comparisons and absolute quality scoring across multiple metrics.

Code Reference

Source Location

Signature

def get_battle_result(sys_prompt: str, user_prompt: str, id: int,
                      max_tokens: int = 2048) -> Dict[str, Any]

def parse_battle_score(evaluation: str) -> List[float]

def battle(answer1: List[Dict], answer2: List[Dict],
           prompt_dict: Dict[str, Any]) -> List[Dict]

def save_battle_results(evaluations: List[Dict], name1: str, name2: str,
                        save_path: str) -> None

def evaluate(answers: List[Dict], prompt: Dict[str, Any], metrics: List[str],
             category: str, save_path: str, model_name: str, model: str,
             language: str, references: List[Dict] = None) -> List[Dict]

def save_gpt_evaluation_results(model_name: str,
                                gpt_evaluation_results: Dict[str, Any],
                                save_path: str) -> Dict[str, Any]

def save_gpt_evaluation_statistics(model_name: str, evaluations: List[Dict],
                                   save_path: str) -> None

def analyze_gpt_evaluation_statistics(statistics_path: str,
                                      save_path: str) -> None

Import

from colossal_eval.evaluate.gpt_evaluate import (
    battle,
    evaluate,
    save_battle_results,
    save_gpt_evaluation_results,
    save_gpt_evaluation_statistics,
    analyze_gpt_evaluation_statistics,
)

I/O Contract

Inputs (battle function)

Name | Type | Required | Description
answer1 | List[Dict] | Yes | Answers from model 1, each with id, instruction, input, output, category
answer2 | List[Dict] | Yes | Answers from model 2 (must match answer1 by id)
prompt_dict | Dict[str, Any] | Yes | Battle prompt containing system_prompt, prompt_template, and prompt
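
The requirement that answer2 match answer1 by id can be checked before any API call is made. The following is a minimal sketch of such a check, not code from gpt_evaluate.py: the helper name pair_answers_by_id and the sample records are hypothetical, and only the field names (id, instruction, input, output, category) come from the table above.

```python
from typing import Any, Dict, List, Tuple

Answer = Dict[str, Any]


def pair_answers_by_id(answer1: List[Answer], answer2: List[Answer]) -> List[Tuple[Answer, Answer]]:
    """Pair each answer from model 1 with the model 2 answer that has the same id.

    Hypothetical helper for illustration; raises ValueError when the two
    lists do not cover the same set of ids.
    """
    by_id = {a["id"]: a for a in answer2}
    missing = [a["id"] for a in answer1 if a["id"] not in by_id]
    if missing or len(by_id) != len(answer1):
        raise ValueError(f"answer lists do not match by id (missing in answer2: {missing})")
    return [(a, by_id[a["id"]]) for a in answer1]


# Sample records using the fields listed in the table (values are made up).
model1_demo = [
    {"id": 1, "instruction": "Define recall.", "input": "", "output": "TP / (TP + FN)", "category": "qa"},
    {"id": 2, "instruction": "Say hi.", "input": "", "output": "Hi!", "category": "chat"},
]
model2_demo = [
    {"id": 2, "instruction": "Say hi.", "input": "", "output": "Hello there.", "category": "chat"},
    {"id": 1, "instruction": "Define recall.", "input": "", "output": "True positive rate.", "category": "qa"},
]

pairs = pair_answers_by_id(model1_demo, model2_demo)
```

Pairing through a dictionary keeps the check independent of list order; whether battle itself tolerates reordered lists is not stated on this page, so passing both lists in the same order is the safer choice.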

Inputs (evaluate function)

Name | Type | Required | Description
answers | List[Dict] | Yes | Model answers to evaluate
prompt | Dict[str, Any] | Yes | Prompt dict with prompt template, CoT, and metrics
metrics | List[str] | Yes | Metrics for evaluation (e.g., Correctness, Relevance)
category | str | Yes | Category of answers being evaluated
save_path | str | Yes | Path to save evaluation results
model_name | str | Yes | Name of the model being evaluated
model | str | Yes | GPT model for judging (e.g., gpt-4, gpt-3.5-turbo, text-davinci-003)
language | str | Yes | Language: en or cn
references | List[Dict] | No | Reference answers for reference-based evaluation

Outputs

Name | Type | Description
evaluations | List[Dict] | List of evaluation results with scores per metric
statistics_json | JSON file | Per-category average scores, best 3, worst 3
visualization | PNG files | Bar chart visualizations comparing model scores per category
battle_results | JSON file | Win/loss/tie counts and win rates for pairwise comparison

Usage Examples

from colossal_eval.evaluate.gpt_evaluate import battle, save_battle_results, evaluate

# model1_answers, model2_answers, prompt_dict, model_answers, prompt_config,
# and reference_answers are placeholders; load them before running this snippet.

# Pairwise battle evaluation
evaluations = battle(model1_answers, model2_answers, prompt_dict)
save_battle_results(evaluations, "model1", "model2", "./results")

# Metric-based evaluation with reference answers
results = evaluate(
    answers=model_answers,
    prompt=prompt_config,
    metrics=["Correctness", "Relevance", "Fluency"],
    category="general",
    save_path="./eval_results",
    model_name="my_model",
    model="gpt-4",
    language="en",
    references=reference_answers,
)

Key Functions

Battle Mode

  • get_battle_result - Gets a single pairwise comparison from GPT-4 with retry logic
  • parse_battle_score - Parses scores from GPT-4 evaluation text using multiple regex patterns
  • battle - Orchestrates concurrent pairwise comparisons across all answer pairs
  • save_battle_results - Computes win rates and saves categorized results (better, worse, tie, invalid)
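
To make the parsing and tallying steps above concrete, here is a minimal sketch of how a judge's free-text verdict might be turned into a score pair and then into better/worse/tie/invalid counts. It is an illustration under stated assumptions, not the module's code: the function names parse_score_pair and tally_battles, the three regex patterns, the [-1, -1] sentinel for unparseable text, and the win-rate definition (wins divided by valid comparisons) are all assumptions.

```python
import re
from typing import Dict, List

# Assumed phrasings a judge model might use; the real module's patterns may differ.
_PATTERNS = [
    re.compile(r"\b(10|[0-9]) out of 10"),
    re.compile(r"a score of (10|[0-9])\b"),
    re.compile(r"\b(10|[0-9])/10"),
]


def parse_score_pair(evaluation: str) -> List[float]:
    """Return [score1, score2], or [-1.0, -1.0] when no pair can be found."""
    for pattern in _PATTERNS:
        found = pattern.findall(evaluation)
        if len(found) == 2:
            return [float(found[0]), float(found[1])]
    # Fallback: a first line that is just two numbers, e.g. "8 7" or "8, 7".
    tokens = evaluation.strip().split("\n")[0].replace(",", " ").split()
    if len(tokens) == 2:
        try:
            return [float(tokens[0]), float(tokens[1])]
        except ValueError:
            pass
    return [-1.0, -1.0]


def tally_battles(score_pairs: List[List[float]]) -> Dict[str, float]:
    """Count outcomes from model 1's point of view and derive a win rate."""
    counts: Dict[str, float] = {"better": 0, "worse": 0, "tie": 0, "invalid": 0}
    for s1, s2 in score_pairs:
        if s1 < 0 or s2 < 0:
            counts["invalid"] += 1
        elif s1 > s2:
            counts["better"] += 1
        elif s1 < s2:
            counts["worse"] += 1
        else:
            counts["tie"] += 1
    valid = counts["better"] + counts["worse"] + counts["tie"]
    # Assumption: win rate = wins / valid comparisons (invalid ones excluded).
    counts["win_rate"] = counts["better"] / valid if valid else 0.0
    return counts


demo_texts = [
    "Assistant 1 gets 8 out of 10, Assistant 2 gets 10 out of 10.",
    "7 9\nBoth are fine",
    "9/10 for the first and 6/10 for the second",
    "no scores here",
]
demo_pairs = [parse_score_pair(t) for t in demo_texts]
demo_counts = tally_battles(demo_pairs)
```

Trying several patterns in order and keeping an explicit invalid bucket means an unparseable verdict is recorded rather than silently counted as a loss or a tie.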

Metric-Based Evaluation

  • get_gpt_evaluation_without_logprobs - Evaluates using chat models (GPT-3.5/GPT-4) with optional reference-based two-turn conversations
  • get_gpt_evaluation_with_logprobs - Evaluates using text-davinci-003 with log probability scoring
  • evaluate - Main evaluation function with concurrent execution, caching, and retry for failed evaluations
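
The concurrent execution and retry behaviour described for evaluate can be sketched with the standard library alone. The example below is a hedged illustration rather than the module's implementation: judge_all, fake_judge, the retry count, and the shape of the result dicts are hypothetical, and the judge is a local stub so that no API call is made.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Answer = Dict[str, Any]


def judge_all(
    answers: List[Answer],
    judge: Callable[[Answer], Dict[str, Any]],
    max_workers: int = 4,
    max_retries: int = 3,
) -> List[Dict[str, Any]]:
    """Run judge over all answers in a thread pool, retrying each failure."""

    def judge_with_retry(answer: Answer) -> Dict[str, Any]:
        last_error = None
        for _ in range(max_retries):
            try:
                return judge(answer)
            except Exception as exc:  # a real client would catch narrower API errors
                last_error = exc
        # All attempts failed: record the failure instead of dropping the answer.
        return {"id": answer["id"], "score": None, "error": str(last_error)}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(judge_with_retry, answers))


# Stub judge: fails once for id 2 to exercise the retry path.
attempts: Dict[int, int] = {}


def fake_judge(answer: Answer) -> Dict[str, Any]:
    attempts[answer["id"]] = attempts.get(answer["id"], 0) + 1
    if answer["id"] == 2 and attempts[2] == 1:
        raise RuntimeError("simulated transient failure")
    return {"id": answer["id"], "score": len(answer["output"])}


demo_answers = [{"id": 1, "output": "abc"}, {"id": 2, "output": "hello"}]
demo_results = judge_all(demo_answers, fake_judge)
```

Threads suit this workload because each judgment spends most of its time waiting on the network; the caching mentioned above is left out of the sketch.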

Scoring

  • calculate_scores_form_logprobs - Computes weighted score from log probabilities (scores 1-5)
  • calculate_scores_form_response - Extracts integer score from plain text response
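
The two scoring paths can be illustrated as follows. For log probabilities, one natural reading of "weighted score" is the probability-weighted mean of the scores 1 to 5, that is, the sum of s * p(s) divided by the sum of p(s), where p(s) = exp(logprob) for tokens that denote score s. This is a sketch under that assumption; the function names, the token-matching rule, and the fallback values (None and 0) are hypothetical and may differ from the module.

```python
import math
import re
from typing import Dict, Optional


def expected_score_from_logprobs(top_logprobs: Dict[str, float]) -> Optional[float]:
    """Probability-weighted mean over scores 1-5, or None if no score token appears."""
    prob = [0.0] * 5
    for token, logprob in top_logprobs.items():
        digits = re.findall(r"\d", token)
        # Only tokens containing exactly one digit in the 1-5 range count as scores.
        if len(digits) == 1 and 1 <= int(digits[0]) <= 5:
            prob[int(digits[0]) - 1] += math.exp(logprob)
    total = sum(prob)
    if total == 0:
        return None
    return sum(score * p for score, p in zip(range(1, 6), prob)) / total


def score_from_text(response: str) -> int:
    """Return the first standalone digit 1-5 in the response, or 0 if there is none."""
    match = re.search(r"\b([1-5])\b", response)
    return int(match.group(1)) if match else 0


# Two score tokens with equal probability average to the midpoint between them.
demo_weighted = expected_score_from_logprobs({"4": math.log(0.5), " 5": math.log(0.5)})
demo_no_score = expected_score_from_logprobs({"the": -0.1})
demo_plain = score_from_text("Score: 4")
demo_missing = score_from_text("no digits")
```

Normalizing by the total probability mass on score tokens keeps the result within 1 to 5 even when the top log probabilities include non-score tokens.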

Statistics and Visualization

  • save_gpt_evaluation_statistics - Generates per-category statistics (average, best 3, worst 3)
  • analyze_gpt_evaluation_statistics - Creates CSV tables and seaborn bar chart visualizations
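
The per-category statistics (average, best 3, worst 3) amount to a sort and a slice per metric. The sketch below assumes a hypothetical record shape, {"id": ..., "scores": {metric: value}}, and a hypothetical function name summarize_metric; the actual layout of the evaluation dicts and of the saved statistics JSON is not shown on this page.

```python
from typing import Any, Dict, List


def summarize_metric(evaluations: List[Dict[str, Any]], metric: str, k: int = 3) -> Dict[str, Any]:
    """Average score plus the ids of the k best and k worst answers for one metric."""
    if not evaluations:
        return {"avg_score": None, "best_ids": [], "worst_ids": []}
    # Highest score first; ties keep their input order because sorted() is stable.
    ranked = sorted(evaluations, key=lambda e: e["scores"][metric], reverse=True)
    scores = [e["scores"][metric] for e in ranked]
    return {
        "avg_score": sum(scores) / len(scores),
        "best_ids": [e["id"] for e in ranked[:k]],
        "worst_ids": [e["id"] for e in ranked[-k:][::-1]],  # lowest score first
    }


# Made-up scores for five answers in one category.
demo_evaluations = [
    {"id": 1, "scores": {"Correctness": 3}},
    {"id": 2, "scores": {"Correctness": 5}},
    {"id": 3, "scores": {"Correctness": 1}},
    {"id": 4, "scores": {"Correctness": 4}},
    {"id": 5, "scores": {"Correctness": 2}},
]
demo_summary = summarize_metric(demo_evaluations, "Correctness")
```

Keeping ids rather than full records in the summary keeps the statistics file small while still letting a reader look up the best and worst answers in the saved evaluation results.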
