Implementation:Hpcaitech ColossalAI GPT Evaluate

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Evaluation, GPT, NLG Metrics, Benchmarking
Last Updated	2026-02-09 00:00 GMT

Overview

gpt_evaluate.py is an evaluation module that uses GPT models (GPT-3.5-turbo, GPT-4, text-davinci-003) to assess the quality of language model outputs through battle comparisons and metric-based scoring.

Description

This module provides two evaluation paradigms. Battle mode performs a pairwise comparison of two models' answers, scored by GPT-4. Metric-based evaluation scores individual answers on dimensions such as correctness, relevance, and fluency. Two kinds of judge models are supported: chat completion models (GPT-3.5-turbo, GPT-4), which return scores as plain text, and completion models (text-davinci-003), which return log probabilities used for probabilistic scoring. The module also handles multi-turn conversations for reference-based evaluation, issues concurrent API calls via ThreadPoolExecutor, persists results as JSON, and visualizes statistics as seaborn bar charts.

Usage

Use this module as part of the ColossalEval framework to evaluate generated text quality. It is called from the evaluation pipeline to score model outputs using GPT models as judges, supporting both head-to-head model comparisons and absolute quality scoring across multiple metrics.

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalEval/colossal_eval/evaluate/gpt_evaluate.py
Lines: 1-852

Signature

def get_battle_result(sys_prompt: str, user_prompt: str, id: int,
                      max_tokens: int = 2048) -> Dict[str, Any]

def parse_battle_score(evaluation: str) -> List[float]

def battle(answer1: List[Dict], answer2: List[Dict],
           prompt_dict: Dict[str, Any]) -> List[Dict]

def save_battle_results(evaluations: List[Dict], name1: str, name2: str,
                        save_path: str) -> None

def evaluate(answers: List[Dict], prompt: Dict[str, Any], metrics: List[str],
             category: str, save_path: str, model_name: str, model: str,
             language: str, references: List[Dict] = None) -> List[Dict]

def save_gpt_evaluation_results(model_name: str,
                                gpt_evaluation_results: Dict[str, Any],
                                save_path: str) -> Dict[str, Any]

def save_gpt_evaluation_statistics(model_name: str, evaluations: List[Dict],
                                   save_path: str) -> None

def analyze_gpt_evaluation_statistics(statistics_path: str,
                                      save_path: str) -> None

Import

from colossal_eval.evaluate.gpt_evaluate import (
    battle,
    evaluate,
    save_battle_results,
    save_gpt_evaluation_results,
    save_gpt_evaluation_statistics,
    analyze_gpt_evaluation_statistics,
)

I/O Contract

Inputs (battle function)

Name	Type	Required	Description
answer1	List[Dict]	Yes	Answers from model 1, each with id, instruction, input, output, category
answer2	List[Dict]	Yes	Answers from model 2 (must match answer1 by id)
prompt_dict	Dict[str, Any]	Yes	Battle prompt containing system_prompt, prompt_template, and prompt

Inputs (evaluate function)

Name	Type	Required	Description
answers	List[Dict]	Yes	Model answers to evaluate
prompt	Dict[str, Any]	Yes	Prompt dict containing the prompt template, chain-of-thought (CoT) steps, and metric definitions
metrics	List[str]	Yes	Metrics for evaluation (e.g., Correctness, Relevance)
category	str	Yes	Category of answers being evaluated
save_path	str	Yes	Path to save evaluation results
model_name	str	Yes	Name of the model being evaluated
model	str	Yes	GPT model for judging (e.g., gpt-4, gpt-3.5-turbo, text-davinci-003)
language	str	Yes	Language code of the evaluation: en or cn
references	List[Dict]	No	Reference answers for reference-based evaluation

Outputs

Name	Type	Description
evaluations	List[Dict]	List of evaluation results with scores per metric
statistics_json	JSON file	Per-category average scores, plus the three best-scoring and three worst-scoring answers
visualization	PNG files	Bar chart visualizations comparing model scores per category
battle_results	JSON file	Win/loss/tie counts and win rates for pairwise comparison
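
The win/loss/tie counts in battle_results can be illustrated with a minimal sketch. It is not the module's code: the helper name tally_battle, the use of negative scores to mark unparseable judgments, and the choice to divide wins by all valid comparisons (ties included) are assumptions made for illustration.

```python
from typing import Dict, List


def tally_battle(score_pairs: List[List[float]]) -> Dict[str, float]:
    """Count outcomes for model 1 versus model 2 from [score1, score2] pairs.

    Hypothetical helper: a negative score is assumed to mean the judge's
    output could not be parsed, so the pair is counted as invalid.
    """
    counts = {"better": 0, "worse": 0, "tie": 0, "invalid": 0}
    for score1, score2 in score_pairs:
        if score1 < 0 or score2 < 0:
            counts["invalid"] += 1
        elif score1 > score2:
            counts["better"] += 1
        elif score1 < score2:
            counts["worse"] += 1
        else:
            counts["tie"] += 1

    # Assumption: win rate is wins over all valid comparisons, ties included.
    valid = counts["better"] + counts["worse"] + counts["tie"]
    result: Dict[str, float] = dict(counts)
    result["win_rate"] = counts["better"] / valid if valid else 0.0
    return result
```

Invalid pairs are kept out of the denominator here so that parsing failures do not lower either model's win rate.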

Usage Examples

from colossal_eval.evaluate.gpt_evaluate import battle, save_battle_results, evaluate

# Pairwise battle evaluation.
# model1_answers, model2_answers, and prompt_dict are placeholders loaded elsewhere.
evaluations = battle(model1_answers, model2_answers, prompt_dict)
save_battle_results(evaluations, "model1", "model2", "./results")

# Metric-based evaluation with reference answers.
# model_answers, prompt_config, and reference_answers are placeholders loaded elsewhere.
results = evaluate(
    answers=model_answers,
    prompt=prompt_config,
    metrics=["Correctness", "Relevance", "Fluency"],
    category="general",
    save_path="./eval_results",
    model_name="my_model",
    model="gpt-4",
    language="en",
    references=reference_answers,
)

Key Functions

Battle Mode

get_battle_result - Gets a single pairwise comparison from GPT-4 with retry logic
parse_battle_score - Parses scores from GPT-4 evaluation text using multiple regex patterns
battle - Orchestrates concurrent pairwise comparisons across all answer pairs
save_battle_results - Computes win rates and saves categorized results (better, worse, tie, invalid)
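
Trying several regex patterns in turn, as parse_battle_score is described as doing, might look like the sketch below. The function name parse_two_scores, the two patterns, and the [-1.0, -1.0] sentinel for unparseable text are assumptions; the actual patterns in gpt_evaluate.py may differ.

```python
import re
from typing import List

# Hypothetical patterns, tried in order; the real module may use others.
_NUMBER = r"(\d+(?:\.\d+)?)"
_PATTERNS = [
    # Two numbers alone on a line, e.g. "8 7.5"
    rf"^\s*{_NUMBER}\s+{_NUMBER}\s*$",
    # Labelled scores, e.g. "Score 1: 9 ... Score 2: 6"
    rf"[Ss]core\s*1\s*:\s*{_NUMBER}.*?[Ss]core\s*2\s*:\s*{_NUMBER}",
]


def parse_two_scores(evaluation: str) -> List[float]:
    """Return [score1, score2] from judge text, or [-1.0, -1.0] if none match."""
    for pattern in _PATTERNS:
        match = re.search(pattern, evaluation, flags=re.MULTILINE | re.DOTALL)
        if match:
            return [float(match.group(1)), float(match.group(2))]
    return [-1.0, -1.0]
```

Returning a sentinel instead of raising lets the caller classify the comparison as invalid and continue with the remaining pairs.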

Metric-Based Evaluation

get_gpt_evaluation_without_logprobs - Evaluates using chat models (GPT-3.5/GPT-4) with optional reference-based two-turn conversations
get_gpt_evaluation_with_logprobs - Evaluates using text-davinci-003 with log probability scoring
evaluate - Main evaluation function with concurrent execution, caching, and retry for failed evaluations
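
The combination of concurrent execution and retries can be sketched with the standard library alone. Everything here is illustrative: run_concurrently and the judge callable are hypothetical names, the judge is assumed to return a dict containing the item's id, and the retry count is arbitrary. No real API is called.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List


def run_concurrently(
    items: List[Dict[str, Any]],
    judge: Callable[[Dict[str, Any]], Dict[str, Any]],
    max_workers: int = 4,
    max_retries: int = 3,
) -> List[Dict[str, Any]]:
    """Apply judge to every item in a thread pool, retrying failed calls."""

    def call_with_retry(item: Dict[str, Any]) -> Dict[str, Any]:
        last_error = None
        for _ in range(max_retries):
            try:
                return judge(item)
            except Exception as exc:  # e.g. a transient network or rate-limit error
                last_error = exc
        # All attempts failed: record the error instead of aborting the batch.
        return {"id": item["id"], "error": str(last_error)}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(call_with_retry, item) for item in items]
        results = [future.result() for future in as_completed(futures)]

    # Futures complete in arbitrary order, so restore a stable order by id.
    return sorted(results, key=lambda r: r["id"])
```

Threads suit this workload because each call spends most of its time waiting on the network rather than using the CPU.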

Scoring

calculate_scores_form_logprobs - Computes weighted score from log probabilities (scores 1-5)
calculate_scores_form_response - Extracts integer score from plain text response
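
The two scoring strategies can be sketched as follows. The helper names are hypothetical, and two details are assumptions rather than confirmed behavior of the module: that probabilities over the tokens "1" to "5" are renormalized before weighting, and that 0 signals a missing score.

```python
import math
import re
from typing import Dict


def expected_score_from_logprobs(top_logprobs: Dict[str, float]) -> float:
    """Probability-weighted score over the tokens "1".."5".

    Assumption: probabilities are renormalized over the score tokens that
    appear, and 0.0 is returned when none of them appear.
    """
    weights: Dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token in {"1", "2", "3", "4", "5"}:
            weights[int(token)] = weights.get(int(token), 0.0) + math.exp(logprob)
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(score * prob for score, prob in weights.items()) / total


def score_from_text(response: str) -> int:
    """Return the first standalone digit 1-5 in the response, or 0 if none."""
    match = re.search(r"\b([1-5])\b", response)
    return int(match.group(1)) if match else 0
```

For example, if the judge assigns probability 0.75 to "4" and 0.25 to "5", the weighted score is 4.25, a finer-grained value than the single integer a chat model returns.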

Statistics and Visualization

save_gpt_evaluation_statistics - Generates per-category statistics (average, best 3, worst 3)
analyze_gpt_evaluation_statistics - Creates CSV tables and seaborn bar chart visualizations
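
The per-category statistics (average, best 3, worst 3) amount to a sort and a mean. The sketch below assumes a hypothetical record layout in which each evaluation carries an id and a scores dict keyed by metric name; the layout actually written by save_gpt_evaluation_statistics may differ.

```python
from typing import Any, Dict, List


def summarize_metric(evaluations: List[Dict[str, Any]], metric: str) -> Dict[str, Any]:
    """Average score plus ids of the three best and three worst answers.

    Assumed record shape: {"id": int, "scores": {metric_name: number}}.
    """
    if not evaluations:
        return {"avg_score": 0.0, "best_3": [], "worst_3": []}

    # Highest score first.
    ranked = sorted(evaluations, key=lambda e: e["scores"][metric], reverse=True)
    avg = sum(e["scores"][metric] for e in ranked) / len(ranked)
    return {
        "avg_score": avg,
        "best_3": [e["id"] for e in ranked[:3]],
        # Reverse so the lowest-scoring answer is listed first.
        "worst_3": [e["id"] for e in ranked[::-1][:3]],
    }
```

Keeping the ids of the extremes, rather than only the mean, makes it easy to inspect the specific answers that drive a category's score.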

Related Pages

Environment:Hpcaitech_ColossalAI_CUDA_GPU_Environment
