Implementation:Hpcaitech ColossalAI GPT Evaluate
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, GPT, NLG Metrics, Benchmarking |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
gpt_evaluate.py is an evaluation module that uses GPT models (GPT-3.5-turbo, GPT-4, text-davinci-003) to assess the quality of language model outputs through battle comparisons and metric-based scoring.
Description
This module provides two evaluation paradigms: battle mode, a pairwise comparison of two models' answers scored by GPT-4, and metric-based evaluation, which scores individual answers on dimensions such as correctness, relevance, and fluency. It supports chat completion models (GPT-3.5-turbo, GPT-4), which return plain-text scores, and completion models (text-davinci-003), which return log probabilities for probabilistic scoring. The module also handles multi-turn conversations for reference-based evaluation, concurrent API calls via ThreadPoolExecutor, result persistence in JSON format, and visualization of statistics as seaborn bar charts.
Usage
Use this module as part of the ColossalEval framework to evaluate generated text quality. It is called from the evaluation pipeline to score model outputs using GPT models as judges, supporting both head-to-head model comparisons and absolute quality scoring across multiple metrics.
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalEval/colossal_eval/evaluate/gpt_evaluate.py
- Lines: 1-852
Signature
def get_battle_result(sys_prompt: str, user_prompt: str, id: int,
                      max_tokens: int = 2048) -> Dict[str, Any]
def parse_battle_score(evaluation: str) -> List[float]
def battle(answer1: List[Dict], answer2: List[Dict],
           prompt_dict: Dict[str, Any]) -> List[Dict]
def save_battle_results(evaluations: List[Dict], name1: str, name2: str,
                        save_path: str) -> None
def evaluate(answers: List[Dict], prompt: Dict[str, Any], metrics: List[str],
             category: str, save_path: str, model_name: str, model: str,
             language: str, references: List[Dict] = None) -> List[Dict]
def save_gpt_evaluation_results(model_name: str,
                                gpt_evaluation_results: Dict[str, Any],
                                save_path: str) -> Dict[str, Any]
def save_gpt_evaluation_statistics(model_name: str, evaluations: List[Dict],
                                   save_path: str) -> None
def analyze_gpt_evaluation_statistics(statistics_path: str,
                                      save_path: str) -> None
Import
from colossal_eval.evaluate.gpt_evaluate import (
    battle,
    evaluate,
    save_battle_results,
    save_gpt_evaluation_results,
    save_gpt_evaluation_statistics,
    analyze_gpt_evaluation_statistics,
)
I/O Contract
Inputs (battle function)
| Name | Type | Required | Description |
|---|---|---|---|
| answer1 | List[Dict] | Yes | Answers from model 1, each with id, instruction, input, output, category |
| answer2 | List[Dict] | Yes | Answers from model 2 (must match answer1 by id) |
| prompt_dict | Dict[str, Any] | Yes | Battle prompt containing system_prompt, prompt_template, and prompt |
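The requirement that answer2 match answer1 by id can be checked before any API call is made. The following is a minimal sketch of such a check, not the module's own code: the helper name `pair_answers_by_id` and the sample records are hypothetical, and only the field names come from the table above.

```python
from typing import Any, Dict, List, Tuple


def pair_answers_by_id(
    answer1: List[Dict[str, Any]], answer2: List[Dict[str, Any]]
) -> List[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Pair each model 1 answer with the model 2 answer sharing its id.

    Hypothetical helper: raises ValueError if the two id sets differ.
    """
    second_by_id = {ans["id"]: ans for ans in answer2}
    first_ids = {ans["id"] for ans in answer1}
    if first_ids != set(second_by_id):
        mismatched = sorted(first_ids ^ set(second_by_id))
        raise ValueError(f"Answer ids do not match: {mismatched}")
    return [(ans, second_by_id[ans["id"]]) for ans in answer1]


# Illustrative records using the fields listed in the table above.
model1_answers = [
    {"id": 1, "instruction": "Name a prime number.", "input": "",
     "output": "7", "category": "math"},
]
model2_answers = [
    {"id": 1, "instruction": "Name a prime number.", "input": "",
     "output": "9", "category": "math"},
]
pairs = pair_answers_by_id(model1_answers, model2_answers)
```

Running a check like this first avoids spending judge-model calls on a pair of answer files that were generated from different question sets.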
Inputs (evaluate function)
| Name | Type | Required | Description |
|---|---|---|---|
| answers | List[Dict] | Yes | Model answers to evaluate |
| prompt | Dict[str, Any] | Yes | Prompt dict with prompt template, CoT, and metrics |
| metrics | List[str] | Yes | Metrics for evaluation (e.g., Correctness, Relevance) |
| category | str | Yes | Category of answers being evaluated |
| save_path | str | Yes | Path to save evaluation results |
| model_name | str | Yes | Name of the model being evaluated |
| model | str | Yes | GPT model for judging (e.g., gpt-4, gpt-3.5-turbo, text-davinci-003) |
| language | str | Yes | Language: en or cn |
| references | List[Dict] | No | Reference answers for reference-based evaluation |
Outputs
| Name | Type | Description |
|---|---|---|
| evaluations | List[Dict] | List of evaluation results with scores per metric |
| statistics_json | JSON file | Per-category average scores, best 3, worst 3 |
| visualization | PNG files | Bar chart visualizations comparing model scores per category |
| battle_results | JSON file | Win/loss/tie counts and win rates for pairwise comparison |
Usage Examples
from colossal_eval.evaluate.gpt_evaluate import battle, save_battle_results, evaluate

# model1_answers, model2_answers, model_answers, and reference_answers (List[Dict]),
# as well as prompt_dict and prompt_config (Dict), are assumed to be loaded beforehand.

# Pairwise battle evaluation
evaluations = battle(model1_answers, model2_answers, prompt_dict)
save_battle_results(evaluations, "model1", "model2", "./results")

# Metric-based evaluation with reference answers
results = evaluate(
    answers=model_answers,
    prompt=prompt_config,
    metrics=["Correctness", "Relevance", "Fluency"],
    category="general",
    save_path="./eval_results",
    model_name="my_model",
    model="gpt-4",
    language="en",
    references=reference_answers,
)
Key Functions
Battle Mode
- get_battle_result - Gets a single pairwise comparison from GPT-4 with retry logic
- parse_battle_score - Parses scores from GPT-4 evaluation text using multiple regex patterns
- battle - Orchestrates concurrent pairwise comparisons across all answer pairs
- save_battle_results - Computes win rates and saves categorized results (better, worse, tie, invalid)
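To illustrate how score parsing and result categorization can fit together, here is a rough sketch under stated assumptions. The regex patterns, the helper names `parse_pair_scores` and `tally_battles`, and the win-rate formula (wins divided by valid comparisons) are all hypothetical; this page does not show the patterns or formula the module actually uses.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical patterns, tried in order until one matches.
_PATTERNS = [
    # A line holding just two numbers, e.g. "8 7" or "8, 7".
    re.compile(r"^\s*(\d+(?:\.\d+)?)[\s,]+(\d+(?:\.\d+)?)\s*$", re.MULTILINE),
    # Labeled scores, e.g. "Assistant 1: 6 ... Assistant 2: 9".
    re.compile(r"Assistant 1:\s*(\d+(?:\.\d+)?).*?Assistant 2:\s*(\d+(?:\.\d+)?)",
               re.DOTALL),
]


def parse_pair_scores(evaluation: str) -> List[float]:
    """Return [score1, score2], or [] when no pattern matches."""
    for pattern in _PATTERNS:
        match = pattern.search(evaluation)
        if match:
            return [float(match.group(1)), float(match.group(2))]
    return []


def tally_battles(score_pairs: List[List[float]]) -> Dict[str, float]:
    """Count better/worse/tie/invalid from model 1's point of view."""
    counts = Counter({"better": 0, "worse": 0, "tie": 0, "invalid": 0})
    for pair in score_pairs:
        if len(pair) != 2:
            counts["invalid"] += 1
        elif pair[0] > pair[1]:
            counts["better"] += 1
        elif pair[0] < pair[1]:
            counts["worse"] += 1
        else:
            counts["tie"] += 1
    valid = counts["better"] + counts["worse"] + counts["tie"]
    result: Dict[str, float] = dict(counts)
    # Assumed definition of win rate: wins over valid comparisons.
    result["win_rate"] = counts["better"] / valid if valid else 0.0
    return result


texts = ["8 7\nAssistant 1 was more precise.",
         "Assistant 1: 6\nAssistant 2: 9",
         "No scores were given."]
summary = tally_battles([parse_pair_scores(text) for text in texts])
```

Trying several patterns in sequence is a common way to cope with a judge model that does not always follow the requested output format; unparseable replies fall into the invalid bucket rather than being silently dropped.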
Metric-Based Evaluation
- get_gpt_evaluation_without_logprobs - Evaluates using chat models (GPT-3.5/GPT-4) with optional reference-based two-turn conversations
- get_gpt_evaluation_with_logprobs - Evaluates using text-davinci-003 with log probability scoring
- evaluate - Main evaluation function with concurrent execution, caching, and retry for failed evaluations
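The combination of concurrent execution and retry described above might look roughly like the following. This is an illustrative sketch only: `evaluate_all`, `stub_judge`, and the record fields are hypothetical names, and the stub stands in for a real API call so that the example runs offline.

```python
import concurrent.futures
from typing import Any, Callable, Dict, List


def evaluate_all(
    items: List[Dict[str, Any]],
    judge: Callable[[Dict[str, Any]], Dict[str, Any]],
    max_workers: int = 4,
    max_retries: int = 3,
) -> List[Dict[str, Any]]:
    """Run judge on every item in a thread pool, retrying failed calls."""

    def call_with_retry(item: Dict[str, Any]) -> Dict[str, Any]:
        last_error = None
        for _ in range(max_retries):
            try:
                return judge(item)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
        # Give up and record the failure instead of aborting the whole run.
        return {"id": item["id"], "score": None, "error": str(last_error)}

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(call_with_retry, item) for item in items]
        results = [f.result() for f in concurrent.futures.as_completed(futures)]
    # Completion order is nondeterministic, so restore a stable order by id.
    return sorted(results, key=lambda result: result["id"])


attempts: Dict[int, int] = {}


def stub_judge(item: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for an API call: fails once per item, then succeeds."""
    attempts[item["id"]] = attempts.get(item["id"], 0) + 1
    if attempts[item["id"]] == 1:
        raise RuntimeError("simulated transient failure")
    return {"id": item["id"], "score": len(item["output"])}


results = evaluate_all(
    [{"id": 2, "output": "abc"}, {"id": 1, "output": "hello"}], stub_judge
)
```

Threads suit this workload because each call spends most of its time waiting on the network; sorting by id afterwards keeps saved results comparable between runs.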
Scoring
- calculate_scores_form_logprobs - Computes weighted score from log probabilities (scores 1-5)
- calculate_scores_form_response - Extracts integer score from plain text response
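The two scoring paths can be sketched as follows. This is a hedged approximation rather than the module's code: the function names `expected_score_from_logprobs` and `score_from_text` are hypothetical, the input is assumed to be a mapping from candidate tokens to log probabilities, and the weighted score is taken to be the probability-weighted sum of the scores 1 to 5.

```python
import math
import re
from typing import Dict

VALID_SCORES = {"1", "2", "3", "4", "5"}


def expected_score_from_logprobs(top_logprobs: Dict[str, float]) -> float:
    """Probability-weighted sum of scores 1-5 (assumed input format)."""
    prob = [0.0] * 5
    for token, logprob in top_logprobs.items():
        digit = token.strip()
        # Ignore tokens that are not a single score digit, e.g. "The" or "10".
        if digit in VALID_SCORES:
            prob[int(digit) - 1] += math.exp(logprob)
    return sum(score * p for score, p in zip(range(1, 6), prob))


def score_from_text(response: str) -> int:
    """Extract the single 1-5 digit from a plain-text reply.

    Raises ValueError if the reply has no digit, several digits,
    or a digit outside the 1-5 range.
    """
    digits = re.findall(r"\d", response)
    if len(digits) != 1 or digits[0] not in VALID_SCORES:
        raise ValueError(f"Cannot extract a single 1-5 score from: {response!r}")
    return int(digits[0])


weighted = expected_score_from_logprobs(
    {"4": math.log(0.5), " 5": math.log(0.5), "The": math.log(0.01)}
)
plain = score_from_text("Score: 4")
```

In this sketch the probabilities are not renormalized, so any probability mass on non-score tokens lowers the result. The log-probability path yields a continuous score, which can separate answers that would receive the same integer from a plain-text reply.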
Statistics and Visualization
- save_gpt_evaluation_statistics - Generates per-category statistics (average, best 3, worst 3)
- analyze_gpt_evaluation_statistics - Creates CSV tables and seaborn bar chart visualizations
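As a rough illustration of the per-category statistics (average, best 3, worst 3), the sketch below ranks evaluation records by one metric. The record shape (`id` plus a `scores` dict), the output keys, and the helper name `summarize_metric` are assumptions made for this example; the JSON layout the module writes is not shown on this page.

```python
from statistics import mean
from typing import Any, Dict, List


def summarize_metric(
    evaluations: List[Dict[str, Any]], metric: str, k: int = 3
) -> Dict[str, Any]:
    """Average score plus the ids of the k best and k worst answers."""
    ranked = sorted(evaluations, key=lambda e: e["scores"][metric], reverse=True)
    return {
        "avg_score": mean(e["scores"][metric] for e in ranked),
        "best": [e["id"] for e in ranked[:k]],
        # Reverse the tail so the lowest-scoring answer comes first.
        "worst": [e["id"] for e in ranked[-k:][::-1]],
    }


# Hypothetical evaluation records for one category.
sample = [
    {"id": 1, "scores": {"Correctness": 5}},
    {"id": 2, "scores": {"Correctness": 3}},
    {"id": 3, "scores": {"Correctness": 4}},
    {"id": 4, "scores": {"Correctness": 1}},
    {"id": 5, "scores": {"Correctness": 2}},
]
stats = summarize_metric(sample, "Correctness")
```

Keeping the ids of the best and worst answers alongside the average makes it easy to inspect concrete examples when a category's score looks surprising.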