Implementation:Hpcaitech ColossalAI GPT Judge
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
gpt_judge is a module that implements GPT-4-based single-answer judging for the MT-Bench benchmark, using the OpenAI API to score model responses on multi-turn conversation tasks.
Description
The module provides functions for loading MT-Bench judge prompts, calling the OpenAI ChatCompletion API with retry logic, and evaluating model responses per question using concurrent thread execution. The get_mtbench_judgements function constructs a GPT-4 judge prompt for each turn of a multi-turn conversation and supports both regular categories and the math, reasoning, and coding categories, which require reference answers. Scores are extracted from the judge's response using regex patterns that match the "[[score]]" or "[score]" formats. The main entry point, mtbench_single_judge, processes all questions in parallel with up to 4 threads and returns the scored data along with average ratings.
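The score-extraction step can be illustrated with a minimal sketch. It assumes the judge writes its rating in double brackets (for example `[[8]]`) and that a single-bracket form (`[8]`) is accepted as a fallback. The pattern strings, the helper name `extract_score`, and the `-1.0` sentinel are illustrative assumptions, not taken from the module.

```python
import re

# Assumed patterns: try the double-bracket form first, then the single-bracket fallback.
PRIMARY_PATTERN = re.compile(r"\[\[(\d+\.?\d*)\]\]")
BACKUP_PATTERN = re.compile(r"\[(\d+\.?\d*)\]")


def extract_score(judgement: str) -> float:
    """Return the first bracketed score in the judge's text, or -1.0 if none is found."""
    match = PRIMARY_PATTERN.search(judgement) or BACKUP_PATTERN.search(judgement)
    # -1.0 is a hypothetical sentinel for "no score could be parsed".
    return float(match.group(1)) if match else -1.0
```

Trying the stricter pattern first keeps an incidental single-bracket number elsewhere in the explanation from being mistaken for the rating when a double-bracket rating is also present.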
Usage
Use this module when you need to evaluate MT-Bench model outputs using GPT-4 as a judge within the ColossalEval framework. It requires an OpenAI API key and access to the GPT-4 model.
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalEval/colossal_eval/evaluate/dataset_evaluator/gpt_judge.py
- Lines: 1-152
Signature
def load_mt_prompts(prompt_file: str):
def get_mt_prompt(prompts: Dict[str, str], multiturn: bool, math: bool):
def chat_compeletion_openai(messages: List[Dict], temperature: float = 0.0, max_tokens: int = 2048):
def get_mtbench_judgements(question: Dict[str, Any], prompts: Dict[str, str]):
def mtbench_single_judge(data: List[Dict], config_path: str):
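The retry behaviour mentioned in the description is not spelled out on this page. A generic retry loop of that kind might look like the sketch below, which wraps any zero-argument callable so it can be run without network access. The helper name `call_with_retry`, the retry count, and the sleep interval are assumptions for illustration; the actual module may catch a specific OpenAI error type and use different limits.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(request: Callable[[], T], max_retries: int = 3, sleep_seconds: float = 0.0) -> T:
    """Call `request` until it succeeds or `max_retries` attempts have been used."""
    last_error: Optional[Exception] = None
    for _ in range(max_retries):
        try:
            return request()
        except Exception as exc:  # a real client would catch the API's own error class
            last_error = exc
            time.sleep(sleep_seconds)  # wait before the next attempt
    raise RuntimeError(f"request failed after {max_retries} attempts") from last_error
```

In practice `request` would be a closure that sends the judge messages to the chat API with the chosen temperature and max_tokens.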
Import
from colossal_eval.evaluate.dataset_evaluator.gpt_judge import mtbench_single_judge
I/O Contract
Inputs (mtbench_single_judge)
| Name | Type | Required | Description |
|---|---|---|---|
| data | List[Dict] | Yes | List of MT-Bench question dictionaries with fields id, category, instruction (list of turns), output (list of responses), and target (list of reference answers) |
| config_path | str | Yes | Path to the config file; the directory of this path is used to locate "mtbench_judge_prompts.jsonl" |
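To make the input contract concrete, the sketch below builds one hypothetical record with the fields listed in the table and shows how the prompt file path could be derived from config_path. The field values, the helper names `prompt_file_for` and `check_record`, and the validation rules are illustrative assumptions; only the field names and the prompt file name come from the table above.

```python
import os
from typing import Any, Dict

# Categories that the description says require reference answers.
NEEDS_REFERENCE = {"math", "reasoning", "coding"}

# Hypothetical record: field names follow the inputs table, values are invented.
sample: Dict[str, Any] = {
    "id": 1,
    "category": "math",
    "instruction": ["What is 12 * 7?", "Now divide the result by 4."],
    "output": ["12 * 7 = 84.", "84 / 4 = 21."],
    "target": ["84", "21"],
}


def prompt_file_for(config_path: str) -> str:
    """Locate the judge prompt file in the same directory as the config file."""
    return os.path.join(os.path.dirname(config_path), "mtbench_judge_prompts.jsonl")


def check_record(record: Dict[str, Any]) -> None:
    """Raise ValueError if the per-turn lists do not line up."""
    turns = len(record["instruction"])
    if len(record["output"]) != turns:
        raise ValueError("output must contain one response per turn")
    if record["category"] in NEEDS_REFERENCE and len(record["target"]) != turns:
        raise ValueError("reference-based categories need one target per turn")
```

Running a check of this kind before judging avoids spending API calls on malformed records.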
Outputs (mtbench_single_judge)
| Name | Type | Description |
|---|---|---|
| data_to_dump | List[Dict] | The input data augmented with "judgements" (list of judge response strings per turn) and "ratings" (list of numeric scores per turn) |
| avg_ratings | numpy.ndarray | Average ratings across all questions, one value per turn |
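The relationship between the per-question "ratings" lists and avg_ratings can be sketched as a column-wise mean. This is a minimal illustration that assumes every question has the same number of turns; the records and the helper name `average_ratings` are hypothetical, and the module may compute the average differently.

```python
import numpy as np

# Hypothetical scored records; only the "ratings" field matters here.
scored = [
    {"id": 1, "ratings": [8.0, 6.0]},
    {"id": 2, "ratings": [10.0, 4.0]},
]


def average_ratings(records):
    """Average ratings across questions, yielding one value per turn."""
    ratings = np.array([r["ratings"] for r in records], dtype=float)  # shape: (questions, turns)
    return ratings.mean(axis=0)


avg = average_ratings(scored)
```

With the two records above, the first-turn average is (8 + 10) / 2 and the second-turn average is (6 + 4) / 2.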
Usage Examples
from colossal_eval.evaluate.dataset_evaluator.gpt_judge import mtbench_single_judge
# data is a list of MT-Bench question dicts with model outputs
scored_data, avg_ratings = mtbench_single_judge(data, config_path="/path/to/config.json")
print(f"Average turn 1 rating: {avg_ratings[0]}")
print(f"Average turn 2 rating: {avg_ratings[1]}")
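For readers without API access, the parallel fan-out can be tried with a stub in place of the GPT-4 call. The sketch below assumes a thread pool capped at 4 workers, matching the description; `stub_judge` and `judge_all` are hypothetical names, and the stub's scoring rule is arbitrary and exists only so the example is deterministic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def stub_judge(question: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the real judge: derive a fake rating from each answer's length."""
    ratings = [float(len(answer) % 10) for answer in question["output"]]
    return {**question, "ratings": ratings}


def judge_all(questions: List[Dict[str, Any]], max_workers: int = 4) -> List[Dict[str, Any]]:
    """Judge all questions concurrently; executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(stub_judge, questions))
```

Threads suit this workload because each judgement spends most of its time waiting on a network response rather than computing.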