

Implementation:Hpcaitech ColossalAI GPT Judge

From Leeroopedia


Knowledge Sources
Domains: Evaluation, Benchmarking
Last Updated: 2026-02-09 00:00 GMT

Overview

gpt_judge is a module that implements GPT-4-based single-answer judging for the MT-Bench benchmark, using the OpenAI API to score model responses on multi-turn conversation tasks.

Description

The module provides functions for loading the MT-Bench judge prompts, calling the OpenAI ChatCompletion API with retry logic, and evaluating the model responses for each question concurrently. The get_mtbench_judgements function builds a GPT-4 judge prompt for each turn of a multi-turn conversation. It handles both the regular categories and the math, reasoning, and coding categories, which require reference answers. Scores are extracted from the judge's response with regular expressions that match the "[[score]]" or "[score]" format. The main entry point, mtbench_single_judge, processes all questions in parallel with up to 4 threads and returns the scored data along with the average ratings.
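
The score-extraction step can be illustrated with a minimal sketch. It assumes the judge writes its rating in double brackets (for example "[[8]]") and that a single-bracket form (for example "[8]") serves as a fallback. The names PRIMARY_PATTERN, FALLBACK_PATTERN, and extract_score are hypothetical and are not taken from the module.

```python
import re
from typing import Optional

# Assumed formats: a double-bracketed score such as "[[8]]" is tried first,
# then a single-bracketed score such as "[8]" as a fallback.
PRIMARY_PATTERN = re.compile(r"\[\[(\d+\.?\d*)\]\]")
FALLBACK_PATTERN = re.compile(r"\[(\d+\.?\d*)\]")


def extract_score(judgement: str) -> Optional[float]:
    """Return the first bracketed score in a judge response, or None if absent."""
    match = PRIMARY_PATTERN.search(judgement) or FALLBACK_PATTERN.search(judgement)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    print(extract_score("The answer is helpful. Rating: [[8]]"))
    print(extract_score("Rating: [7.5]"))
    print(extract_score("No rating was given."))
```

Returning None for an unparseable response is a choice made for this sketch; the actual module may signal a missing score differently.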

Usage

Use this module when you need to evaluate MT-Bench model outputs using GPT-4 as a judge within the ColossalEval framework. It requires an OpenAI API key and access to the GPT-4 model.

Code Reference

Source Location

Signature

def load_mt_prompts(prompt_file: str):
def get_mt_prompt(prompts: Dict[str, str], multiturn: bool, math: bool):
def chat_compeletion_openai(messages: List[Dict], temperature: float = 0.0, max_tokens: int = 2048):
def get_mtbench_judgements(question: Dict[str, Any], prompts: Dict[str, str]):
def mtbench_single_judge(data: List[Dict], config_path: str):
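
The retry and concurrency behavior described for chat_compeletion_openai and mtbench_single_judge could look roughly like the following sketch. It does not call the OpenAI API; the helpers call_with_retry and judge_all, the ERROR_OUTPUT sentinel, and the retry count are all assumptions made for illustration. Only the limit of 4 worker threads comes from the description above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List

# Hypothetical sentinel returned when every attempt fails.
ERROR_OUTPUT = "$ERROR$"


def call_with_retry(fn: Callable[[], Any], max_retries: int = 3, sleep_seconds: float = 0.0) -> Any:
    """Call fn, retrying after any exception; return ERROR_OUTPUT if all attempts fail."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # a real client would catch API-specific errors
            print(f"attempt {attempt + 1} failed: {exc}")
            time.sleep(sleep_seconds)
    return ERROR_OUTPUT


def judge_all(questions: List[Any], judge_one: Callable[[Any], Any], max_workers: int = 4) -> List[Any]:
    """Apply judge_one to every question on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(judge_one, questions))
```

Threads suit this workload because each judgement spends most of its time waiting on a network response rather than computing.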

Import

from colossal_eval.evaluate.dataset_evaluator.gpt_judge import mtbench_single_judge

I/O Contract

Inputs (mtbench_single_judge)

Name | Type | Required | Description
data | List[Dict] | Yes | List of MT-Bench question dictionaries with the fields id, category, instruction (list of turns), output (list of responses), and target (list of reference answers)
config_path | str | Yes | Path to the config file; the directory containing this file is used to locate "mtbench_judge_prompts.jsonl"
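
To make the input contract concrete, here is a sketch of one input record and of how the prompt file location follows from config_path. The helpers validate_record and judge_prompt_path are hypothetical, the example values are placeholders, and the check that instruction and output have equal length is an assumption rather than documented behavior.

```python
import os
from typing import Any, Dict, List

REQUIRED_FIELDS = ("id", "category", "instruction", "output", "target")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if it looks well-formed)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    # Assumption: one model response is expected per instruction turn.
    if not problems and len(record["instruction"]) != len(record["output"]):
        problems.append("instruction and output should have one entry per turn")
    return problems


def judge_prompt_path(config_path: str) -> str:
    """Build the prompt file path from the directory that holds the config file."""
    return os.path.join(os.path.dirname(config_path), "mtbench_judge_prompts.jsonl")


# Placeholder record; real records carry actual model responses.
example_record = {
    "id": 1,
    "category": "writing",
    "instruction": ["Write a haiku about autumn.", "Now rewrite it as a limerick."],
    "output": ["<model response to turn 1>", "<model response to turn 2>"],
    "target": [],  # assumption: reference answers matter for math/reasoning/coding
}
```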

Outputs (mtbench_single_judge)

Name | Type | Description
data_to_dump | List[Dict] | The input data augmented with "judgements" (list of judge response strings, one per turn) and "ratings" (list of numeric scores, one per turn)
avg_ratings | numpy.ndarray | Average ratings across all questions, one value per turn
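
Since each returned record keeps its category alongside its ratings, the scored data can be broken down further. The sketch below is a hypothetical post-processing helper, not part of the module; it assumes each record has the "category" and "ratings" fields described above and that every rating is numeric.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List


def mean_rating_by_category(scored: List[Dict[str, Any]]) -> Dict[str, float]:
    """Average all per-turn ratings within each category."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for record in scored:
        buckets[record["category"]].extend(record["ratings"])
    # Skip categories with no ratings to avoid averaging an empty list.
    return {category: mean(values) for category, values in buckets.items() if values}
```

This complements avg_ratings, which averages per turn across all questions regardless of category.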

Usage Examples

from colossal_eval.evaluate.dataset_evaluator.gpt_judge import mtbench_single_judge

# data is a list of MT-Bench question dicts with model outputs
scored_data, avg_ratings = mtbench_single_judge(data, config_path="/path/to/config.json")
print(f"Average turn 1 rating: {avg_ratings[0]}")
print(f"Average turn 2 rating: {avg_ratings[1]}")
