Implementation:Hpcaitech ColossalAI MTBenchDataset

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Evaluation, Benchmarking
Last Updated	2026-02-09 00:00 GMT

Overview

MTBenchDataset is a dataset wrapper class that loads and converts the MT-Bench multi-turn conversation benchmark into the ColossalEval inference format, supporting multi-turn dialogue evaluation with GPT-4 reference answers.

Description

The class extends BaseDataset and overrides __init__ to set its multiturn flag to True. Its static load method reads "question.jsonl" and "reference_answer/gpt-4.jsonl" from the data directory. Each question carries a category label and a list of per-turn instructions; GPT-4 reference answers are attached for the categories that require them (math, reasoning, coding). The default inference kwargs set max_new_tokens to 1024, turns to 2, and calculate_loss to False, reflecting the open-ended generative nature of this benchmark.
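The loading flow described above can be approximated with the standard library alone. The following is a minimal sketch, not the ColossalEval implementation: the function name load_mtbench_sketch, the "target" field, and the empty-string placeholder for questions without a reference are illustrative assumptions, and the record keys ("question_id", "category", "turns", "choices") are assumed to follow the upstream MT-Bench file layout.

```python
import json
import os
from typing import Dict, List

# Defaults named in the description; the real module may define further keys.
DEFAULT_INFERENCE_KWARGS = {"calculate_loss": False, "max_new_tokens": 1024, "turns": 2}


def read_jsonl(file_path: str) -> List[dict]:
    """Read one JSON object per non-empty line."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_mtbench_sketch(path: str) -> Dict[str, Dict]:
    """Group MT-Bench questions by category (illustrative, not the library code)."""
    # Assumed reference record shape: {"question_id": ..., "choices": [{"turns": [...]}]}
    ref_file = os.path.join(path, "reference_answer", "gpt-4.jsonl")
    references = {
        ref["question_id"]: ref["choices"][0]["turns"] for ref in read_jsonl(ref_file)
    }

    categories: Dict[str, dict] = {}
    for question in read_jsonl(os.path.join(path, "question.jsonl")):
        turns = question["turns"]
        sample = {
            "id": question["question_id"],
            "category": question["category"],
            "instruction": turns,  # one instruction per turn
            "output": [],  # filled in later by inference
            # Hypothetical field: reference answers, or blanks when none exist.
            "target": references.get(question["question_id"], [""] * len(turns)),
        }
        entry = categories.setdefault(
            question["category"],
            {"data": [], "inference_kwargs": dict(DEFAULT_INFERENCE_KWARGS)},
        )
        entry["data"].append(sample)
    return {"test": categories}
```

Copying the default kwargs per category keeps a later change to one category's settings from leaking into the others.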

Usage

Use this class when you need to evaluate a language model on multi-turn conversation quality within the ColossalEval framework, typically in combination with the GPT judge evaluator for scoring.

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalEval/colossal_eval/dataset/mtbench.py
Lines: 1-75

Signature

class MTBenchDataset(BaseDataset):
    def __init__(self, path, logger: DistributedLogger, *args, **kwargs):
    @staticmethod
    def load(path: str, logger: DistributedLogger, *args, **kwargs) -> List[Dict]:

Import

from colossal_eval.dataset.mtbench import MTBenchDataset

I/O Contract

Inputs

Name	Type	Required	Description
path	str	Yes	Path to the directory containing "question.jsonl" and "reference_answer/gpt-4.jsonl"
logger	DistributedLogger	Yes	Logger instance for distributed logging
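Because path must point at a directory with a specific layout, it can help to check that layout before constructing the dataset. The helper below is a hypothetical convenience (missing_mtbench_files is not part of ColossalEval); it only tests for the two files named in the table above.

```python
import os
from typing import List

# Relative paths the loader is documented to read.
REQUIRED_FILES = ("question.jsonl", os.path.join("reference_answer", "gpt-4.jsonl"))


def missing_mtbench_files(path: str) -> List[str]:
    """Return the required relative paths that are absent under `path`."""
    return [rel for rel in REQUIRED_FILES if not os.path.isfile(os.path.join(path, rel))]
```

An empty list means both files are present; otherwise the returned names can be reported to the user before any loading is attempted.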

Outputs

Name	Type	Description
dataset	Dict[str, Dict]	A nested dictionary with a single split, "test", keyed by category. Each category entry holds "data" (a list of multi-turn samples whose instruction and output fields are lists) and "inference_kwargs" (calculate_loss=False, max_new_tokens=1024, turns=2)
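To make the nesting concrete, the sketch below walks a hand-built dictionary in the shape described in the table. The sample contents and the summarize helper are made up for illustration; only the "test", "data", "inference_kwargs", "instruction", and "output" keys come from the contract above.

```python
from typing import Dict

# Hand-built example in the documented shape (contents are invented).
example = {
    "test": {
        "math": {
            "data": [
                {"instruction": ["What is 2 + 2?", "Now double it."], "output": []},
            ],
            "inference_kwargs": {"calculate_loss": False, "max_new_tokens": 1024, "turns": 2},
        }
    }
}


def summarize(dataset: Dict[str, Dict]) -> Dict[str, int]:
    """Count samples per category, checking one instruction per configured turn."""
    counts = {}
    for category, entry in dataset["test"].items():
        turns = entry["inference_kwargs"]["turns"]
        for sample in entry["data"]:
            if len(sample["instruction"]) != turns:
                raise ValueError(f"{category}: expected {turns} instructions per sample")
        counts[category] = len(entry["data"])
    return counts
```

Calling summarize(example) returns a mapping from category name to sample count, which is a quick sanity check on a freshly loaded dataset.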

Usage Examples

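from colossal_eval.dataset.mtbench import MTBenchDataset
from colossalai.logging import DistributedLogger

# Logger used for distributed logging while the dataset is loaded
logger = DistributedLogger("mtbench")

# path must contain question.jsonl and reference_answer/gpt-4.jsonl
dataset = MTBenchDataset(path="/path/to/mt_bench/data", logger=logger)
# dataset.multiturn is True

# Write the converted dataset to disk
dataset.save("/path/to/output.json")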
