Implementation:Hpcaitech ColossalAI MTBenchDataset
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
MTBenchDataset is a dataset wrapper class that loads and converts the MT-Bench multi-turn conversation benchmark into the ColossalEval inference format, supporting multi-turn dialogue evaluation with GPT-4 reference answers.
Description
The class extends BaseDataset and overrides __init__ to set the multiturn flag to True. Its static load method reads "question.jsonl" and "reference_answer/gpt-4.jsonl" from the data directory. Each question carries a category and a list of per-turn instructions, and GPT-4 reference answers are attached for the categories that require them (math, reasoning, coding). The default inference kwargs set max_new_tokens to 1024, turns to 2, and calculate_loss to False, because the benchmark scores open-ended generation rather than likelihood.
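The loading flow described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the ColossalEval source: the helper names `read_jsonl` and `load_mtbench_sketch` are hypothetical, the JSONL field names (`question_id`, `category`, `turns`, `choices`) are assumed from the public MT-Bench data layout, and the `target` field with its empty-string placeholders is an assumption about how missing references might be represented.

```python
import copy
import json
import os

# Defaults named in the description; the real class may define further keys.
DEFAULT_INFERENCE_KWARGS = {
    "calculate_loss": False,
    "max_new_tokens": 1024,
    "turns": 2,
}


def read_jsonl(file_path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_mtbench_sketch(path):
    """Hypothetical stand-in for MTBenchDataset.load (field names assumed)."""
    questions = read_jsonl(os.path.join(path, "question.jsonl"))
    references = read_jsonl(os.path.join(path, "reference_answer", "gpt-4.jsonl"))

    # Map question id -> list of GPT-4 reference answers, one per turn.
    ref_by_id = {r["question_id"]: r["choices"][0]["turns"] for r in references}

    dataset = {"test": {}}
    for q in questions:
        category = q["category"]
        sample = {
            "id": q["question_id"],
            "category": category,
            "instruction": q["turns"],  # list: one prompt per turn
            "output": [],  # list: filled in at inference time
            # Questions without a reference get one empty string per turn.
            "target": ref_by_id.get(q["question_id"], [""] * len(q["turns"])),
        }
        # Create the per-category entry on first use, then append the sample.
        entry = dataset["test"].setdefault(
            category,
            {"data": [], "inference_kwargs": copy.deepcopy(DEFAULT_INFERENCE_KWARGS)},
        )
        entry["data"].append(sample)
    return dataset
```

Each category gets its own deep copy of the inference kwargs, so changing the settings for one category would not leak into the others.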
Usage
Use this class when you need to evaluate a language model on multi-turn conversation quality within the ColossalEval framework, typically in combination with the GPT judge evaluator for scoring.
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalEval/colossal_eval/dataset/mtbench.py
- Lines: 1-75
Signature
class MTBenchDataset(BaseDataset):
    def __init__(self, path, logger: DistributedLogger, *args, **kwargs):

    @staticmethod
    def load(path: str, logger: DistributedLogger, *args, **kwargs) -> List[Dict]:
Import
from colossal_eval.dataset.mtbench import MTBenchDataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to the directory containing "question.jsonl" and "reference_answer/gpt-4.jsonl" |
| logger | DistributedLogger | Yes | Logger instance for distributed logging |
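Because the path argument must point at a directory with a specific layout, a small pre-flight check can catch a wrong path before the dataset is constructed. The sketch below is illustrative only: `missing_mtbench_files` is a hypothetical helper, not part of ColossalEval, and it only tests that the two files named in the table exist.

```python
import os

# Relative paths the Inputs table says must exist under `path`.
REQUIRED_FILES = (
    "question.jsonl",
    os.path.join("reference_answer", "gpt-4.jsonl"),
)


def missing_mtbench_files(path):
    """Return the required files (relative to `path`) that are not present."""
    return [
        rel for rel in REQUIRED_FILES
        if not os.path.isfile(os.path.join(path, rel))
    ]
```

A caller could run `missing_mtbench_files("/path/to/mt_bench/data")` and raise a FileNotFoundError listing the result if it is non-empty.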
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dict[str, Dict] | A nested dictionary keyed by split. The "test" split maps each category to an entry with "data" (a list of multi-turn samples whose instruction and output fields are lists) and "inference_kwargs" (calculate_loss=False, max_new_tokens=1024, turns=2). The signature annotates the return type of load as List[Dict], but the value described here is a dictionary. |
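To make the nesting concrete, the following sketch walks a hand-written dictionary shaped like the output described in the table. The sample content and the `summarize` helper are made up for illustration; only the keys "test", "data", "inference_kwargs", "instruction", and "output" come from the description above.

```python
def summarize(dataset):
    """Return {category: (number of samples, configured turns)} for the test split."""
    summary = {}
    for category, entry in dataset["test"].items():
        summary[category] = (len(entry["data"]), entry["inference_kwargs"]["turns"])
    return summary


# Hand-written example in the documented shape (content is invented).
example = {
    "test": {
        "writing": {
            "data": [
                {
                    "instruction": ["Write a haiku.", "Now rewrite it as a limerick."],
                    "output": [],
                },
            ],
            "inference_kwargs": {
                "calculate_loss": False,
                "max_new_tokens": 1024,
                "turns": 2,
            },
        }
    }
}

print(summarize(example))  # {'writing': (1, 2)}
```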
Usage Examples
from colossalai.logging import DistributedLogger
from colossal_eval.dataset.mtbench import MTBenchDataset

# Create a logger, then convert the raw MT-Bench files found under `path`.
logger = DistributedLogger("mtbench")
dataset = MTBenchDataset(path="/path/to/mt_bench/data", logger=logger)

# dataset.multiturn is True
# Write the converted dataset to disk.
dataset.save("/path/to/output.json")