Implementation:Hpcaitech ColossalAI MTBenchDataset

From Leeroopedia


Knowledge Sources
Domains: Evaluation, Benchmarking
Last Updated: 2026-02-09 00:00 GMT

Overview

MTBenchDataset is a dataset wrapper class that loads and converts the MT-Bench multi-turn conversation benchmark into the ColossalEval inference format, supporting multi-turn dialogue evaluation with GPT-4 reference answers.

Description

The class extends BaseDataset and overrides __init__ to set its multiturn flag to True. Its static load method reads two files from the data directory: "question.jsonl" and "reference_answer/gpt-4.jsonl". Each question holds a list of multi-turn instructions and belongs to a category; GPT-4 reference answers are loaded for the categories that require them (math, reasoning, coding). The default inference kwargs set calculate_loss to False, max_new_tokens to 1024, and turns to 2, reflecting the open-ended generative nature of this benchmark.
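The loading steps described above can be sketched with the standard library alone. This is a minimal illustration, not the ColossalEval implementation: the function names (read_jsonl, load_mtbench_sketch), the JSONL field names ("question_id", "category", "turns", "choices"), and the "id" and "target" sample fields are assumptions made for the example. Only the two file names, the per-category grouping, and the default inference kwargs come from the description.

```python
import copy
import json
import os

# Defaults as stated in the description above.
DEFAULT_INFERENCE_KWARGS = {
    "calculate_loss": False,
    "max_new_tokens": 1024,
    "turns": 2,
}


def read_jsonl(file_path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_mtbench_sketch(path):
    """Group MT-Bench questions by category under a single "test" split."""
    # Assumed reference layout: {"question_id": ..., "choices": [{"turns": [...]}]}
    ref_file = os.path.join(path, "reference_answer", "gpt-4.jsonl")
    references = {
        ref["question_id"]: ref["choices"][0]["turns"]
        for ref in read_jsonl(ref_file)
    }

    dataset = {"test": {}}
    for question in read_jsonl(os.path.join(path, "question.jsonl")):
        category = question["category"]
        turns = question["turns"]
        sample = {
            "id": question["question_id"],  # hypothetical field name
            "category": category,
            "instruction": turns,  # one instruction per turn
            "output": [],  # left empty here; to be filled at inference time
            # Hypothetical field: reference answers if present, else blanks.
            "target": references.get(question["question_id"], [""] * len(turns)),
        }
        if category not in dataset["test"]:
            # Each category gets its own copy of the default kwargs.
            dataset["test"][category] = {
                "data": [],
                "inference_kwargs": copy.deepcopy(DEFAULT_INFERENCE_KWARGS),
            }
        dataset["test"][category]["data"].append(sample)
    return dataset
```

Copying the default kwargs per category keeps a later change to one category's settings from leaking into the others.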

Usage

Use this class when you need to evaluate a language model on multi-turn conversation quality within the ColossalEval framework, typically in combination with the GPT judge evaluator for scoring.

Code Reference

Source Location

Signature

class MTBenchDataset(BaseDataset):
    def __init__(self, path, logger: DistributedLogger, *args, **kwargs): ...
    @staticmethod
    def load(path: str, logger: DistributedLogger, *args, **kwargs) -> List[Dict]: ...

Import

from colossal_eval.dataset.mtbench import MTBenchDataset

I/O Contract

Inputs

Name | Type | Required | Description
path | str | Yes | Path to the directory containing "question.jsonl" and "reference_answer/gpt-4.jsonl"
logger | DistributedLogger | Yes | Logger instance for distributed logging

Outputs

Name | Type | Description
dataset | Dict[str, Dict] | A nested dictionary with split "test" containing per-category entries, each with "data" (list of multi-turn data samples where instruction and output are lists) and "inference_kwargs" (calculate_loss=False, max_new_tokens=1024, turns=2)
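To make the nested output shape concrete, the sketch below walks a dictionary of the documented form and reports, per category, the number of samples and the configured number of turns. The helper name summarize_mtbench and the sample contents are hypothetical; the placeholder data is hand-written and not taken from MT-Bench.

```python
def summarize_mtbench(dataset):
    """Return {category: (num_samples, turns)} for the "test" split."""
    summary = {}
    for category, entry in dataset["test"].items():
        for sample in entry["data"]:
            # Multi-turn samples carry one instruction per turn.
            if not isinstance(sample["instruction"], list):
                raise TypeError(f"{category}: instruction must be a list")
        summary[category] = (len(entry["data"]), entry["inference_kwargs"]["turns"])
    return summary


# Hand-written placeholder data in the documented shape (not real MT-Bench content).
example = {
    "test": {
        "math": {
            "data": [{"instruction": ["first turn", "second turn"], "output": []}],
            "inference_kwargs": {
                "calculate_loss": False,
                "max_new_tokens": 1024,
                "turns": 2,
            },
        }
    }
}

print(summarize_mtbench(example))  # {'math': (1, 2)}
```

A check like this can be a quick sanity test before handing the dataset to an inference run, since a single-string instruction would indicate a single-turn sample.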

Usage Examples

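from colossal_eval.dataset.mtbench import MTBenchDataset
from colossalai.logging import DistributedLogger

# Logger instance used for distributed logging.
logger = DistributedLogger("mtbench")

# path must contain question.jsonl and reference_answer/gpt-4.jsonl.
dataset = MTBenchDataset(path="/path/to/mt_bench/data", logger=logger)
# dataset.multiturn is True (set in __init__)

# Write the converted dataset to disk.
dataset.save("/path/to/output.json")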
