Implementation:Hpcaitech ColossalAI MMLUDataset
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
MMLUDataset is a dataset wrapper class that loads and converts the MMLU (Massive Multitask Language Understanding) benchmark into the ColossalEval inference format, supporting English multiple-choice evaluation across diverse academic subjects.
Description
The class extends BaseDataset and provides a static load method that reads per-subject CSV files from the "dev" and "test" subdirectories. Subject names are derived from file names by replacing underscores with spaces and title-casing each word, except that "us" is rendered as "US". Each question is formatted as an English single-choice prompt with four options (A-D). When few-shot evaluation is enabled, dev-split examples are prepended to test-split samples using the get_few_shot_data helper function. The default inference kwargs enable loss calculation and set all_classes to ["A", "B", "C", "D"], language to "English", and max_new_tokens to 32.
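The naming and formatting rules described above can be illustrated with a small standalone sketch. It is not the ColossalEval implementation: the helper names `subject_from_filename` and `format_question`, the `_<split>.csv` file suffix, and the exact prompt layout are assumptions made for illustration.

```python
from typing import List


def subject_from_filename(filename: str, split: str) -> str:
    """Derive a display name such as 'High School US History' from a CSV file name."""
    suffix = f"_{split}.csv"  # assumed naming convention, e.g. "..._test.csv"
    if not filename.endswith(suffix):
        raise ValueError(f"{filename!r} does not end with {suffix!r}")
    words = filename[: -len(suffix)].split("_")
    # Title-case every word, but keep "us" as the acronym "US".
    return " ".join("US" if word == "us" else word.title() for word in words)


def format_question(question: str, options: List[str]) -> str:
    """Lay out a question followed by four lettered options (illustrative template)."""
    if len(options) != 4:
        raise ValueError("expected exactly four options (A-D)")
    lines = [question] + [f"{letter}. {text}" for letter, text in zip("ABCD", options)]
    return "\n".join(lines)


name = subject_from_filename("high_school_us_history_test.csv", "test")
prompt = format_question("What is 2 + 2?", ["3", "4", "5", "6"])
print(name)
print(prompt)
```

The special case for "us" matters because plain title-casing would produce "Us History" rather than "US History".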
Usage
Use this class when you need to evaluate a language model on the MMLU benchmark within the ColossalEval framework. It expects the MMLU dataset organized with "dev" and "test" subdirectories containing per-subject CSV files.
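Before pointing the class at a data directory, it can help to confirm that the layout matches what is expected. The sketch below builds a throwaway directory and lists the CSV files found in each split. The function names, the `abstract_algebra_<split>.csv` file names, and the header-less row layout (question, four options, answer letter) are assumptions for illustration, not part of ColossalEval.

```python
import csv
import os
import tempfile
from typing import Dict, List


def list_subject_files(root: str) -> Dict[str, List[str]]:
    """Map each split to the sorted CSV file names found under root/<split>."""
    found = {}
    for split in ("dev", "test"):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        found[split] = sorted(f for f in os.listdir(split_dir) if f.endswith(".csv"))
    return found


def build_demo_layout(root: str) -> None:
    """Create dev/ and test/ directories holding one tiny demo CSV each."""
    row = ["What is 2 + 2?", "3", "4", "5", "6", "B"]  # assumed column order
    for split in ("dev", "test"):
        os.makedirs(os.path.join(root, split), exist_ok=True)
        csv_path = os.path.join(root, split, f"abstract_algebra_{split}.csv")
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerow(row)


with tempfile.TemporaryDirectory() as demo_root:
    build_demo_layout(demo_root)
    layout = list_subject_files(demo_root)

print(layout)
```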
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalEval/colossal_eval/dataset/mmlu.py
- Lines: 1-74
Signature
class MMLUDataset(BaseDataset):
    @staticmethod
    def load(path: str, logger: DistributedLogger, few_shot: bool, *args, **kwargs) -> List[Dict]:
Import
from colossal_eval.dataset.mmlu import MMLUDataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to the directory containing "dev" and "test" subdirectories with per-subject CSV files |
| logger | DistributedLogger | Yes | Logger instance for distributed logging |
| few_shot | bool | Yes | Whether to prepend dev-split examples as few-shot demonstrations for the test split |
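The effect of few_shot can be pictured with a minimal sketch in which solved dev-split samples are attached to every test-split sample as demonstrations. This is a stand-in, not the real get_few_shot_data helper: the function names and the sample fields "input", "target", and "few_shot_data" are hypothetical.

```python
from typing import Dict, List


def render_examples(dev_samples: List[Dict[str, str]]) -> List[str]:
    """Turn solved dev samples into demonstration strings (question plus answer)."""
    return [sample["input"] + " " + sample["target"] for sample in dev_samples]


def attach_few_shot(test_samples: List[Dict], dev_samples: List[Dict[str, str]]) -> List[Dict]:
    """Return copies of the test samples that carry the dev demonstrations."""
    examples = render_examples(dev_samples)
    # Build new dicts so the original test samples are left unmodified.
    return [{**sample, "few_shot_data": examples} for sample in test_samples]


dev = [{"input": "1 + 1 = ?\nA. 1\nB. 2\nC. 3\nD. 4\nAnswer:", "target": "B"}]
test = [{"input": "2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:", "target": "B"}]
augmented = attach_few_shot(test, dev)
print(augmented[0]["few_shot_data"])
```

Only the test split receives demonstrations in this sketch, which mirrors the description of few_shot in the table above.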
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dict[str, Dict] | A nested dictionary keyed by split ("dev" and "test"). Each split maps subject categories to an entry with "data" (list of data samples) and "inference_kwargs" (calculate_loss=True, all_classes=["A","B","C","D"], language="English", max_new_tokens=32). The signature annotates the return type as List[Dict], which does not match this dictionary structure. |
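The nesting described in the table can be made concrete with a hand-built example and a small traversal. The split, "data", and "inference_kwargs" keys follow the description above. The sample fields ("input", "target"), the subject name, and the `count_samples` helper are hypothetical.

```python
import copy
from typing import Any, Dict

# Default inference kwargs as listed in the Outputs table.
INFERENCE_KWARGS: Dict[str, Any] = {
    "calculate_loss": True,
    "all_classes": ["A", "B", "C", "D"],
    "language": "English",
    "max_new_tokens": 32,
}


def make_entry(num_samples: int) -> Dict[str, Any]:
    """Build one subject entry holding placeholder samples (fields are hypothetical)."""
    samples = [{"input": f"question {i}", "target": "A"} for i in range(num_samples)]
    # Deep-copy so each entry owns its own kwargs, including the all_classes list.
    return {"data": samples, "inference_kwargs": copy.deepcopy(INFERENCE_KWARGS)}


dataset = {
    "dev": {"Abstract Algebra": make_entry(1)},
    "test": {"Abstract Algebra": make_entry(2)},
}


def count_samples(data: Dict[str, Dict[str, Dict[str, Any]]]) -> Dict[str, Dict[str, int]]:
    """Report how many samples each subject holds in each split."""
    return {
        split: {subject: len(entry["data"]) for subject, entry in subjects.items()}
        for split, subjects in data.items()
    }


counts = count_samples(dataset)
print(counts)
```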
Usage Examples
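from colossal_eval.dataset.mmlu import MMLUDataset
from colossalai.logging import DistributedLogger
# Logger used for distributed logging while the dataset is loaded.
logger = DistributedLogger("mmlu")
# path must contain "dev" and "test" subdirectories with per-subject CSV files;
# few_shot=True prepends dev-split examples as demonstrations for the test split.
dataset = MMLUDataset(path="/path/to/mmlu/data", logger=logger, few_shot=True)
# Write the converted dataset to a JSON file.
dataset.save("/path/to/output.json")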