Implementation:Hpcaitech ColossalAI DatasetEvaluator

Knowledge Sources	ColossalAI
Domains	Evaluation, NLP
Last Updated	2026-02-09 00:00 GMT

Overview

DatasetEvaluator is the concrete ColossalEval class that computes evaluation metrics from saved inference results across multiple benchmarks.

Description

DatasetEvaluator loads evaluation config and inference results, then dispatches to appropriate metric functions based on the dataset. It aggregates results per category and per model.
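The dispatch-and-aggregate pattern described above can be sketched as follows. This is a minimal illustration and not the ColossalEval implementation: the metric functions, the `METRIC_REGISTRY` name, and the sample layout (`output` / `target` keys) are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical sample layout: one model output and one reference target.
Sample = Dict[str, str]


def exact_match(sample: Sample) -> float:
    """Score 1.0 when the stripped output equals the stripped target."""
    return 1.0 if sample["output"].strip() == sample["target"].strip() else 0.0


def contains_target(sample: Sample) -> float:
    """Score 1.0 when the target string appears anywhere in the output."""
    return 1.0 if sample["target"].strip() in sample["output"] else 0.0


# Hypothetical registry mapping metric names to scoring functions.
METRIC_REGISTRY: Dict[str, Callable[[Sample], float]] = {
    "exact_match": exact_match,
    "contains_target": contains_target,
}


def evaluate_by_category(
    results_by_category: Dict[str, List[Sample]],
    metrics: List[str],
) -> Dict[str, Dict[str, float]]:
    """Look up each requested metric by name and average it per category."""
    scores: Dict[str, Dict[str, float]] = {}
    for category, samples in results_by_category.items():
        scores[category] = {}
        for name in metrics:
            metric_fn = METRIC_REGISTRY[name]  # dispatch by metric name
            values = [metric_fn(sample) for sample in samples]
            # Guard against empty categories to avoid division by zero.
            scores[category][name] = sum(values) / len(values) if values else 0.0
    return scores
```

Keeping the name-to-function mapping in one dictionary means a new metric only needs a new registry entry; the aggregation loop stays unchanged.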

Usage

Construct a DatasetEvaluator with the config path and the save path, then call get_evaluation_results() once for each dataset and model whose inference results you want scored.

Code Reference

Source Location

Repository: ColossalAI
File: applications/ColossalEval/colossal_eval/evaluate/dataset_evaluator/dataset_evaluator.py
Lines: 39-335

Signature

from typing import Dict, List, Union


class DatasetEvaluator:
    def __init__(self, config_path: str, save_path: str):
        """
        Args:
            config_path: Path to evaluation config JSON
            save_path: Path to save evaluation results
        """

    def get_evaluation_results(
        self,
        data: Dict[str, Union[str, Dict]],
        dataset_name: str,
        model_name: str,
        metrics: List[str],
    ):
        """
        Compute evaluation metrics for a dataset.

        Args:
            data: Inference results dict
            dataset_name: Name of the benchmark
            model_name: Name of the model
            metrics: List of metric names to compute
        """

Import

from colossal_eval.evaluate.dataset_evaluator import DatasetEvaluator

I/O Contract

Inputs

Name	Type	Required	Description
config_path	str	Yes	Evaluation config JSON path
save_path	str	Yes	Path to save evaluation results
data	Dict	Yes	Inference results with model outputs and targets
dataset_name	str	Yes	Benchmark name (e.g., "mmlu", "gsm8k")
model_name	str	Yes	Model identifier
metrics	List[str]	Yes	Metrics to compute ("first_token_accuracy", "perplexity", etc.)

Outputs

Name	Type	Description
evaluation_results	Dict	Nested dict: model_name -> dataset_name -> category -> metric -> score
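A nested dict of this shape is easier to compare or export once flattened into rows. The helper below is a hypothetical convenience, not part of ColossalEval, and assumes exactly the four nesting levels listed in the table above.

```python
from typing import Dict, Iterator, Tuple

# Assumed shape: model_name -> dataset_name -> category -> metric -> score
NestedScores = Dict[str, Dict[str, Dict[str, Dict[str, float]]]]


def flatten_scores(
    results: NestedScores,
) -> Iterator[Tuple[str, str, str, str, float]]:
    """Yield one (model, dataset, category, metric, score) row per leaf score."""
    for model_name, datasets in results.items():
        for dataset_name, categories in datasets.items():
            for category, metrics in categories.items():
                for metric, score in metrics.items():
                    yield model_name, dataset_name, category, metric, score
```

Each yielded tuple can be written directly as a CSV row or loaded into a table for side-by-side model comparison.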

Usage Examples

from colossal_eval.evaluate.dataset_evaluator import DatasetEvaluator
import json

evaluator = DatasetEvaluator(
    config_path="config/evaluation/config.json",
    save_path="./eval_results.json",
)

with open("results/mmlu.json") as f:
    inference_data = json.load(f)

evaluator.get_evaluation_results(
    data=inference_data,
    dataset_name="mmlu",
    model_name="llama-7b",
    metrics=["first_token_accuracy"],
)
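To score several benchmarks for one model, the call above can be wrapped in a loop. The sketch below assumes one `<dataset_name>.json` file per benchmark in a results directory; that file layout and the `evaluate_all` helper are assumptions for illustration, not ColossalEval conventions. The evaluator is passed in as a parameter, so any object exposing `get_evaluation_results()` with the signature shown earlier will work.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def evaluate_all(
    evaluator: Any,
    results_dir: str,
    model_name: str,
    metrics_by_dataset: Dict[str, List[str]],
) -> List[str]:
    """Call get_evaluation_results() for every dataset that has a results file.

    Returns the names of the datasets that were actually evaluated.
    """
    evaluated: List[str] = []
    for dataset_name, metrics in metrics_by_dataset.items():
        # Assumed layout: <results_dir>/<dataset_name>.json
        result_file = Path(results_dir) / f"{dataset_name}.json"
        if not result_file.exists():
            continue  # skip benchmarks with no inference output yet
        with result_file.open(encoding="utf-8") as f:
            data = json.load(f)
        evaluator.get_evaluation_results(
            data=data,
            dataset_name=dataset_name,
            model_name=model_name,
            metrics=metrics,
        )
        evaluated.append(dataset_name)
    return evaluated
```

For example, `evaluate_all(evaluator, "results", "llama-7b", {"mmlu": ["first_token_accuracy"]})` would repeat the single-dataset call shown above for each entry in the mapping.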

Related Pages

Implements Principle

Principle:Hpcaitech_ColossalAI_Benchmark_Metric_Computation
