
Implementation:Hpcaitech ColossalAI DatasetEvaluator

From Leeroopedia


Knowledge Sources
Domains Evaluation, NLP
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool for computing evaluation metrics from inference results across multiple benchmarks, provided by ColossalEval.

Description

DatasetEvaluator loads the evaluation config and the inference results, then dispatches each requested metric to the metric function appropriate for the dataset. It aggregates the resulting scores per category and per model.
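The dispatch-and-aggregate pattern described above can be sketched as follows. This is a minimal illustration, not the ColossalEval implementation: the names `exact_match`, `METRIC_REGISTRY`, and `evaluate_by_category`, as well as the `output`/`target` sample fields, are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def exact_match(samples: List[dict]) -> float:
    """Hypothetical metric: fraction of samples whose output equals the target."""
    if not samples:
        return 0.0
    hits = sum(1 for s in samples if s["output"] == s["target"])
    return hits / len(samples)


# Hypothetical registry mapping a metric name to the function that computes it.
METRIC_REGISTRY: Dict[str, Callable[[List[dict]], float]] = {
    "exact_match": exact_match,
}


def evaluate_by_category(
    data: Dict[str, List[dict]], metrics: List[str]
) -> Dict[str, Dict[str, float]]:
    """Return category -> metric -> score for one dataset and one model."""
    results: Dict[str, Dict[str, float]] = defaultdict(dict)
    for category, samples in data.items():
        for metric in metrics:
            # Look up the metric function by name and apply it to this category.
            results[category][metric] = METRIC_REGISTRY[metric](samples)
    return dict(results)


# Toy inference results, grouped by category (made-up values).
toy_data = {
    "algebra": [{"output": "A", "target": "A"}, {"output": "B", "target": "C"}],
    "history": [{"output": "D", "target": "D"}],
}
scores = evaluate_by_category(toy_data, ["exact_match"])
```

A name-to-function registry keeps the evaluation loop independent of any particular metric, so supporting a new metric only requires a new registry entry.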

Usage

Instantiate DatasetEvaluator with a config path and a save path, then call get_evaluation_results() once for each dataset and model to be evaluated.

Code Reference

Source Location

  • Repository: ColossalAI
  • File: applications/ColossalEval/colossal_eval/evaluate/dataset_evaluator/dataset_evaluator.py
  • Lines: 39-335

Signature

from typing import Dict, List, Union


class DatasetEvaluator:
    def __init__(self, config_path: str, save_path: str):
        """
        Args:
            config_path: Path to evaluation config JSON
            save_path: Path to save evaluation results
        """

    def get_evaluation_results(
        self,
        data: Dict[str, Union[str, Dict]],
        dataset_name: str,
        model_name: str,
        metrics: List[str],
    ):
        """
        Compute evaluation metrics for a dataset.

        Args:
            data: Inference results dict
            dataset_name: Name of the benchmark
            model_name: Name of the model
            metrics: List of metric names to compute
        """

Import

from colossal_eval.evaluate.dataset_evaluator import DatasetEvaluator

I/O Contract

Inputs

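| Name | Type | Required | Description |
|------|------|----------|-------------|
| config_path | str | Yes | Path to the evaluation config JSON |
| save_path | str | Yes | Path where evaluation results are saved |
| data | Dict | Yes | Inference results with model outputs and targets |
| dataset_name | str | Yes | Benchmark name (e.g., "mmlu", "gsm8k") |
| model_name | str | Yes | Model identifier |
| metrics | List[str] | Yes | Metrics to compute ("first_token_accuracy", "perplexity", etc.) |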

Outputs

| Name | Type | Description |
|------|------|-------------|
| evaluation_results | Dict | Nested dict: model_name -> dataset_name -> category -> metric -> score |
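To make the nesting concrete, the sketch below flattens a dict of that shape into one row per score. It assumes only the model_name -> dataset_name -> category -> metric -> score layout stated above; the helper `flatten_results`, the category name, and the score are hypothetical placeholders, not real evaluation output.

```python
from typing import Dict, Iterator, Tuple

# model_name -> dataset_name -> category -> metric -> score
NestedResults = Dict[str, Dict[str, Dict[str, Dict[str, float]]]]


def flatten_results(
    results: NestedResults,
) -> Iterator[Tuple[str, str, str, str, float]]:
    """Yield (model, dataset, category, metric, score) for every stored score."""
    for model, datasets in results.items():
        for dataset, categories in datasets.items():
            for category, metric_scores in categories.items():
                for metric, score in metric_scores.items():
                    yield model, dataset, category, metric, score


# Placeholder structure with a made-up score, for illustration only.
example = {
    "llama-7b": {
        "mmlu": {
            "abstract_algebra": {"first_token_accuracy": 0.31},
        }
    }
}
rows = list(flatten_results(example))
```

Flat rows of this kind are convenient for writing a CSV file or for comparing several models side by side.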

Usage Examples

import json

from colossal_eval.evaluate.dataset_evaluator import DatasetEvaluator

# Create the evaluator with the evaluation config and the output path.
evaluator = DatasetEvaluator(
    config_path="config/evaluation/config.json",
    save_path="./eval_results.json",
)

# Load the inference results produced for the MMLU benchmark.
with open("results/mmlu.json") as f:
    inference_data = json.load(f)

# Compute the requested metrics for this dataset and model.
evaluator.get_evaluation_results(
    data=inference_data,
    dataset_name="mmlu",
    model_name="llama-7b",
    metrics=["first_token_accuracy"],
)

Related Pages

Implements Principle
