Implementation:Hpcaitech ColossalAI DatasetEvaluator
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for computing evaluation metrics from inference results across multiple benchmarks, provided by ColossalEval.
Description
DatasetEvaluator loads the evaluation config and the inference results, then dispatches each requested metric to the metric function appropriate for the dataset. It aggregates the resulting scores per category and per model.
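The dispatch-and-aggregate pattern described above can be illustrated with a minimal, self-contained sketch. This is not ColossalEval's code: `exact_match`, `METRIC_REGISTRY`, `evaluate_by_category`, and the sample fields (`category`, `output`, `target`) are hypothetical names chosen for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical metric: fraction of outputs that exactly match their targets.
def exact_match(outputs: List[str], targets: List[str]) -> float:
    if not targets:
        return 0.0
    hits = sum(o.strip() == t.strip() for o, t in zip(outputs, targets))
    return hits / len(targets)

# Hypothetical registry mapping metric names to metric functions.
METRIC_REGISTRY: Dict[str, Callable[[List[str], List[str]], float]] = {
    "exact_match": exact_match,
}

def evaluate_by_category(
    samples: List[Dict[str, str]], metrics: List[str]
) -> Dict[str, Dict[str, float]]:
    """Group samples by category, then compute each requested metric per group."""
    grouped = defaultdict(lambda: {"outputs": [], "targets": []})
    for sample in samples:
        group = grouped[sample["category"]]
        group["outputs"].append(sample["output"])
        group["targets"].append(sample["target"])

    results: Dict[str, Dict[str, float]] = {}
    for category, group in grouped.items():
        results[category] = {}
        for name in metrics:
            if name not in METRIC_REGISTRY:
                raise ValueError(f"Unknown metric: {name}")
            # Dispatch by metric name to the registered function.
            results[category][name] = METRIC_REGISTRY[name](
                group["outputs"], group["targets"]
            )
    return results
```

Looking up metric functions by name keeps the evaluation loop independent of any particular metric, so supporting a new metric only requires registering another function.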
Usage
Construct a DatasetEvaluator with the evaluation config path and the results save path, then call get_evaluation_results() once per dataset.
Code Reference
Source Location
- Repository: ColossalAI
- File: applications/ColossalEval/colossal_eval/evaluate/dataset_evaluator/dataset_evaluator.py
- Lines: 39-335
Signature
class DatasetEvaluator:
    def __init__(self, config_path: str, save_path: str):
        """
        Args:
            config_path: Path to evaluation config JSON
            save_path: Path to save evaluation results
        """

    def get_evaluation_results(
        self,
        data: Dict[str, Union[str, Dict]],
        dataset_name: str,
        model_name: str,
        metrics: List[str],
    ):
        """
        Compute evaluation metrics for a dataset.

        Args:
            data: Inference results dict
            dataset_name: Name of the benchmark
            model_name: Name of the model
            metrics: List of metric names to compute
        """
Import
from colossal_eval.evaluate.dataset_evaluator import DatasetEvaluator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config_path | str | Yes | Evaluation config JSON path |
| save_path | str | Yes | Path to save evaluation results |
| data | Dict | Yes | Inference results with model outputs and targets |
| dataset_name | str | Yes | Benchmark name (e.g., "mmlu", "gsm8k") |
| model_name | str | Yes | Model identifier |
| metrics | List[str] | Yes | Metrics to compute ("first_token_accuracy", "perplexity", etc.) |
Outputs
| Name | Type | Description |
|---|---|---|
| evaluation_results | Dict | Nested dict: model_name -> dataset_name -> category -> metric -> score |
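To make the nesting order concrete, the sketch below flattens a dict of that shape into one row per score. `flatten_results` is a hypothetical helper, not part of ColossalEval, and it assumes only the model_name -> dataset_name -> category -> metric -> score layout stated in the table.

```python
from typing import Dict, Iterator, Tuple

# Assumed layout: model_name -> dataset_name -> category -> metric -> score
NestedResults = Dict[str, Dict[str, Dict[str, Dict[str, float]]]]

def flatten_results(
    results: NestedResults,
) -> Iterator[Tuple[str, str, str, str, float]]:
    """Yield (model, dataset, category, metric, score) for every leaf score."""
    for model_name, datasets in results.items():
        for dataset_name, categories in datasets.items():
            for category, metrics in categories.items():
                for metric, score in metrics.items():
                    yield (model_name, dataset_name, category, metric, score)
```

Flat rows of this kind are convenient for writing a CSV file or comparing models side by side.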
Usage Examples
import json

from colossal_eval.evaluate.dataset_evaluator import DatasetEvaluator

# Point the evaluator at the evaluation config and the results save path.
evaluator = DatasetEvaluator(
    config_path="config/evaluation/config.json",
    save_path="./eval_results.json",
)

# Load the inference results for one dataset.
with open("results/mmlu.json") as f:
    inference_data = json.load(f)

# Compute the requested metrics for this dataset and model.
evaluator.get_evaluation_results(
    data=inference_data,
    dataset_name="mmlu",
    model_name="llama-7b",
    metrics=["first_token_accuracy"],
)
Related Pages
Implements Principle