Implementation:Hpcaitech ColossalAI Eval BaseDataset
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
BaseDataset is the abstract base class for all dataset wrappers in the ColossalEval evaluation framework, defining the interface for loading and saving converted datasets.
Description
The module provides two classes: BaseDataset and DistributedDataset. BaseDataset requires each subclass to implement a static load method that reads an original benchmark dataset and converts it into the ColossalEval inference format. It also provides a save method that serializes the converted dataset to JSON using the jdump utility. DistributedDataset is a thin subclass of PyTorch's Dataset: it implements __len__ and __getitem__ over the data it is given, so converted samples can be served through PyTorch data loaders during distributed inference.
Usage
Use BaseDataset as the parent class when creating a new dataset wrapper for ColossalEval. All concrete dataset classes (AGIEval, CEval, MMLU, etc.) inherit from this base class. Use DistributedDataset when you need to wrap data for distributed inference with PyTorch data loaders.
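The subclassing pattern described above can be sketched without installing ColossalEval. The block below is a minimal stand-in, not the library code: BaseDatasetSketch, ToyDataset, and the nested dict layout are hypothetical names and shapes chosen for illustration, json.dump stands in for jdump, and the constructor is assumed to store the return value of load in self.dataset. The stand-in uses abc.ABC with staticmethod and abstractmethod rather than abstractstaticmethod, which has been deprecated since Python 3.3.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class BaseDatasetSketch(ABC):
    """Stand-in mirroring the BaseDataset signature; not the library class."""

    def __init__(self, path, logger, *args, **kwargs):
        # Assumption: the constructor stores whatever load() returns.
        self.dataset = self.load(path, logger, *args, **kwargs)

    def save(self, save_path):
        # Assumption: json.dump stands in for the jdump utility.
        with open(save_path, "w", encoding="utf-8") as f:
            json.dump(self.dataset, f, ensure_ascii=False, indent=4)

    @staticmethod
    @abstractmethod
    def load(path, logger, *args, **kwargs):
        """Convert the original benchmark files into a dict."""


class ToyDataset(BaseDatasetSketch):
    """Hypothetical wrapper; the dict layout below is illustrative only."""

    @staticmethod
    def load(path, logger, *args, **kwargs):
        # A real wrapper would read files under `path`; this one returns
        # hard-coded rows so the sketch runs anywhere.
        rows = [{"input": f"question {i}", "target": str(i)} for i in range(3)]
        return {"test": {"toy_category": {"data": rows}}}


# Constructing the wrapper triggers load(); save() writes the result as JSON.
toy = ToyDataset(path="unused", logger=None)
save_path = os.path.join(tempfile.mkdtemp(), "toy.json")
toy.save(save_path)

with open(save_path, encoding="utf-8") as f:
    reloaded = json.load(f)
```

Because the stand-in derives from abc.ABC, instantiating BaseDatasetSketch directly raises TypeError. The signature listed for BaseDataset shows no such base class, so the real class may not enforce the abstract method in the same way.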
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalEval/colossal_eval/dataset/base.py
- Lines: 1-39
Signature
class BaseDataset:
    def __init__(self, path, logger, *args, **kwargs):
    def save(self, save_path):
    @abstractstaticmethod
    def load(path, logger: DistributedLogger, *args, **kwargs):

class DistributedDataset(Dataset):
    def __init__(self, data):
    def __len__(self):
    def __getitem__(self, idx):
Import
from colossal_eval.dataset.base import BaseDataset, DistributedDataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to the original dataset files |
| logger | DistributedLogger | Yes | Logger instance for distributed logging |
Outputs
| Name | Type | Description |
|---|---|---|
| self.dataset | Dict | The converted dataset stored as an instance attribute after calling load |
Usage Examples
from colossal_eval.dataset.base import BaseDataset, DistributedDataset

# BaseDataset is abstract: instantiate a concrete subclass instead.

# DistributedDataset wraps an in-memory list of samples for distributed inference.
data = [{"input": "example1"}, {"input": "example2"}]
dist_dataset = DistributedDataset(data)
print(len(dist_dataset))  # 2
print(dist_dataset[0])    # {'input': 'example1'}
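To illustrate why a length and an index lookup are enough for distributed loading, here is a dependency-free sketch. MapStyleDataset is a hypothetical stand-in exposing the same three methods as DistributedDataset, and shard_indices is an illustrative helper that applies a strided split, one common way to divide a map-style dataset across processes. Whether ColossalEval divides data in exactly this way is not stated on this page.

```python
class MapStyleDataset:
    """Stand-in with the same three methods as DistributedDataset."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def shard_indices(num_items, rank, world_size):
    """Return the indices a given rank would read under a strided split."""
    return list(range(rank, num_items, world_size))


samples = MapStyleDataset([{"input": f"example{i}"} for i in range(5)])
world_size = 2  # hypothetical number of processes

# Each rank reads only its own indices; together the shards cover every sample.
shards = {
    rank: [samples[i] for i in shard_indices(len(samples), rank, world_size)]
    for rank in range(world_size)
}
```

With five samples and two ranks, rank 0 reads indices 0, 2, and 4, and rank 1 reads indices 1 and 3. In a PyTorch setup this bookkeeping is normally delegated to a sampler passed to the data loader instead of being written by hand.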