Implementation:Hpcaitech ColossalAI ColossalDataset
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
ColossalDataset is a dataset wrapper class that loads and converts custom Colossal evaluation data organized by categories into the ColossalEval inference format.
Description
The class extends BaseDataset and provides a static load method that reads a JSON file and groups data samples by their "category" field using the helper function get_data_per_category. For each category, it sets default inference kwargs with no loss calculation, no predefined answer classes, Chinese language, and a max_new_tokens of 256. The module also provides configurable sets single_choice_question and calculate_loss that allow users to customize which categories should use single-choice classification or loss-based evaluation.
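The grouping and per-category defaults described above can be sketched in plain Python. This is a minimal illustration, not the module's actual code: `group_by_category` is a hypothetical stand-in for get_data_per_category, `build_dataset` is a hypothetical name, the contents of `DEFAULT_INFERENCE_KWARGS` are inferred from the description, and the `["A", "B", "C", "D"]` answer classes for single-choice categories are an assumption.

```python
import copy
from collections import defaultdict
from typing import Dict, List

# Assumed defaults, inferred from the description; the real module may define more keys.
DEFAULT_INFERENCE_KWARGS = {
    "calculate_loss": False,
    "all_classes": None,
    "language": "Chinese",
    "max_new_tokens": 256,
}

# Category names added to these sets opt in to the non-default behaviour.
single_choice_question = set()
calculate_loss = set()


def group_by_category(samples: List[Dict]) -> Dict[str, List[Dict]]:
    """Hypothetical stand-in for get_data_per_category: bucket samples by "category"."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["category"]].append(sample)
    return dict(grouped)


def build_dataset(samples: List[Dict]) -> Dict[str, Dict]:
    """Build the nested {"test": {category: {...}}} structure from raw samples."""
    dataset = {"test": {}}
    for category, items in group_by_category(samples).items():
        # Deep-copy so that per-category changes never leak into the shared defaults.
        kwargs = copy.deepcopy(DEFAULT_INFERENCE_KWARGS)
        if category in single_choice_question:
            kwargs["all_classes"] = ["A", "B", "C", "D"]  # assumed label set
        if category in calculate_loss:
            kwargs["calculate_loss"] = True
        dataset["test"][category] = {"data": list(items), "inference_kwargs": kwargs}
    return dataset
```

Copying the defaults for each category keeps a change made for one category from affecting the others.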
Usage
Use this class when you need to evaluate a language model on a custom Colossal-format dataset within the ColossalEval framework. The input JSON file should contain a list of data samples, each with "category", "instruction", "input", "target", and "id" fields.
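As an illustration of the expected input, the sketch below writes a small JSON file with the five required fields and checks each sample for missing keys. The category names, sample texts, and the `missing_fields` helper are invented for this example and are not part of ColossalEval.

```python
import json
import os
import tempfile

# Fields every sample must carry, as listed in the usage notes.
REQUIRED_FIELDS = {"category", "instruction", "input", "target", "id"}

# Illustrative samples only; the categories and texts are made up.
samples = [
    {
        "category": "brainstorming",
        "instruction": "List three uses for a paperclip.",
        "input": "",
        "target": "",
        "id": 1,
    },
    {
        "category": "classification",
        "instruction": "Label the sentiment of the sentence.",
        "input": "The movie was wonderful.",
        "target": "positive",
        "id": 2,
    },
]


def missing_fields(sample: dict) -> set:
    """Return the required fields that are absent from a sample."""
    return REQUIRED_FIELDS - sample.keys()


# Write the list of samples as a single JSON array, which is the layout load expects.
path = os.path.join(tempfile.mkdtemp(), "colossal_data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```

Setting `ensure_ascii=False` keeps non-ASCII text, such as Chinese instructions, readable in the written file.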
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalEval/colossal_eval/dataset/colossalai.py
- Lines: 1-71
Signature
class ColossalDataset(BaseDataset):
    @staticmethod
    def load(path: str, logger: DistributedLogger, *args, **kwargs) -> List[Dict]:
Import
from colossal_eval.dataset.colossalai import ColossalDataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to a JSON file containing a list of data samples with category, instruction, input, target, and id fields |
| logger | DistributedLogger | Yes | Logger instance for distributed logging |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dict[str, Dict] | A nested dictionary keyed first by split (only "test") and then by category. Each category entry holds "data" (list of data samples) and "inference_kwargs" (calculate_loss, all_classes, language, max_new_tokens=256). The List[Dict] return annotation in the signature does not reflect this structure. |
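To make the output shape concrete, the following sketch walks a hand-written dictionary of that form and reports, per category, the number of samples and the configured max_new_tokens. The `example_output` literal and the `summarize` helper are hypothetical. The sample fields shown simply mirror the input fields, and real entries may carry additional keys.

```python
from typing import Dict, Tuple

# Hand-written example of the nested structure; values are illustrative only.
example_output = {
    "test": {
        "category_a": {
            "data": [
                {"category": "category_a", "instruction": "q1", "input": "", "target": "", "id": 1},
                {"category": "category_a", "instruction": "q2", "input": "", "target": "", "id": 2},
            ],
            "inference_kwargs": {
                "calculate_loss": False,
                "all_classes": None,
                "language": "Chinese",
                "max_new_tokens": 256,
            },
        },
        "category_b": {
            "data": [
                {"category": "category_b", "instruction": "q3", "input": "", "target": "", "id": 3},
            ],
            "inference_kwargs": {
                "calculate_loss": False,
                "all_classes": None,
                "language": "Chinese",
                "max_new_tokens": 256,
            },
        },
    }
}


def summarize(dataset: Dict[str, Dict]) -> Dict[str, Tuple[int, int]]:
    """Map each category in the "test" split to (sample count, max_new_tokens)."""
    summary = {}
    for category, entry in dataset["test"].items():
        summary[category] = (
            len(entry["data"]),
            entry["inference_kwargs"]["max_new_tokens"],
        )
    return summary
```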
Usage Examples
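from colossal_eval.dataset.colossalai import ColossalDataset
from colossalai.logging import DistributedLogger

# Logger passed through to the dataset loader.
logger = DistributedLogger("colossal")
# Load the JSON file at `path` and convert it to the ColossalEval inference format.
dataset = ColossalDataset(path="/path/to/colossal_data.json", logger=logger)
# Write the converted dataset to disk.
dataset.save("/path/to/output.json")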