Implementation:Hpcaitech ColossalAI ColossalDataset

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Evaluation, Benchmarking
Last Updated	2026-02-09 00:00 GMT

Overview

ColossalDataset is a dataset wrapper class that loads custom Colossal evaluation data, organized by category, and converts it into the ColossalEval inference format.

Description

The class extends BaseDataset and provides a static load method that reads a JSON file and groups the data samples by their "category" field using the helper function get_data_per_category. Each category receives default inference kwargs: loss calculation disabled, no predefined answer classes, Chinese as the language, and max_new_tokens set to 256. The module also provides two configurable sets, single_choice_question and calculate_loss; adding a category name to one of them switches that category to single-choice classification or loss-based evaluation, respectively.
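The grouping and defaulting behaviour described above can be sketched in plain Python. This is a minimal illustration, not the library's code: group_by_category is a hypothetical stand-in for get_data_per_category, build_dataset is a hypothetical name, and the ["A", "B", "C", "D"] label set used for single-choice categories is an assumption, since this page does not state the override values.

```python
import copy
from collections import defaultdict
from typing import Dict, List

# Defaults as described on this page; key names follow the Outputs table.
DEFAULT_INFERENCE_KWARGS = {
    "calculate_loss": False,
    "all_classes": None,
    "language": "Chinese",
    "max_new_tokens": 256,
}

# Stand-ins for the configurable sets; empty means every category uses the defaults.
single_choice_question = set()
calculate_loss = set()


def group_by_category(samples: List[Dict]) -> Dict[str, List[Dict]]:
    """Group samples by their "category" field (hypothetical stand-in helper)."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["category"]].append(sample)
    return dict(grouped)


def build_dataset(samples: List[Dict]) -> Dict[str, Dict]:
    """Build the nested {"test": {category: {"data", "inference_kwargs"}}} structure."""
    dataset = {"test": {}}
    for category, items in group_by_category(samples).items():
        # Deep copy so that per-category overrides never leak into other categories.
        kwargs = copy.deepcopy(DEFAULT_INFERENCE_KWARGS)
        if category in calculate_loss:
            kwargs["calculate_loss"] = True
        if category in single_choice_question:
            # Assumed label set, for illustration only.
            kwargs["all_classes"] = ["A", "B", "C", "D"]
        dataset["test"][category] = {"data": list(items), "inference_kwargs": kwargs}
    return dataset


# Illustrative samples (made up for this sketch).
demo_samples = [
    {"category": "qa", "instruction": "Answer the question.", "input": "", "target": "t1", "id": 0},
    {"category": "qa", "instruction": "Answer the question.", "input": "", "target": "t2", "id": 1},
    {"category": "math", "instruction": "Solve.", "input": "1+1", "target": "2", "id": 2},
]
demo_dataset = build_dataset(demo_samples)
print(sorted(demo_dataset["test"]))
```

Copying the defaults per category keeps each category's settings independent, so changing one category's kwargs after loading does not affect the others.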

Usage

Use this class when you need to evaluate a language model on a custom Colossal-format dataset within the ColossalEval framework. The input JSON file should contain a list of data samples, each with "category", "instruction", "input", "target", and "id" fields.
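As an illustration of the expected input, the following sketch writes a small JSON file with the five fields listed above and checks that every sample carries them. The sample contents, the REQUIRED_FIELDS constant, and the missing_fields helper are hypothetical and only serve the example; the check is a convenience, not something this page says the class performs.

```python
import json
import os
import tempfile

# Field names taken from the description above.
REQUIRED_FIELDS = ("category", "instruction", "input", "target", "id")

# Made-up samples for illustration only.
samples = [
    {
        "category": "brainstorming",
        "instruction": "List three uses for a paperclip.",
        "input": "",
        "target": "",
        "id": 0,
    },
    {
        "category": "classification",
        "instruction": "Is the following review positive or negative?",
        "input": "The battery died after one day.",
        "target": "negative",
        "id": 1,
    },
]


def missing_fields(sample: dict) -> list:
    """Return the required field names that are absent from a sample."""
    return [field for field in REQUIRED_FIELDS if field not in sample]


# Write the list of samples to a temporary JSON file.
path = os.path.join(tempfile.mkdtemp(), "colossal_data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)

# Read it back and collect any samples with missing fields, keyed by sample id.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

problems = {s.get("id"): missing_fields(s) for s in loaded if missing_fields(s)}
print("invalid samples:", problems)
```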

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalEval/colossal_eval/dataset/colossalai.py
Lines: 1-71

Signature

class ColossalDataset(BaseDataset):
    @staticmethod
    def load(path: str, logger: DistributedLogger, *args, **kwargs) -> List[Dict]:

Import

from colossal_eval.dataset.colossalai import ColossalDataset

I/O Contract

Inputs

Name	Type	Required	Description
path	str	Yes	Path to a JSON file containing a list of data samples with category, instruction, input, target, and id fields
logger	DistributedLogger	Yes	Logger instance for distributed logging

Outputs

Name	Type	Description
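dataset	Dict[str, Dict]	Nested dictionary keyed by split ("test") and then by category. Each category entry holds "data" (the list of data samples) and "inference_kwargs" (calculate_loss, all_classes, language, max_new_tokens; max_new_tokens defaults to 256). The signature annotates the return type as List[Dict], although the value described here is a dictionary.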
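To make the output shape concrete, here is a hand-written dictionary with the layout described in the table, plus a small helper that walks it. The category name, the sample contents, and the summarize helper are hypothetical; only the key names come from this page.

```python
from typing import Dict, Tuple

# Hand-built example of the documented layout: split -> category -> entry.
example_output = {
    "test": {
        "brainstorming": {
            "data": [
                {"instruction": "List three uses for a paperclip.", "input": "", "target": "", "id": 0},
            ],
            "inference_kwargs": {
                "calculate_loss": False,
                "all_classes": None,
                "language": "Chinese",
                "max_new_tokens": 256,
            },
        },
    },
}


def summarize(dataset: Dict[str, Dict]) -> Dict[str, Tuple[int, int]]:
    """Map each category in the "test" split to (sample count, max_new_tokens)."""
    return {
        category: (len(entry["data"]), entry["inference_kwargs"]["max_new_tokens"])
        for category, entry in dataset["test"].items()
    }


summary = summarize(example_output)
for category, (count, max_new_tokens) in summary.items():
    print(f"{category}: {count} sample(s), max_new_tokens={max_new_tokens}")
```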

Usage Examples

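from colossal_eval.dataset.colossalai import ColossalDataset
from colossalai.logging import DistributedLogger

# Logger used for distributed logging while the dataset is loaded
logger = DistributedLogger("colossal")

# Load the custom Colossal-format JSON file and convert it to the inference format
dataset = ColossalDataset(path="/path/to/colossal_data.json", logger=logger)

# Write the converted dataset to disk
dataset.save("/path/to/output.json")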
