Implementation:Hpcaitech ColossalAI GSMDataset

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Evaluation, Benchmarking
Last Updated	2026-02-09 00:00 GMT

Overview

GSMDataset is a dataset wrapper class that loads and converts the GSM8K (Grade School Math) benchmark into the ColossalEval inference format, supporting chain-of-thought reasoning with few-shot prompting.

Description

The class extends BaseDataset and provides a static load method that reads JSONL files for the test split and, optionally, the train and reference splits. The module includes a hardcoded few_shot_prompt containing 8 step-by-step math problem examples used for chain-of-thought (CoT) prompting. Each question is formatted with a "Let's think step by step" prompt to encourage reasoning. When forward_only is set, the loader prepares the data for perplexity evaluation based on overall loss; when load_reference is set, it also loads the mock test data (mock_gsm8k_test.jsonl) as a reference split.
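The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the code in gsm.py: the helper names read_jsonl and to_sample, the stand-in FEW_SHOT_PROMPT, the exact prompt wording, and the output keys "instruction" and "target" are all assumptions. Only the "question" and "answer" fields follow the public GSM8K JSONL layout.

```python
import json
from typing import Dict, List

# Hypothetical stand-in for the module-level few_shot_prompt
# (the real one holds 8 worked examples).
FEW_SHOT_PROMPT = "Question: What is 2 + 3?\nAnswer: 2 + 3 = 5. The answer is 5.\n\n"


def read_jsonl(text: str) -> List[Dict]:
    """Parse JSONL content: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_sample(record: Dict, few_shot: bool) -> Dict:
    """Turn one GSM8K record into a prompt/target pair (assumed layout)."""
    prompt = f"Question: {record['question']}\nLet's think step by step\nAnswer: "
    if few_shot:
        # Prepend the worked examples so the model imitates their reasoning style.
        prompt = FEW_SHOT_PROMPT + prompt
    return {"instruction": prompt, "target": record["answer"]}


# A single made-up record in GSM8K style; the answer ends with "#### <number>".
raw = '{"question": "Tom has 3 apples and buys 4 more. How many apples does he have?", "answer": "3 + 4 = 7\\n#### 7"}\n'
samples = [to_sample(r, few_shot=True) for r in read_jsonl(raw)]
print(samples[0]["instruction"])
```

Passing few_shot=False in this sketch yields a zero-shot prompt that still carries the step-by-step cue.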

Usage

Use this class when you need to evaluate a language model on grade-school math word problems within the ColossalEval framework. It supports both generative evaluation with chain-of-thought prompting and forward-only perplexity evaluation.

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalEval/colossal_eval/dataset/gsm.py
Lines: 1-141

Signature

class GSMDataset(BaseDataset):
    @staticmethod
    def load(
        path: str, logger: DistributedLogger, few_shot: bool, forward_only: bool, load_train: bool, load_reference: bool
    ) -> List[Dict]:

Import

from colossal_eval.dataset.gsm import GSMDataset

I/O Contract

Inputs

Name	Type	Required	Description
path	str	Yes	Path to the directory containing test.jsonl (and optionally train.jsonl, mock_gsm8k_test.jsonl)
logger	DistributedLogger	Yes	Logger instance for distributed logging
few_shot	bool	Yes	Whether to include the 8 built-in chain-of-thought few-shot examples
forward_only	bool	Yes	Whether to enable forward-only mode for perplexity/loss calculation
load_train	bool	Yes	Whether to also load the training split
load_reference	bool	Yes	Whether to also load the reference/mock test split
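To make the relationship between the flags and the files under path concrete, the sketch below maps each requested split to a file name. The helper split_files and the example directory are hypothetical; only the three file names come from the table above.

```python
import os
from typing import Dict


def split_files(path: str, load_train: bool, load_reference: bool) -> Dict[str, str]:
    """Map each requested split to the JSONL file it would be read from."""
    splits = ["test"]  # the test split is always loaded
    if load_train:
        splits.append("train")
    if load_reference:
        splits.append("reference")

    files = {}
    for split in splits:
        # The reference split uses the mock test file rather than "reference.jsonl".
        name = "mock_gsm8k_test.jsonl" if split == "reference" else f"{split}.jsonl"
        files[split] = os.path.join(path, name)
    return files


files = split_files("data/gsm8k", load_train=False, load_reference=True)
# Checking for missing files up front gives a clearer error than a failed read later.
missing = [f for f in files.values() if not os.path.exists(f)]
print(files, missing)
```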

Outputs

Name	Type	Description
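dataset	Dict[str, Dict]	A nested dictionary keyed by split ("test", plus "train" and "reference" when requested). Each split holds a single "math" category containing "data" (the list of data samples) and "inference_kwargs" (calculate_loss=True, language="English", max_new_tokens=256). Note that the signature annotates the return type as List[Dict], while the value described here is a dictionary.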
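The nesting can be pictured with a small hand-built example. This is a sketch under assumptions: make_category is a hypothetical helper, the sample fields are placeholders, and the "calculate_overall_loss" key used for forward-only mode is illustrative rather than confirmed from the source file.

```python
from typing import Dict, List


def make_category(data: List[Dict], forward_only: bool) -> Dict:
    """Build one category entry with the inference_kwargs listed above."""
    inference_kwargs = {"calculate_loss": True, "language": "English", "max_new_tokens": 256}
    # Assumption: forward-only mode is signalled by an extra flag; the key name is illustrative.
    if forward_only:
        inference_kwargs["calculate_overall_loss"] = True
    return {"data": data, "inference_kwargs": inference_kwargs}


# Split -> category -> {"data", "inference_kwargs"}
dataset = {
    "test": {"math": make_category([{"instruction": "...", "target": "..."}], forward_only=False)},
}

# Walk the structure the way an inference loop might.
for split, categories in dataset.items():
    for category, entry in categories.items():
        print(split, category, len(entry["data"]), entry["inference_kwargs"]["max_new_tokens"])
```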

Usage Examples

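from colossal_eval.dataset.gsm import GSMDataset
from colossalai.logging import get_dist_logger

# Reuse the shared distributed logger rather than constructing a new one.
logger = get_dist_logger()

# Build the dataset; the arguments mirror those of GSMDataset.load.
dataset = GSMDataset(
    path="/path/to/gsm8k/data",  # directory containing test.jsonl
    logger=logger,
    few_shot=True,  # include the 8 built-in CoT examples
    forward_only=False,
    load_train=False,
    load_reference=False,
)

# Write the converted dataset to disk as JSON.
dataset.save("/path/to/output.json")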
