Implementation:Hpcaitech ColossalAI LongBenchDataset

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Evaluation, Benchmarking
Last Updated	2026-02-09 00:00 GMT

Overview

LongBenchDataset is a dataset wrapper class that loads the LongBench long-context understanding benchmark and converts it into the ColossalEval inference format. It covers 21 tasks, including QA, summarization, few-shot learning, and code completion.

Description

The class extends BaseDataset and provides a static load method that reads JSONL files from a flat directory, one file per task category. Each task has its own prompt template in the dataset2prompt dictionary and a maximum generation length in dataset2maxlen, ranging from 32 tokens for short-answer QA tasks to 512 tokens for summarization tasks. The tasks are in English and Chinese and span single-document QA (NarrativeQA, Qasper), multi-document QA (HotpotQA, 2WikiMQA), summarization (GovReport, Multi-News, VCSUM), few-shot classification (TREC, LSHT), and code tasks (LCC, RepoBench-P). Files whose task name ends with "_e" (the LongBench-E variants) are skipped.
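The per-task template and length lookup described above can be sketched as follows. This is a minimal illustration, not the code in longbench.py: the template wording, the raw field names (context, input, answers), the sample keys, and the helper build_sample are assumptions made for the example.

```python
from typing import Dict

# Hypothetical excerpts. The real dataset2prompt / dataset2maxlen dictionaries
# live in longbench.py and cover every task; the wording here is illustrative.
DATASET2PROMPT: Dict[str, str] = {
    "hotpotqa": (
        "Answer the question based on the given passages.\n\n"
        "{context}\n\nQuestion: {input}\nAnswer:"
    ),
    "gov_report": (
        "Write a one-page summary of the following report.\n\n"
        "{context}\n\nSummary:"
    ),
}
DATASET2MAXLEN: Dict[str, int] = {"hotpotqa": 32, "gov_report": 512}


def build_sample(category: str, record: Dict) -> Dict:
    """Turn one raw record into an inference sample (field names are assumptions)."""
    # str.format ignores keyword arguments the template does not reference,
    # so the whole record can be passed in.
    prompt = DATASET2PROMPT[category].format(**record)
    return {
        "category": category,
        "instruction": prompt,
        # The target is kept as a list of acceptable answer strings.
        "target": list(record["answers"]),
    }


record = {
    "context": "Paris is the capital of France.",
    "input": "What is the capital of France?",
    "answers": ["Paris"],
}
sample = build_sample("hotpotqa", record)
print(sample["instruction"])
print("max_new_tokens:", DATASET2MAXLEN[sample["category"]])
```

Keeping the template and the generation budget in two dictionaries keyed by task name means a new task only needs two entries rather than a new code path.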

Usage

Use this class when you need to evaluate a language model's long-context comprehension abilities within the ColossalEval framework. The data directory should contain JSONL files from the LongBench dataset.
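Before running an evaluation it can help to confirm which task files the directory actually provides. The helper below is a hypothetical pre-flight check, not part of ColossalEval; it assumes task files are named <task>.jsonl and mirrors the rule that "_e" variants are skipped.

```python
import os
from typing import List


def list_task_categories(path: str) -> List[str]:
    """Return the task names a loader like the one described would pick up."""
    categories = []
    for name in sorted(os.listdir(path)):
        if not name.endswith(".jsonl"):
            continue  # ignore anything that is not a JSONL file
        category = name[: -len(".jsonl")]
        if category.endswith("_e"):
            continue  # "_e" variants are skipped, as described above
        categories.append(category)
    if not categories:
        raise FileNotFoundError(f"No LongBench JSONL task files found in {path!r}")
    return categories
```

Calling list_task_categories("/path/to/longbench/data") on a complete download should list one entry per task; an empty or mistyped directory raises instead of silently producing an empty dataset.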

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalEval/colossal_eval/dataset/longbench.py
Lines: 1-121

Signature

class LongBenchDataset(BaseDataset):
    @staticmethod
    def load(path: str, logger: DistributedLogger, *args, **kwargs) -> List[Dict]:

Import

from colossal_eval.dataset.longbench import LongBenchDataset

I/O Contract

Inputs

Name	Type	Required	Description
path	str	Yes	Path to the directory containing per-task JSONL files (e.g., narrativeqa.jsonl, hotpotqa.jsonl)
logger	DistributedLogger	Yes	Logger instance for distributed logging

Outputs

Name	Type	Description
dataset	Dict[str, Dict]	Nested dictionary keyed by split. The single "test" split maps each task category to "data" (a list of samples whose target is a list of answer strings) and "inference_kwargs" (calculate_loss=True, all_classes taken from the data, and a per-task max_new_tokens)
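To make the nesting concrete, the sketch below builds a hand-written stand-in for the returned object and walks it. The sample contents and the summarize helper are illustrative assumptions; only the split / category / "data" / "inference_kwargs" layout is taken from the table above.

```python
from typing import Dict

# Hand-written stand-in for the returned object; values are illustrative only.
dataset: Dict[str, Dict] = {
    "test": {
        "hotpotqa": {
            "data": [
                {"instruction": "<formatted prompt>", "target": ["Paris"]},
            ],
            "inference_kwargs": {
                "calculate_loss": True,
                "all_classes": None,
                "max_new_tokens": 32,
            },
        },
    }
}


def summarize(dataset: Dict[str, Dict]) -> Dict[str, Dict[str, int]]:
    """Map each task category to its sample count and generation budget."""
    return {
        category: {
            "num_samples": len(entry["data"]),
            "max_new_tokens": entry["inference_kwargs"]["max_new_tokens"],
        }
        for category, entry in dataset["test"].items()
    }


print(summarize(dataset))
```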

Usage Examples

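from colossal_eval.dataset.longbench import LongBenchDataset
from colossalai.logging import get_dist_logger

# get_dist_logger returns the shared DistributedLogger instance
logger = get_dist_logger()

# Constructing the dataset loads and converts the JSONL files under path
dataset = LongBenchDataset(path="/path/to/longbench/data", logger=logger)

# Write the converted inference dataset to disk
dataset.save("/path/to/output.json")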
