Implementation:Hpcaitech ColossalAI LongBenchDataset
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
LongBenchDataset is a dataset wrapper class that loads and converts the LongBench long-context understanding benchmark into the ColossalEval inference format, covering 21 task types including QA, summarization, few-shot learning, and code completion.
Description
The class extends BaseDataset and provides a static load method that reads JSONL files from a flat directory, one file per task category. Each task has its own prompt template in the dataset2prompt dictionary and a maximum generation length in dataset2maxlen; the limits range from 32 tokens for short-answer tasks to 512 tokens for summarization tasks. The module supports tasks in both English and Chinese, spanning single-document QA (NarrativeQA, Qasper), multi-document QA (HotpotQA, 2WikiMQA), summarization (GovReport, Multi-News, VCSUM), few-shot classification (TREC, LSHT), and code tasks (LCC, RepoBench-P). Files whose task name (the part before the .jsonl extension) ends with "_e" are treated as extended versions and skipped.
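The loading flow described above can be sketched with the standard library alone. This is a simplified illustration, not the module's code: load_longbench_sketch, PROMPTS, and MAX_NEW_TOKENS are hypothetical names (the latter two stand in for dataset2prompt and dataset2maxlen), the template text is invented, and the record fields input, context, answers, and all_classes are assumed to follow the public LongBench JSONL layout.

```python
import json
import os
from typing import Dict, List

# Hypothetical stand-ins for dataset2prompt and dataset2maxlen (one task shown).
PROMPTS: Dict[str, str] = {
    "hotpotqa": "Answer based on the passages.\n\n{context}\n\nQuestion: {input}\nAnswer:",
}
MAX_NEW_TOKENS: Dict[str, int] = {"hotpotqa": 32}


def load_longbench_sketch(path: str) -> Dict[str, Dict]:
    """Convert per-task JSONL files under `path` into a nested "test" split."""
    dataset: Dict[str, Dict] = {"test": {}}
    for file_name in sorted(os.listdir(path)):
        if not file_name.endswith(".jsonl"):
            continue
        category = file_name[: -len(".jsonl")]
        if category.endswith("_e"):
            continue  # extended variants are skipped

        # One JSON object per non-empty line.
        with open(os.path.join(path, file_name), encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]

        samples: List[Dict] = [
            {
                "category": category,
                "instruction": PROMPTS[category].format(
                    context=r["context"], input=r["input"]
                ),
                "target": r["answers"],  # list of acceptable answer strings
            }
            for r in records
        ]
        dataset["test"][category] = {
            "data": samples,
            "inference_kwargs": {
                "calculate_loss": True,
                "all_classes": records[0].get("all_classes") if records else None,
                "max_new_tokens": MAX_NEW_TOKENS[category],
            },
        }
    return dataset
```

The sample dictionaries here carry only three fields to keep the sketch short; the real class may store more per sample. A task file without an entry in the two lookup tables raises KeyError in this sketch, which makes a missing template visible early.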
Usage
Use this class when you need to evaluate a language model's long-context comprehension abilities within the ColossalEval framework. The data directory should contain JSONL files from the LongBench dataset.
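Before running an evaluation, it can help to confirm that the data directory holds the task files you expect. The helper below is a hypothetical pre-flight check, not part of ColossalEval; check_task_files and its parameters are names chosen for this sketch, and it assumes each task is stored as <task>.jsonl directly under the directory.

```python
import os
from typing import Iterable, List, Tuple


def check_task_files(path: str, expected_tasks: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (present, missing) task names based on <task>.jsonl files in `path`."""
    expected = list(expected_tasks)
    available = {
        name[: -len(".jsonl")]
        for name in os.listdir(path)
        if name.endswith(".jsonl")
    }
    present = [task for task in expected if task in available]
    missing = [task for task in expected if task not in available]
    return present, missing
```

For example, check_task_files("/path/to/longbench/data", ["narrativeqa", "hotpotqa"]) returns the two lists, so a missing download shows up before any model is loaded.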
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalEval/colossal_eval/dataset/longbench.py
- Lines: 1-121
Signature
class LongBenchDataset(BaseDataset):
    @staticmethod
    def load(path: str, logger: DistributedLogger, *args, **kwargs) -> List[Dict]:
Import
from colossal_eval.dataset.longbench import LongBenchDataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to the directory containing per-task JSONL files (e.g., narrativeqa.jsonl, hotpotqa.jsonl) |
| logger | DistributedLogger | Yes | Logger instance for distributed logging |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dict[str, Dict] | Nested dictionary with a "test" split keyed by task category. Each category holds "data" (a list of samples whose target is a list of answer strings) and "inference_kwargs" (calculate_loss=True, all_classes taken from the data, and max_new_tokens set per task type) |
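To make the nested output shape concrete, the following sketch walks a hand-written dictionary that mirrors the structure in the table. The summarize function and the example values are hypothetical; only the key names "test", "data", "inference_kwargs", and "max_new_tokens" come from the description above.

```python
from typing import Dict, List, Tuple


def summarize(dataset: Dict[str, Dict]) -> List[Tuple[str, int, int]]:
    """Return (category, sample count, max_new_tokens) for each task in the test split."""
    rows = []
    for category, entry in dataset["test"].items():
        rows.append(
            (category, len(entry["data"]), entry["inference_kwargs"]["max_new_tokens"])
        )
    return rows


# Hand-written example with one task and one sample.
example = {
    "test": {
        "hotpotqa": {
            "data": [{"target": ["Alice"]}],
            "inference_kwargs": {
                "calculate_loss": True,
                "all_classes": None,
                "max_new_tokens": 32,
            },
        }
    }
}

print(summarize(example))  # prints [('hotpotqa', 1, 32)]
```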
Usage Examples
from colossal_eval.dataset.longbench import LongBenchDataset
from colossalai.logging import DistributedLogger
logger = DistributedLogger("longbench")
dataset = LongBenchDataset(path="/path/to/longbench/data", logger=logger)
dataset.save("/path/to/output.json")