

Implementation:Hpcaitech ColossalAI LongBenchDataset

From Leeroopedia


Knowledge Sources
Domains: Evaluation, Benchmarking
Last Updated: 2026-02-09 00:00 GMT

Overview

LongBenchDataset is a dataset wrapper class that loads and converts the LongBench long-context understanding benchmark into the ColossalEval inference format, covering 21 task types including QA, summarization, few-shot learning, and code completion.

Description

The class extends BaseDataset and provides a static load method that reads JSONL files from a flat directory, one file per task category. Each task has its own prompt template in the dataset2prompt dictionary and a maximum generation length in dataset2maxlen, ranging from 32 tokens for QA tasks to 512 for summarization tasks. The module supports tasks in both English and Chinese, spanning single-document QA (NarrativeQA, Qasper), multi-document QA (HotpotQA, 2WikiMQA), summarization (GovReport, Multi-News, VCSUM), few-shot classification (TREC, LSHT), and code tasks (LCC, RepoBench-P). Files whose task name ends with "_e" (for example, hotpotqa_e.jsonl) are skipped as extended versions.
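The per-task lookup described above can be sketched as follows. This is a minimal illustration, not the module's code: the two dictionaries are hypothetical stand-ins for dataset2prompt and dataset2maxlen, the template wording and the {context}/{input} placeholder names are assumptions, and build_request is a helper invented for this example. Only the 32 and 512 token limits come from the range stated above.

```python
from typing import Dict

# Hypothetical stand-ins for dataset2prompt / dataset2maxlen.
# Template wording and placeholder names are illustrative assumptions.
dataset2prompt: Dict[str, str] = {
    "hotpotqa": "Passages:\n{context}\n\nQuestion: {input}\nAnswer:",
    "gov_report": "Report:\n{context}\n\nWrite a one-page summary.\nSummary:",
}
dataset2maxlen: Dict[str, int] = {"hotpotqa": 32, "gov_report": 512}


def build_request(category: str, sample: Dict[str, str]) -> Dict[str, object]:
    """Fill the task's template and attach its generation budget (hypothetical helper)."""
    # str.format ignores unused keyword arguments, so a template that only
    # uses {context} still works when the sample also carries "input".
    prompt = dataset2prompt[category].format(**sample)
    return {"instruction": prompt, "max_new_tokens": dataset2maxlen[category]}


request = build_request(
    "hotpotqa",
    {"context": "Paris is the capital of France.", "input": "What is the capital of France?"},
)
print(request["max_new_tokens"])
```

Keeping the template and the length limit in two dictionaries keyed by task name means a new task only needs one entry in each table, with no change to the loading logic.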

Usage

Use this class when you need to evaluate a language model's long-context comprehension abilities within the ColossalEval framework. The data directory should contain JSONL files from the LongBench dataset.
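Before pointing the class at a directory, it can help to check which task files would be picked up. The sketch below is an assumption-laden approximation of the directory scan, not the library's implementation: read_task_files is a hypothetical helper, and ignoring non-JSONL files is a choice made here for safety rather than documented behavior.

```python
import json
import os
from typing import Dict, List


def read_task_files(path: str) -> Dict[str, List[dict]]:
    """Map each task name to its parsed JSONL records, skipping "_e" variants."""
    tasks: Dict[str, List[dict]] = {}
    for file_name in sorted(os.listdir(path)):
        if not file_name.endswith(".jsonl"):
            continue  # assumption: ignore anything that is not a JSONL file
        category = file_name[: -len(".jsonl")]  # "hotpotqa.jsonl" -> "hotpotqa"
        if category.endswith("_e"):
            continue  # extended versions are skipped, as described above
        with open(os.path.join(path, file_name), encoding="utf-8") as handle:
            # One JSON object per non-empty line.
            tasks[category] = [json.loads(line) for line in handle if line.strip()]
    return tasks
```

Calling read_task_files on the data directory and printing sorted(result) gives a quick list of the task categories that a flat-directory loader of this kind would see.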

Code Reference

Source Location

Signature

class LongBenchDataset(BaseDataset):
    @staticmethod
    def load(path: str, logger: DistributedLogger, *args, **kwargs) -> List[Dict]:

Import

from colossal_eval.dataset.longbench import LongBenchDataset

I/O Contract

Inputs

Name | Type | Required | Description
path | str | Yes | Path to the directory containing per-task JSONL files (e.g., narrativeqa.jsonl, hotpotqa.jsonl)
logger | DistributedLogger | Yes | Logger instance for distributed logging

Outputs

Name | Type | Description
dataset | Dict[str, Dict] | A nested dictionary with a single split, "test", keyed by task category. Each category holds "data" (a list of samples whose target is a list of answer strings) and "inference_kwargs" (calculate_loss=True, all_classes taken from the data, and max_new_tokens set per task type). The signature annotates the return as List[Dict], but the value described here is a nested dictionary.
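The nesting described in the output table can be illustrated with a small hand-built value. This is only a sketch of the shape: the sample content is made up, all_classes is set to None for brevity, and summarize is a hypothetical helper rather than part of ColossalEval.

```python
from typing import Any, Dict, Tuple

# Illustrative value shaped like the described output; contents are made up.
dataset: Dict[str, Dict[str, Any]] = {
    "test": {
        "hotpotqa": {
            "data": [{"target": ["Paris"]}],
            "inference_kwargs": {
                "calculate_loss": True,
                "all_classes": None,
                "max_new_tokens": 32,
            },
        },
    }
}


def summarize(loaded: Dict[str, Dict[str, Any]]) -> Dict[str, Tuple[int, int]]:
    """Return (sample count, max_new_tokens) for each task category in the test split."""
    return {
        category: (len(entry["data"]), entry["inference_kwargs"]["max_new_tokens"])
        for category, entry in loaded["test"].items()
    }


print(summarize(dataset))
```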

Usage Examples

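from colossal_eval.dataset.longbench import LongBenchDataset
from colossalai.logging import DistributedLogger

# Logger used to report progress while loading
logger = DistributedLogger("longbench")
# Constructing the dataset loads the per-task JSONL files found under path
dataset = LongBenchDataset(path="/path/to/longbench/data", logger=logger)
# Write the converted dataset to a JSON file
dataset.save("/path/to/output.json")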
