Implementation:Sail sg LongSpec Benchmark Data Loader

Knowledge Sources	LongSpec
Domains	Evaluation, Benchmarking
Last Updated	2026-02-14 05:00 GMT

Overview

Concrete tool for loading LongBench JSONL data and AIME HuggingFace datasets with task-specific prompt formatting and context length filtering.

Description

The benchmark data loading is implemented inline in inference_long-bench.py and inference_qwq.py. It uses:

dataset2prompt dict mapping task names to prompt templates with {context} and {input} placeholders
json.loads() for JSONL file parsing (LongBench)
load_dataset("AI-MO/aimo-validation-aime") for AIME problems (QwQ)
Context length filtering that keeps only prompts longer than 1200 tokens and no longer than context_length, so samples fit within model capacity
Tokenization via AutoTokenizer with proper padding

This is a Pattern Doc — there is no single reusable class, but rather an inline pattern in evaluation scripts.
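The template lookup described above can be sketched as a small helper. This is a minimal illustration rather than the script's actual code: the templates are shortened, and build_prompt is a hypothetical name that does not appear in the repository.

```python
# Shortened, illustrative templates; the full strings live in the evaluation script.
dataset2prompt = {
    "gov_report": "Report:\n{context}\n\nSummary:",
    "qmsum": "Transcript:\n{context}\n\nQuery: {input}\n\nAnswer:",
    "lcc": "Please complete the code:\n{context}",
}


def build_prompt(task: str, item: dict) -> str:
    """Fill the task template from one record (hypothetical helper)."""
    template = dataset2prompt[task]
    # str.format ignores unused keyword arguments, so passing `input`
    # to a template without an {input} placeholder is harmless.
    return template.format(context=item["context"], input=item.get("input", ""))


summary_prompt = build_prompt("gov_report", {"context": "Budget text"})
query_prompt = build_prompt("qmsum", {"context": "Meeting notes", "input": "Who spoke?"})
# Braces inside the substituted value are not re-parsed by str.format,
# so source code in `context` passes through unchanged.
code_prompt = build_prompt("lcc", {"context": "def f(): return {1: 2}"})
```

Because one call signature serves every task, summarization records that lack an input field need no special-casing.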

Usage

Used at the beginning of each evaluation script after model loading and before inference dispatch.

Code Reference

Source Location

Repository: LongSpec
File (LongBench): longspec/test/inference_long-bench.py
Lines (templates): L8-39
Lines (data loading): L114-129
File (AIME): longspec/test/inference_qwq.py
Lines (data loading): L48-67

Signature

# Pattern Doc: Inline data loading patterns

# LongBench prompt templates:
dataset2prompt = {
    "gov_report": "You are given a report by a government agency. "
                  "Write a one-page summary of the report.\n\n"
                  "Report:\n{context}\n\nSummary:",
    "qmsum": "You are given a meeting transcript and a query. "
             "Answer the query.\n\nTranscript:\n{context}\n\n"
             "Query: {input}\n\nAnswer:",
    "multi_news": "You are given several news passages. "
                  "Write a summary.\n\n"
                  "Passages:\n{context}\n\nSummary:",
    "lcc": "Please complete the code:\n{context}",
    "repobench-p": "Please complete the code:\n{context}",
}

# LongBench data loading pattern:
data = []
with open(f"{data_path_prefix}/{task}.jsonl") as f:
    for line in f:
        item = json.loads(line)
        prompt = dataset2prompt[task].format(
            context=item["context"],
            input=item.get("input", ""),
        )
        tok_len = len(tokenizer(prompt).input_ids)
        if 1200 < tok_len <= context_length:
            data.append((prompt, tok_len, item))

# AIME data loading pattern:
dataset = load_dataset("AI-MO/aimo-validation-aime")
filtered = dataset["train"].filter(
    lambda doc: id_min <= int(doc["id"]) <= id_max
)
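The length window in the LongBench pattern (strictly greater than the lower bound, at most the upper bound) can be isolated into a function for testing. The sketch below assumes a pluggable token counter: filter_by_token_length and whitespace_len are hypothetical names, and the whitespace counter only stands in for a real tokenizer so the example runs without downloading a model.

```python
from typing import Callable, Iterable


def filter_by_token_length(
    prompts: Iterable[str],
    count_tokens: Callable[[str], int],
    min_exclusive: int,
    max_inclusive: int,
) -> list[tuple[str, int]]:
    """Keep prompts whose token count n satisfies min_exclusive < n <= max_inclusive."""
    kept = []
    for prompt in prompts:
        n = count_tokens(prompt)
        if min_exclusive < n <= max_inclusive:
            kept.append((prompt, n))
    return kept


def whitespace_len(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


prompts = ["a b", "a b c", "a b c d", "a b c d e"]
kept = filter_by_token_length(prompts, whitespace_len, min_exclusive=2, max_inclusive=4)
```

With a Hugging Face tokenizer, the counter would be something like lambda p: len(tokenizer(p).input_ids), with min_exclusive=1200 and max_inclusive=context_length to match the pattern above.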

Import

import json
from datasets import load_dataset
from transformers import AutoTokenizer

I/O Contract

Inputs

Name	Type	Required	Description
task	str	Yes (LongBench)	Task name: gov_report, qmsum, multi_news, lcc, repobench-p
data_path_prefix	str	Yes (LongBench)	Directory containing {task}.jsonl files
test_length	int	Yes (LongBench)	Context length multiplier for filtering
id_min / id_max	int	Yes (AIME)	Problem ID range for AIME filtering (default: 60-60)
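The id_min / id_max range check can be tried on in-memory records before touching the real dataset. This is a sketch under stated assumptions: the records below are invented placeholders, and in_id_range is a hypothetical helper name.

```python
def in_id_range(doc: dict, id_min: int, id_max: int) -> bool:
    """Inclusive range test; int() tolerates IDs stored as strings or ints."""
    return id_min <= int(doc["id"]) <= id_max


# Placeholder records (not real AIME problems).
records = [
    {"id": "59", "problem": "placeholder A"},
    {"id": 60, "problem": "placeholder B"},
    {"id": "61", "problem": "placeholder C"},
    {"id": 62, "problem": "placeholder D"},
]

selected = [doc for doc in records if in_id_range(doc, id_min=60, id_max=61)]
selected_ids = [int(doc["id"]) for doc in selected]
```

The same predicate can be passed to the datasets filter method, for example dataset["train"].filter(lambda doc: in_id_range(doc, id_min, id_max)). Setting id_min equal to id_max selects a single problem.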

Outputs

Name	Type	Description
input_ids	torch.Tensor	Tokenized prompt on CUDA (batch_size, seq_len)
prompt_length	int	Number of prompt tokens (for generation start position)
raw_data	List[Dict]	Original data records with context, input, and answers

Usage Examples

LongBench Data Loading

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
task = "gov_report"
data_path = "/data/longbench"
context_length = 32000

dataset2prompt = {
    "gov_report": "Summarize the report:\n{context}\n\nSummary:",
}

data = []
with open(f"{data_path}/{task}.jsonl") as f:
    for line in f:
        item = json.loads(line)
        prompt = dataset2prompt[task].format(
            context=item["context"],
            input=item.get("input", ""),
        )
        tok_len = len(tokenizer(prompt).input_ids)
        if 1200 < tok_len <= context_length:
            data.append((prompt, tok_len, item))

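# Tokenize the first retained sample for model input.
# data is empty when no prompt falls inside the length window, so check first.
if not data:
    raise ValueError(f"No {task} samples with 1200 < tokens <= {context_length}")
input_ids = tokenizer(data[0][0], return_tensors="pt").input_ids.cuda()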
prompt_length = input_ids.shape[1]
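The JSONL parsing step can be exercised without the benchmark files by writing a small temporary file. The sketch below assumes one JSON object per line; read_jsonl is a hypothetical helper, and the two records are invented placeholders that only mimic the context and answers fields.

```python
import json
import os
import tempfile


def read_jsonl(path: str) -> list[dict]:
    """Parse one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate trailing or stray empty lines
            records.append(json.loads(line))
    return records


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gov_report.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"context": "Report text", "answers": ["Short summary"]}) + "\n")
        f.write("\n")  # blank line that the reader should skip
        f.write(json.dumps({"context": "Another report", "answers": []}) + "\n")
    records = read_jsonl(path)
```

Skipping blank lines is a defensive choice in this sketch; the inline pattern above calls json.loads on every line and would raise on an empty one.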

AIME Data Loading

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")
dataset = load_dataset("AI-MO/aimo-validation-aime")
filtered = dataset["train"].filter(
    lambda doc: 60 <= int(doc["id"]) <= 89
)

# Format with Qwen2 chat template
for item in filtered:
    prompt = (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        f"<|im_start|>user\n{item['problem']}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
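The chat formatting in the loop above can be factored into a pure function so the prompt string is checkable without loading a tokenizer. This is a minimal sketch: build_chat_prompt is a hypothetical helper name, and the marker strings are copied from the example above rather than read from the tokenizer's own template.

```python
def build_chat_prompt(problem: str, system: str = "You are a helpful assistant.") -> str:
    """Wrap a problem statement in the im_start / im_end chat markers."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{problem}<|im_end|>\n"
        "<|im_start|>assistant\n"  # left open so the model writes the reply
    )


prompt = build_chat_prompt("What is 1 + 1?")
```

An alternative is the tokenizer's apply_chat_template method with add_generation_prompt=True, which builds the prompt from the template shipped with the model. Whether its output matches this hand-written string exactly is not confirmed here.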

Related Pages

Implements Principle

Principle:Sail_sg_LongSpec_Benchmark_Data_Preparation

Requires Environment

Environment:Sail_sg_LongSpec_Inference_Environment
