Implementation:Sail sg LongSpec Benchmark Data Loader
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Concrete pattern for loading LongBench JSONL data and the AIME HuggingFace dataset, with task-specific prompt formatting and context-length filtering.
Description
The benchmark data loading is implemented inline in inference_long-bench.py and inference_qwq.py. It uses:
- dataset2prompt dict mapping task names to prompt templates with {context} and {input} placeholders
- json.loads() for JSONL file parsing (LongBench)
- load_dataset("AI-MO/aimo-validation-aime") for AIME problems (QwQ)
- Context length filtering to ensure samples fit within model capacity
- Tokenization via AutoTokenizer with proper padding
This is a Pattern Doc: there is no single reusable class, only an inline pattern repeated in the evaluation scripts.
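As a minimal sketch of the template step listed above, the snippet below fills two of the templates with `str.format`. The helper name `build_prompt` and the sample records are assumptions made for illustration; they do not come from the scripts. It shows why passing `input=` is harmless for templates that only contain `{context}`, and why braces inside the substituted text (common in code-completion contexts) are not re-interpreted.

```python
# Two templates copied from the dataset2prompt excerpt on this page.
dataset2prompt = {
    "qmsum": (
        "You are given a meeting transcript and a query. "
        "Answer the query.\n\nTranscript:\n{context}\n\n"
        "Query: {input}\n\nAnswer:"
    ),
    "lcc": "Please complete the code:\n{context}",
}


def build_prompt(task: str, item: dict) -> str:
    """Fill a task template from one record (hypothetical helper name)."""
    # str.format ignores keyword arguments the template does not reference,
    # so passing input= is safe for templates that only use {context}.
    return dataset2prompt[task].format(
        context=item["context"],
        input=item.get("input", ""),
    )


# Made-up records, used only to exercise the templates.
qmsum_prompt = build_prompt("qmsum", {"context": "A: hi", "input": "Who spoke?"})
lcc_prompt = build_prompt("lcc", {"context": "d = {}"})
```

Only braces in the template itself are treated as placeholders, so a template that needs a literal brace would have to escape it as `{{` or `}}`.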
Usage
Used at the beginning of each evaluation script after model loading and before inference dispatch.
Code Reference
Source Location
- Repository: LongSpec
- File (LongBench): longspec/test/inference_long-bench.py
- Lines (templates): L8-39
- Lines (data loading): L114-129
- File (AIME): longspec/test/inference_qwq.py
- Lines (data loading): L48-67
Signature
# Pattern Doc: Inline data loading patterns

# LongBench prompt templates:
dataset2prompt = {
    "gov_report": "You are given a report by a government agency. "
                  "Write a one-page summary of the report.\n\n"
                  "Report:\n{context}\n\nSummary:",
    "qmsum": "You are given a meeting transcript and a query. "
             "Answer the query.\n\nTranscript:\n{context}\n\n"
             "Query: {input}\n\nAnswer:",
    "multi_news": "You are given several news passages. "
                  "Write a summary.\n\n"
                  "Passages:\n{context}\n\nSummary:",
    "lcc": "Please complete the code:\n{context}",
    "repobench-p": "Please complete the code:\n{context}",
}

# LongBench data loading pattern:
data = []
with open(f"{data_path_prefix}/{task}.jsonl") as f:
    for line in f:
        item = json.loads(line)
        prompt = dataset2prompt[task].format(
            context=item["context"],
            input=item.get("input", ""),
        )
        tok_len = len(tokenizer(prompt).input_ids)
        if 1200 < tok_len <= context_length:
            data.append((prompt, tok_len, item))

# AIME data loading pattern:
dataset = load_dataset("AI-MO/aimo-validation-aime")
filtered = dataset["train"].filter(
    lambda doc: id_min <= int(doc["id"]) <= id_max
)
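The LongBench loading pattern above can be exercised without a model download by injecting the token counter. The following is a sketch under stated assumptions, not the scripts' own code: `load_filtered`, `whitespace_count`, the `min_tokens`/`max_tokens` parameters, and the demo records are all hypothetical, and the whitespace counter only stands in for a real tokenizer.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def load_filtered(
    jsonl_path,
    template: str,
    count_tokens: Callable[[str], int],
    min_tokens: int,
    max_tokens: int,
) -> List[Tuple[str, int, dict]]:
    """Keep (prompt, tok_len, item) where min_tokens < tok_len <= max_tokens."""
    kept = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():  # tolerate blank lines in the JSONL file
                continue
            item = json.loads(line)
            prompt = template.format(
                context=item["context"], input=item.get("input", "")
            )
            tok_len = count_tokens(prompt)
            # Lower bound is exclusive, upper bound inclusive, as in the pattern.
            if min_tokens < tok_len <= max_tokens:
                kept.append((prompt, tok_len, item))
    return kept


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


# Demo file with made-up records whose contexts hold 2, 5 and 9 words.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_task.jsonl"
    demo_path.write_text(
        "\n".join(json.dumps({"context": " ".join(["w"] * n)}) for n in (2, 5, 9)),
        encoding="utf-8",
    )
    kept = load_filtered(
        demo_path, "Summarize: {context}", whitespace_count,
        min_tokens=3, max_tokens=8,
    )
```

With a real tokenizer, the counter would be something like `lambda p: len(tokenizer(p).input_ids)`, and the bounds would be `1200` and `context_length` to match the pattern shown above.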
Import
import json
from datasets import load_dataset
from transformers import AutoTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| task | str | Yes (LongBench) | Task name: gov_report, qmsum, multi_news, lcc, repobench-p |
| data_path_prefix | str | Yes (LongBench) | Directory containing {task}.jsonl files |
| test_length | int | Yes (LongBench) | Context length multiplier for filtering |
| id_min / id_max | int | Yes (AIME) | Problem ID range for AIME filtering (default: 60-60) |
Outputs
| Name | Type | Description |
|---|---|---|
| input_ids | torch.Tensor | Tokenized prompt on CUDA (batch_size, seq_len) |
| prompt_length | int | Number of prompt tokens (for generation start position) |
| raw_data | List[Dict] | Original data records with context, input, and answers |
Usage Examples
LongBench Data Loading
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
task = "gov_report"
data_path = "/data/longbench"
context_length = 32000

dataset2prompt = {
    "gov_report": "Summarize the report:\n{context}\n\nSummary:",
}

# Read one JSON record per line and keep prompts within the length window
data = []
with open(f"{data_path}/{task}.jsonl") as f:
    for line in f:
        item = json.loads(line)
        prompt = dataset2prompt[task].format(
            context=item["context"],
            input=item.get("input", ""),
        )
        tok_len = len(tokenizer(prompt).input_ids)
        if 1200 < tok_len <= context_length:
            data.append((prompt, tok_len, item))

# Tokenize the first kept prompt for model input (requires a CUDA device)
input_ids = tokenizer(data[0][0], return_tensors="pt").input_ids.cuda()
prompt_length = input_ids.shape[1]
AIME Data Loading
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")
dataset = load_dataset("AI-MO/aimo-validation-aime")

# Keep problems whose integer ID falls in the inclusive range 60-89
filtered = dataset["train"].filter(
    lambda doc: 60 <= int(doc["id"]) <= 89
)

# Format with Qwen2 chat template
for item in filtered:
    prompt = (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        f"<|im_start|>user\n{item['problem']}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    # Tokenize one problem at a time (requires a CUDA device)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
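The ID filter and prompt construction in the AIME example above can be checked in isolation on plain dictionaries. This is a small sketch, assuming hypothetical helper names (`in_id_range`, `build_chat_prompt`) and made-up rows; only the `id` and `problem` fields used in the example are relied on.

```python
def in_id_range(doc: dict, id_min: int, id_max: int) -> bool:
    """Same predicate as the filter lambda: inclusive bounds on the integer id."""
    return id_min <= int(doc["id"]) <= id_max


def build_chat_prompt(problem: str) -> str:
    """Wrap a problem in the chat markup used in the example above."""
    return (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        f"<|im_start|>user\n{problem}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# Made-up rows standing in for dataset records; the string id shows that
# int() makes the comparison work for either representation.
rows = [
    {"id": 59, "problem": "p59"},
    {"id": "60", "problem": "p60"},
    {"id": 89, "problem": "p89"},
    {"id": 90, "problem": "p90"},
]
selected = [r for r in rows if in_id_range(r, 60, 89)]
prompts = [build_chat_prompt(r["problem"]) for r in selected]
```

Tokenizers that ship a chat template can also build such a prompt through `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`. Whether the result matches the hand-written string exactly depends on the model's template, so that equivalence is not assumed here.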
Related Pages
Implements Principle
Requires Environment