Implementation:Sail sg LongSpec Benchmark Data Loader

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Benchmarking
Last Updated: 2026-02-14 05:00 GMT

Overview

A concrete pattern for loading LongBench JSONL files and the AIME dataset from the Hugging Face Hub, with task-specific prompt formatting and context-length filtering.

Description

Benchmark data loading is implemented inline in inference_long-bench.py (LongBench) and inference_qwq.py (AIME). It relies on:

  • dataset2prompt dict mapping task names to prompt templates with {context} and {input} placeholders
  • json.loads() for JSONL file parsing (LongBench)
  • load_dataset("AI-MO/aimo-validation-aime") for AIME problems (QwQ)
  • Context length filtering that keeps only prompts longer than 1200 tokens and no longer than context_length, so every sample fits within model capacity
  • Tokenization via AutoTokenizer with proper padding

This is a Pattern Doc: there is no single reusable class, only an inline pattern repeated in the evaluation scripts.
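
The template lookup described above can be sketched in isolation. This is a minimal illustration, not the scripts' code: the two templates are shortened, and build_prompt is a hypothetical helper name. It shows that passing input= to a template without an {input} placeholder is harmless, and that braces inside the substituted context are not re-interpreted by str.format (relevant for code-completion tasks such as lcc).

```python
# Shortened, illustrative templates; the real ones live in inference_long-bench.py.
dataset2prompt = {
    "gov_report": "Report:\n{context}\n\nSummary:",
    "qmsum": "Transcript:\n{context}\n\nQuery: {input}\n\nAnswer:",
}


def build_prompt(task: str, item: dict) -> str:
    """Fill the task's template from one record (hypothetical helper)."""
    return dataset2prompt[task].format(
        context=item["context"],
        # Summarization records may lack "input"; extra keyword arguments
        # that the template does not reference are ignored by str.format.
        input=item.get("input", ""),
    )


summary_prompt = build_prompt("gov_report", {"context": "Budget text."})
query_prompt = build_prompt("qmsum", {"context": "A: hello", "input": "Who spoke?"})
# Braces in the *value* are copied verbatim; only the template is parsed.
code_prompt = build_prompt("gov_report", {"context": "x = {1: 2}"})
```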

Usage

Used at the beginning of each evaluation script after model loading and before inference dispatch.

Code Reference

Source Location

  • Repository: LongSpec
  • File (LongBench): longspec/test/inference_long-bench.py
  • Lines (templates): L8-39
  • Lines (data loading): L114-129
  • File (AIME): longspec/test/inference_qwq.py
  • Lines (data loading): L48-67

Signature

# Pattern Doc: Inline data loading patterns

# LongBench prompt templates:
dataset2prompt = {
    "gov_report": "You are given a report by a government agency. "
                  "Write a one-page summary of the report.\n\n"
                  "Report:\n{context}\n\nSummary:",
    "qmsum": "You are given a meeting transcript and a query. "
             "Answer the query.\n\nTranscript:\n{context}\n\n"
             "Query: {input}\n\nAnswer:",
    "multi_news": "You are given several news passages. "
                  "Write a summary.\n\n"
                  "Passages:\n{context}\n\nSummary:",
    "lcc": "Please complete the code:\n{context}",
    "repobench-p": "Please complete the code:\n{context}",
}

# LongBench data loading pattern:
data = []
with open(f"{data_path_prefix}/{task}.jsonl") as f:
    for line in f:
        item = json.loads(line)
        prompt = dataset2prompt[task].format(
            context=item["context"],
            input=item.get("input", ""),
        )
        tok_len = len(tokenizer(prompt).input_ids)
        if 1200 < tok_len <= context_length:
            data.append((prompt, tok_len, item))

# AIME data loading pattern:
dataset = load_dataset("AI-MO/aimo-validation-aime")
filtered = dataset["train"].filter(
    lambda doc: id_min <= int(doc["id"]) <= id_max
)
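
The LongBench loading pattern above can be exercised without a model by swapping the tokenizer for a simple counting function. The sketch below is an illustration under stated assumptions: load_filtered and whitespace_count are hypothetical names, the whitespace counter stands in for len(tokenizer(prompt).input_ids), and the bounds 3 and 10 replace the 1200 and context_length used in the scripts.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple


def load_filtered(
    path: str,
    template: str,
    count_tokens: Callable[[str], int],
    min_tokens: int,
    max_tokens: int,
) -> List[Tuple[str, int, dict]]:
    """Read a JSONL file and keep prompts with min_tokens < length <= max_tokens."""
    kept = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines at the end of the file
            item = json.loads(line)
            prompt = template.format(
                context=item["context"], input=item.get("input", "")
            )
            n = count_tokens(prompt)
            if min_tokens < n <= max_tokens:
                kept.append((prompt, n, item))
    return kept


def whitespace_count(text: str) -> int:
    """Stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


records = [
    {"context": "a b"},                   # too short once formatted
    {"context": "a b c d e"},             # inside the window
    {"context": " ".join(["w"] * 20)},    # too long
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    kept = load_filtered(
        demo_path, "Doc: {context}", whitespace_count, min_tokens=3, max_tokens=10
    )
```

Note that the lower bound is exclusive and the upper bound inclusive, matching the comparison 1200 < tok_len <= context_length.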

Import

import json
from datasets import load_dataset
from transformers import AutoTokenizer

I/O Contract

Inputs

Name | Type | Required | Description
task | str | Yes (LongBench) | Task name: gov_report, qmsum, multi_news, lcc, repobench-p
data_path_prefix | str | Yes (LongBench) | Directory containing {task}.jsonl files
test_length | int | Yes (LongBench) | Context length multiplier for filtering (the snippets on this page compare against a token limit named context_length)
id_min / id_max | int | Yes (AIME) | Problem ID range for AIME filtering (default: 60-60)

Outputs

Name | Type | Description
input_ids | torch.Tensor | Tokenized prompt on the CUDA device, shape (batch_size, seq_len)
prompt_length | int | Number of prompt tokens (marks the position where generation starts)
raw_data | List[Dict] | Original data records with context, input, and answers
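
The role of prompt_length can be illustrated with plain Python lists. This sketch assumes the generation call returns the prompt tokens followed by the newly generated tokens in a single sequence, as Hugging Face generate does for decoder-only models; whether the LongSpec scripts return output in this shape is an assumption, and split_generation is a hypothetical helper.

```python
from typing import List


def split_generation(output_ids: List[int], prompt_length: int) -> List[int]:
    """Return only the tokens produced after the prompt (hypothetical helper)."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds the length of the output sequence")
    return output_ids[prompt_length:]


prompt_ids = [101, 7, 8, 9]          # made-up token ids for the prompt
output_ids = prompt_ids + [42, 43]   # made-up ids: prompt followed by new tokens
new_tokens = split_generation(output_ids, len(prompt_ids))
```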

Usage Examples

LongBench Data Loading

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
task = "gov_report"
data_path = "/data/longbench"
context_length = 32000

dataset2prompt = {
    "gov_report": "Summarize the report:\n{context}\n\nSummary:",
}

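data = []
# JSONL: one JSON record per line
with open(f"{data_path}/{task}.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        prompt = dataset2prompt[task].format(
            context=item["context"],
            input=item.get("input", ""),  # ignored by templates without {input}
        )
        tok_len = len(tokenizer(prompt).input_ids)
        # Keep prompts longer than 1200 tokens that still fit the context limit
        if 1200 < tok_len <= context_length:
            data.append((prompt, tok_len, item))

# Tokenize the first retained prompt for model input (requires a CUDA device)
if not data:
    raise ValueError(f"No {task} samples fall within the token-length window")
input_ids = tokenizer(data[0][0], return_tensors="pt").input_ids.cuda()
prompt_length = input_ids.shape[1]  # position where generated tokens begin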

AIME Data Loading

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")
dataset = load_dataset("AI-MO/aimo-validation-aime")
filtered = dataset["train"].filter(
    lambda doc: 60 <= int(doc["id"]) <= 89
)

# Format each problem with the Qwen2 (ChatML-style) chat template
for item in filtered:
    prompt = (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        f"<|im_start|>user\n{item['problem']}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    # input_ids is rebuilt for every problem; run inference inside this loop
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
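
The ID-range filter and prompt construction above can be checked without downloading the dataset or a tokenizer. The following is a minimal sketch: build_chatml_prompt and in_id_range are hypothetical helper names, and the records are made-up stand-ins that carry only the id and problem fields used on this page.

```python
def build_chatml_prompt(
    problem: str, system: str = "You are a helpful assistant."
) -> str:
    """Wrap one problem in the ChatML-style markers used in the example above."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{problem}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def in_id_range(doc: dict, id_min: int, id_max: int) -> bool:
    """Same predicate as the lambda passed to filter(); both bounds inclusive."""
    return id_min <= int(doc["id"]) <= id_max


# Made-up records; int() lets the predicate accept ids stored as int or str.
docs = [
    {"id": 59, "problem": "p59"},
    {"id": "60", "problem": "p60"},
    {"id": 89, "problem": "p89"},
    {"id": 90, "problem": "p90"},
]
selected = [d for d in docs if in_id_range(d, 60, 89)]
prompts = [build_chatml_prompt(d["problem"]) for d in selected]
```

Ending the prompt with the opening assistant marker leaves the model positioned to produce the answer turn.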

Related Pages

Implements Principle

Requires Environment
