Workflow:Sail sg LongSpec Long Context Evaluation

From Leeroopedia
Knowledge Sources
Domains: LLMs, Speculative_Decoding, Benchmarking
Last Updated: 2025-02-01 00:00 GMT

Overview

Systematic evaluation pipeline for benchmarking LongSpec speculative decoding performance across long-context NLP tasks (LongBench) and mathematical reasoning tasks (AIME) using multiple inference methods.

Description

This workflow evaluates the inference speed and quality of LongSpec's speculative decoding system across two benchmark suites:

LongBench: A suite of long-context tasks including government report summarization (GovReport), meeting summarization (QMSum), multi-document summarization (MultiNews), and code completion (LCC, RepoBench-P). Each task tests the system with inputs ranging from 1,200 to 262,000 tokens.

AIME: Mathematical reasoning benchmark using the AI-MO/aimo-validation-aime dataset, specifically testing the QwQ-32B-Preview model on AIME 2024 problems. These problems require long chain-of-thought reasoning (up to 20,000 generated tokens).

The evaluation compares throughput (tokens per second) and acceptance rate across different inference methods (vanilla, sequential, tree, MagicDec) to quantify the speedup provided by speculative decoding.

Usage

Execute this workflow when you need to benchmark a trained LongSpec model's inference performance, compare different speculative decoding methods, or validate that a newly trained draft model matches expected speedup metrics. This is typically run after completing the GLIDE Draft Model Training workflow, using the final checkpoint.

Execution Steps

Step 1: Benchmark_Data_Preparation

Prepare evaluation datasets for the target benchmark suite. For LongBench tasks, preprocess the raw data into JSONL format with one entry per line, each containing context, input, and answer fields. For AIME evaluation, the dataset is loaded directly from Hugging Face (AI-MO/aimo-validation-aime). Organize data files into an accessible directory structure.

Key considerations:

  • LongBench data must be preprocessed into .jsonl format before use
  • AIME data is loaded via the datasets library (no preprocessing needed)
  • LongBench tasks: gov_report, qmsum, multi_news, lcc, repobench-p
  • AIME questions with id 60-89 correspond to AIME 2024 problems
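
The data preparation described above can be sketched as follows. This is a minimal illustration, not the repository's loader: the function names `load_jsonl` and `select_aime_2024` are hypothetical, and the AIME filter assumes each row carries an integer-like `id` field with an inclusive 60-89 range.

```python
import json
from pathlib import Path

# Field names taken from the text above.
REQUIRED_FIELDS = ("context", "input", "answer")


def load_jsonl(path):
    """Read one JSON object per non-empty line and check the expected fields."""
    examples = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            missing = [name for name in REQUIRED_FIELDS if name not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            examples.append(record)
    return examples


def select_aime_2024(rows):
    """Keep rows with id in 60..89 inclusive (assumed to be the AIME 2024 subset)."""
    return [row for row in rows if 60 <= int(row["id"]) <= 89]
```

Validating the fields at load time surfaces a malformed preprocessing run before any GPU time is spent on generation.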

Step 2: Model_and_Tokenizer_Loading

Load the target LLM and its corresponding GLIDE draft model for evaluation. For LongBench tasks (Llama family), instantiate LlamaGlide with the appropriate target and draft model pair. For AIME tasks (Qwen2 family), instantiate Qwen2Glide with QwQ-32B-Preview as target. Configure model-specific token IDs and context length limits.

Key considerations:

  • Supported Llama models: Vicuna-7B/13B, LongChat-7B/13B, Llama-3-8B-262k
  • Supported Qwen2 models: QwQ-32B-Preview
  • Context length varies by model (16k to 262k tokens); 2,000 tokens are subtracted from it to leave headroom for generation
  • A single 80GB GPU is recommended to avoid out-of-memory issues

Step 3: Prompt_Formatting

Apply task-specific prompt templates to each evaluation example. LongBench tasks use Llama chat format with system/user/assistant delimiters and task-specific instructions (e.g., "Write a one-page summary" for GovReport). AIME tasks use Qwen2 im_start/im_end format. After formatting, tokenize prompts and filter by length constraints (minimum 1200 tokens for LongBench).

Key considerations:

  • Each LongBench task has a dedicated prompt template with context and optional input slots
  • Prompts exceeding the model's context length are excluded
  • Short prompts (below 1200 tokens) are filtered to focus on genuinely long-context evaluation
  • All prompts are pre-tokenized and moved to GPU before the generation loop
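
The formatting and length filtering above can be sketched as below. The template text and the helper names (`format_prompt`, `keep_prompt`, `filter_prompts`) are hypothetical stand-ins for the repository's task-specific templates, and treating both bounds as inclusive is an assumption. The 1,200-token minimum and the 2,000-token headroom come from the text.

```python
MIN_PROMPT_TOKENS = 1200     # from the text: shorter prompts are dropped
GENERATION_HEADROOM = 2000   # from the text: reserved for generated tokens

# Hypothetical template with a context slot and an optional input slot.
GOV_REPORT_TEMPLATE = (
    "Write a one-page summary of the following report.\n\n"
    "Report:\n{context}\n\nSummary:"
)


def format_prompt(example, template=GOV_REPORT_TEMPLATE):
    """Fill the template slots; the input slot is ignored by templates that lack it."""
    return template.format(
        context=example["context"],
        input=example.get("input", ""),
    )


def keep_prompt(num_tokens, context_len):
    """True when a prompt is long enough and still leaves room for generation."""
    return MIN_PROMPT_TOKENS <= num_tokens <= context_len - GENERATION_HEADROOM


def filter_prompts(token_id_lists, context_len):
    """Drop tokenized prompts that fall outside the allowed length window."""
    return [ids for ids in token_id_lists if keep_prompt(len(ids), context_len)]
```

For a model with a 16,384-token context window, this keeps prompts of 1,200 to 14,384 tokens.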

Step 4: Inference_Execution

Run generation across all prepared prompts using the selected inference method. For tree speculation (recommended), perform one warm-up generation to initialize CUDA kernels, then iterate through all test prompts. For each prompt, record the number of accepted tokens, verification rounds, elapsed wall-clock time, and the speculation acceptance mask. Results accumulate across the full test set.

Key considerations:

  • The tree method uses tree_shape=[4, 16, 16, 16, 16] by default
  • A warm-up pass is critical for accurate timing: it triggers Triton kernel compilation, which would otherwise be counted in the first measured prompt
  • Sequential method uses gamma parameter for draft length
  • Temperature=0 provides deterministic greedy decoding for consistent benchmarking
  • max_gen_len controls generation length (1024 for LongBench, 20000 for AIME)
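
The warm-up and timing loop above might look like the following sketch. `run_benchmark` and `generate_fn` are hypothetical names: `generate_fn` stands in for whichever inference method is selected and is assumed to return a pair `(accepted_tokens, verification_rounds)` for one prompt. Reusing the first prompt for warm-up is also an assumption.

```python
import time


def run_benchmark(generate_fn, prompts, warmup_prompt=None):
    """Run one untimed warm-up call, then time generation over every prompt."""
    if warmup_prompt is None and prompts:
        warmup_prompt = prompts[0]
    if warmup_prompt is not None:
        generate_fn(warmup_prompt)  # untimed: absorbs one-off kernel compilation

    totals = {"accepted_tokens": 0, "verification_rounds": 0, "elapsed": 0.0}
    for prompt in prompts:
        start = time.perf_counter()
        accepted, rounds = generate_fn(prompt)
        totals["elapsed"] += time.perf_counter() - start
        totals["accepted_tokens"] += accepted
        totals["verification_rounds"] += rounds
    return totals
```

When `generate_fn` launches GPU work with PyTorch, calling `torch.cuda.synchronize()` before each timer read ensures the measured interval covers kernels that are still running asynchronously.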

Step 5: Metrics_Collection

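Aggregate evaluation metrics across all test prompts. Key metrics include total tokens generated, total elapsed time, tokens-per-second throughput, and average acceptance rate. Total tokens generated is the sum of accepted draft tokens and verification rounds, since each round contributes one further token; the acceptance rate is this total divided by the number of verification rounds, i.e. the mean number of tokens produced per round. Results are printed to stdout and optionally saved to log files in the long-bench_results/ directory.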

Key considerations:

  • Throughput = (accepted_tokens + verification_rounds) / total_time
  • Acceptance rate = (accepted_tokens + verification_rounds) / verification_rounds
  • Results are written to output_{task}.txt for LongBench and output_aime.txt for AIME
  • Per-prompt decoded outputs can be inspected for quality verification
  • Compare results across methods (vanilla vs. seq vs. tree) to quantify speedup
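
The two formulas above can be written out directly. This is a small sketch with hypothetical function names (`summarize`, `speedup`); defining speedup as the ratio of a method's throughput to the vanilla throughput is an assumption about how the comparison is made.

```python
def summarize(accepted_tokens, verification_rounds, total_time):
    """Apply the throughput and acceptance-rate formulas listed above."""
    total_tokens = accepted_tokens + verification_rounds
    return {
        "total_tokens": total_tokens,
        "throughput": total_tokens / total_time,             # tokens per second
        "acceptance_rate": total_tokens / verification_rounds,
    }


def speedup(method_throughput, vanilla_throughput):
    """Throughput ratio of a speculative method over vanilla decoding."""
    return method_throughput / vanilla_throughput
```

With 300 accepted tokens over 100 verification rounds in 8 seconds, this gives 400 total tokens, a throughput of 50 tokens per second, and an acceptance rate of 4.0.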

GitHub URL

Workflow Repository