Workflow:Sail sg LongSpec Long Context Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Speculative_Decoding, Benchmarking |
| Last Updated | 2025-02-01 00:00 GMT |
Overview
Systematic evaluation pipeline for benchmarking LongSpec speculative decoding performance across long-context NLP tasks (LongBench) and mathematical reasoning tasks (AIME) using multiple inference methods.
Description
This workflow evaluates the inference speed and quality of LongSpec's speculative decoding system across two benchmark suites:
- LongBench: A suite of long-context tasks including government report summarization (GovReport), meeting summarization (QMSum), multi-document summarization (MultiNews), and code completion (LCC, RepoBench-P). Each task tests the system with inputs ranging from 1,200 to 262,000 tokens.
- AIME: A mathematical reasoning benchmark using the AI-MO/aimo-validation-aime dataset, specifically testing the QwQ-32B-Preview model on AIME 2024 problems. These problems require long chain-of-thought reasoning (up to 20,000 generated tokens).
The evaluation compares throughput (tokens per second) and acceptance rate across different inference methods (vanilla, sequential, tree, MagicDec) to quantify the speedup provided by speculative decoding.
Usage
Execute this workflow when you need to benchmark a trained LongSpec model's inference performance, compare different speculative decoding methods, or validate that a newly trained draft model matches expected speedup metrics. This is typically run after completing the GLIDE Draft Model Training workflow, using the final checkpoint.
Execution Steps
Step 1: Benchmark_Data_Preparation
Prepare evaluation datasets for the target benchmark suite. For LongBench tasks, preprocess the raw data into JSONL format with one entry per line, each containing context, input, and answer fields. For AIME evaluation, the dataset is loaded directly from Hugging Face (AI-MO/aimo-validation-aime). Organize data files into an accessible directory structure.
Key considerations:
- LongBench data must be preprocessed into .jsonl format before use
- AIME data is loaded via the datasets library (no preprocessing needed)
- LongBench tasks: gov_report, qmsum, multi_news, lcc, repobench-p
- AIME questions with id 60-89 correspond to AIME 2024 problems
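The data preparation above can be sketched as follows. This is a minimal illustration and not the repository's own loader: `load_jsonl` and `select_aime_2024` are hypothetical helper names, and the `id` field name is assumed from the "id 60-89" note. In the actual workflow the AIME rows come from the datasets library; here they are treated as plain dictionaries so the filter can be shown on its own.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"context", "input", "answer"}


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            # Every LongBench entry is expected to carry these three fields.
            missing = REQUIRED_FIELDS - row.keys()
            if missing:
                raise ValueError(f"{path}:{line_no} missing fields: {sorted(missing)}")
            rows.append(row)
    return rows


def select_aime_2024(rows, first_id=60, last_id=89):
    """Keep rows whose id lies in the AIME 2024 range (assumed inclusive)."""
    return [r for r in rows if first_id <= int(r["id"]) <= last_id]
```

Validating the fields at load time surfaces a malformed line with its line number before any GPU time is spent on generation.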
Step 2: Model_and_Tokenizer_Loading
Load the target LLM and its corresponding GLIDE draft model for evaluation. For LongBench tasks (Llama family), instantiate LlamaGlide with the appropriate target and draft model pair. For AIME tasks (Qwen2 family), instantiate Qwen2Glide with QwQ-32B-Preview as target. Configure model-specific token IDs and context length limits.
Key considerations:
- Supported Llama models: Vicuna-7B/13B, LongChat-7B/13B, Llama-3-8B-262k
- Supported Qwen2 models: QwQ-32B-Preview
- Context length varies by model (16k to 262k tokens, minus 2000 for generation headroom)
- A single 80GB GPU is recommended to avoid out-of-memory issues
Step 3: Prompt_Formatting
Apply task-specific prompt templates to each evaluation example. LongBench tasks use Llama chat format with system/user/assistant delimiters and task-specific instructions (e.g., "Write a one-page summary" for GovReport). AIME tasks use Qwen2 im_start/im_end format. After formatting, tokenize prompts and filter by length constraints (minimum 1200 tokens for LongBench).
Key considerations:
- Each LongBench task has a dedicated prompt template with context and optional input slots
- Prompts exceeding the model's context length are excluded
- Short prompts (below 1200 tokens) are filtered out to focus on genuinely long-context evaluation
- All prompts are pre-tokenized and moved to GPU before the generation loop
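A rough sketch of the template filling and length filtering described above. The template text is a hypothetical stand-in, not the repository's actual prompt, and `build_prompt` and `filter_by_length` are illustrative names. The sketch assumes the upper bound is the model's context length minus the 2000-token generation headroom, and that both bounds are inclusive.

```python
# Hypothetical template with the context and optional input slots.
EXAMPLE_TEMPLATE = (
    "You are given a report.\n\n{context}\n\n"
    "Write a one-page summary of the report.\n{input}"
)


def build_prompt(template, context, question=""):
    """Fill the context and optional input slots of a task template."""
    return template.format(context=context, input=question)


def filter_by_length(token_id_lists, context_len, min_len=1200, headroom=2000):
    """Keep prompts with min_len <= length <= context_len - headroom."""
    max_len = context_len - headroom
    return [ids for ids in token_id_lists if min_len <= len(ids) <= max_len]
```

For a model with a 16,000-token context, this keeps tokenized prompts between 1,200 and 14,000 tokens long.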
Step 4: Inference_Execution
Run generation across all prepared prompts using the selected inference method. For tree speculation (recommended), perform one warm-up generation to initialize CUDA kernels, then iterate through all test prompts. For each prompt, record the number of accepted tokens, verification rounds, elapsed wall-clock time, and the speculation acceptance mask. Results accumulate across the full test set.
Key considerations:
- The tree method uses tree_shape=[4, 16, 16, 16, 16] by default
- A warm-up pass is critical for accurate timing: it triggers Triton kernel compilation, which would otherwise be counted in the first measurement
- The sequential method uses the gamma parameter to set the draft length
- Temperature=0 provides deterministic greedy decoding for consistent benchmarking
- max_gen_len controls generation length (1024 for LongBench, 20000 for AIME)
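The warm-up and timing loop can be sketched like this. It is a simplified illustration under stated assumptions: `generate_fn` is a hypothetical callable returning `(accepted_tokens, verification_rounds)` for one prompt, not the LongSpec generation API, and `sync_fn` is an optional hook that on a GPU would typically be `torch.cuda.synchronize`, so that asynchronous kernels finish before the clock is read.

```python
import time


def run_benchmark(generate_fn, prompts, sync_fn=None):
    """Time generate_fn over all prompts after one untimed warm-up call."""
    if not prompts:
        raise ValueError("need at least one prompt")

    def now():
        # Wait for pending device work (if any) before reading the clock.
        if sync_fn is not None:
            sync_fn()
        return time.perf_counter()

    # Warm-up: its output and elapsed time are deliberately discarded.
    generate_fn(prompts[0])

    totals = {"accepted_tokens": 0, "verification_rounds": 0, "elapsed_s": 0.0}
    for prompt in prompts:
        start = now()
        accepted, rounds = generate_fn(prompt)
        totals["elapsed_s"] += now() - start
        totals["accepted_tokens"] += accepted
        totals["verification_rounds"] += rounds
    return totals
```

Only the generation call sits between the two clock reads, so tokenization and logging do not inflate the measured time.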
Step 5: Metrics_Collection
Aggregate evaluation metrics across all test prompts. Key metrics include total tokens generated, total elapsed time, tokens-per-second throughput, and average acceptance rate (the mean number of tokens produced per verification round, counting the accepted draft tokens plus one additional token for each round). Results are printed to stdout and optionally saved to log files in the long-bench_results/ directory.
Key considerations:
- Throughput = (accepted_tokens + verification_rounds) / total_time
- Acceptance rate = (accepted_tokens + verification_rounds) / verification_rounds
- Results are written to output_{task}.txt for LongBench and output_aime.txt for AIME
- Per-prompt decoded outputs can be inspected for quality verification
- Compare results across methods (vanilla vs. seq vs. tree) to quantify speedup
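The two formulas above, plus the speedup comparison, can be written out as a short sketch. The function names are illustrative, and the numbers in the usage note are made up for arithmetic only; they are not measured results.

```python
def throughput(accepted_tokens, verification_rounds, total_time_s):
    """Tokens per second: accepted draft tokens plus one token per round."""
    return (accepted_tokens + verification_rounds) / total_time_s


def acceptance_rate(accepted_tokens, verification_rounds):
    """Mean tokens produced per verification round."""
    return (accepted_tokens + verification_rounds) / verification_rounds


def speedup(method_tps, vanilla_tps):
    """Throughput of a speculative method relative to vanilla decoding."""
    return method_tps / vanilla_tps
```

With 300 accepted tokens over 100 verification rounds in 8 seconds, throughput is (300 + 100) / 8 = 50 tokens per second and the acceptance rate is (300 + 100) / 100 = 4 tokens per round. Against a hypothetical vanilla baseline of 20 tokens per second, the speedup would be 2.5x.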