Implementation:Sail sg LongSpec Throughput Calculator

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Performance_Analysis
Last Updated: 2026-02-14 05:00 GMT

Overview

Concrete tool for calculating and logging speculative decoding throughput, acceptance rates, and per-sample metrics from inference results.

Description

Metrics are computed inline in inference_long-bench.py and inference_qwq.py from the values returned by the generation methods, with warmup handling (the first sample is excluded from aggregates) and cumulative averaging across samples.

This is a Pattern Doc: the metrics are computed inline in the evaluation scripts rather than provided as a standalone utility.

Usage

Calculated automatically after each sample's generation call and aggregated at the end of evaluation.

Code Reference

Source Location

  • Repository: LongSpec
  • File (LongBench): longspec/test/inference_long-bench.py
  • Lines: L228-261
  • File (AIME): longspec/test/inference_qwq.py
  • Lines: L120-153

Signature

# Pattern Doc: Inline metrics calculation

# For tree/sequential speculative decoding:
output_ids, count, num, elapsed_time, spec_mask = model.tree_spec_generate(...)
throughput = num / elapsed_time  # tokens per second
acceptance_rate = count / num    # fraction of accepted drafts

# For vanilla generation:
output_ids, num, elapsed_time = model.vanilla_generate(...)
throughput = num / elapsed_time

# Warmup handling (first iteration excluded from aggregate):
if sample_idx == 0:
    continue  # Skip warmup

# Cumulative metrics:
total_tokens += num
total_time += elapsed_time
avg_throughput = total_tokens / total_time
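
For reuse outside these scripts, the same arithmetic can be wrapped in a small helper. This is a minimal sketch for illustration only; the SpecMetrics name and its fields are assumptions, not part of the LongSpec codebase, which inlines the logic as shown above.

from dataclasses import dataclass

@dataclass
class SpecMetrics:
    num_tokens: int    # total tokens processed by the target model (`num`)
    elapsed: float     # wall-clock generation time in seconds (`elapsed_time`)
    accepted: int = 0  # accepted draft tokens (`count`, spec methods only)

    @property
    def throughput(self) -> float:
        # tokens per second; guards against a zero-duration sample
        return self.num_tokens / self.elapsed if self.elapsed else 0.0

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.num_tokens if self.num_tokens else 0.0

# Example: 400 of 512 tokens accepted in 4.0 s -> 128.0 tok/s, 78.12% acceptance
m = SpecMetrics(num_tokens=512, elapsed=4.0, accepted=400)
print(f"{m.throughput:.1f} tok/s, accept={m.acceptance_rate:.2%}")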

Import

import time
import os
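
A sketch of how these imports plausibly serve the scripts (an assumption; the exact call sites are in the referenced line ranges): time for wall-clock measurement around the generation call, and os for creating the results directory before logging.

import os
import time

os.makedirs("./long-bench_results", exist_ok=True)  # assumed: ensure the log directory exists

start = time.time()
# ... generation call ...
elapsed_time = time.time() - start  # wall-clock seconds, matching the I/O contract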

I/O Contract

Inputs

Name          Type          Required            Description
output_ids    torch.Tensor  Yes                 Generated token sequence from generation method
count         int           Yes (spec methods)  Number of accepted draft tokens
num           int           Yes                 Total tokens processed by target model
elapsed_time  float         Yes                 Wall-clock generation time in seconds

Outputs

Name             Type   Description
throughput       float  Tokens per second (printed to stdout)
acceptance_rate  float  Accepted/total draft ratio (printed to stdout)
log_file         File   (AIME only) Metrics written to ./long-bench_results/output_aime.txt

Usage Examples

LongBench Metrics

# Inside evaluation loop (inference_long-bench.py):
total_tokens = 0
total_time = 0.0

for idx, (prompt, prompt_len, item) in enumerate(data):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

    output_ids, count, num, elapsed, mask = model.tree_spec_generate(
        input_ids, prompt_len, tree_shape=[4, 16, 16, 16, 16],
        max_gen_len=1024
    )

    if idx == 0:
        continue  # Warmup

    total_tokens += num
    total_time += elapsed

    print(f"Sample {idx}: {num} tokens, {elapsed:.2f}s, "
          f"{num/elapsed:.1f} tok/s, accept={count/num:.2%}")

print(f"Average: {total_tokens/total_time:.1f} tok/s")

Related Pages

Implements Principle
