Implementation: Sail-sg LongSpec Throughput Calculator
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Performance_Analysis |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Concrete tool for calculating and logging speculative decoding throughput, acceptance rates, and per-sample metrics from inference results.
Description
Inline metrics calculation in inference_long-bench.py and inference_qwq.py computes throughput from generation method return values. Includes warmup handling and cumulative averaging.
This is a Pattern Doc: the metrics are computed inline in the evaluation scripts rather than by a standalone module.
Usage
Metrics are calculated automatically after each sample's generation call and aggregated at the end of the evaluation run.
Code Reference
Source Location
- Repository: LongSpec
- File (LongBench): longspec/test/inference_long-bench.py
- Lines: L228-261
- File (AIME): longspec/test/inference_qwq.py
- Lines: L120-153
Signature
```python
# Pattern Doc: inline metrics calculation.

# For tree/sequential speculative decoding:
output_ids, count, num, elapsed_time, spec_mask = model.tree_spec_generate(...)
throughput = num / elapsed_time   # tokens per second
acceptance_rate = count / num     # fraction of accepted draft tokens

# For vanilla generation:
output_ids, num, elapsed_time = model.vanilla_generate(...)
throughput = num / elapsed_time

# Warmup handling (first iteration excluded from aggregates):
if sample_idx == 0:
    continue  # skip warmup sample

# Cumulative metrics:
total_tokens += num
total_time += elapsed_time
avg_throughput = total_tokens / total_time
```
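Note that the cumulative average is time-weighted (total tokens divided by total time), which is not the same as averaging per-sample throughputs; a small illustration with made-up numbers:

```python
samples = [(100, 2.0), (300, 1.0)]  # hypothetical (tokens, seconds) per sample

# Time-weighted average, as computed in the evaluation scripts:
time_weighted = sum(n for n, _ in samples) / sum(t for _, t in samples)  # ~133.3 tok/s

# Naive mean of per-sample rates overweights short samples:
mean_of_rates = sum(n / t for n, t in samples) / len(samples)            # 175.0 tok/s
```

The time-weighted form answers "how many tokens per second did the whole run produce", which is usually what throughput benchmarks report.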
Import
```python
import time
import os
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_ids | torch.Tensor | Yes | Generated token sequence from generation method |
| count | int | Yes (spec methods) | Number of accepted draft tokens |
| num | int | Yes | Total tokens processed by target model |
| elapsed_time | float | Yes | Wall-clock generation time in seconds |
Outputs
| Name | Type | Description |
|---|---|---|
| throughput | float | Tokens per second (printed to stdout) |
| acceptance_rate | float | Accepted/total draft ratio (printed to stdout) |
| log_file | File (AIME only) | Metrics written to ./long-bench_results/output_aime.txt |
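The per-sample outputs above can be derived from the generation return values with a small helper; a minimal sketch (the function name `spec_metrics` is hypothetical, not from LongSpec):

```python
def spec_metrics(count, num, elapsed_time):
    """Compute throughput and acceptance rate from one sample's counters."""
    throughput = num / elapsed_time   # tokens per second
    acceptance_rate = count / num     # fraction of accepted draft tokens
    return throughput, acceptance_rate

# Example with made-up counters:
tp, ar = spec_metrics(count=96, num=128, elapsed_time=2.0)
print(f"{tp:.1f} tok/s, accept={ar:.2%}")  # 64.0 tok/s, accept=75.00%
```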
Usage Examples
LongBench Metrics
```python
# Inside the evaluation loop (inference_long-bench.py):
total_tokens = 0
total_time = 0.0
for idx, (prompt, prompt_len, item) in enumerate(data):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    output_ids, count, num, elapsed, mask = model.tree_spec_generate(
        input_ids, prompt_len, tree_shape=[4, 16, 16, 16, 16],
        max_gen_len=1024,
    )
    if idx == 0:
        continue  # warmup
    total_tokens += num
    total_time += elapsed
    print(f"Sample {idx}: {num} tokens, {elapsed:.2f}s, "
          f"{num/elapsed:.1f} tok/s, accept={count/num:.2%}")
print(f"Average: {total_tokens/total_time:.1f} tok/s")
```
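The AIME script additionally logs metrics to the file listed in the I/O contract. A hedged sketch of that pattern, with made-up per-sample values standing in for the `tree_spec_generate` return values (the exact line format in inference_qwq.py may differ):

```python
import os

# Hypothetical per-sample counters; in practice these come from the generation call.
idx, num, count, elapsed = 1, 128, 96, 2.0

os.makedirs("./long-bench_results", exist_ok=True)
# Append one metrics line per sample.
with open("./long-bench_results/output_aime.txt", "a") as f:
    f.write(f"sample={idx} tokens={num} time={elapsed:.2f}s "
            f"throughput={num/elapsed:.1f}tok/s accept={count/num:.2%}\n")
```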
Related Pages
Implements Principle