Workflow:Romsto Speculative Decoding Interactive CLI Comparison

Knowledge Sources	Speculative-Decoding Fast Inference from Transformers via Speculative Decoding
Domains	LLMs, Inference_Optimization, Speculative_Decoding, Benchmarking
Last Updated	2026-02-14 04:30 GMT

Overview

End-to-end process for interactively comparing speculative decoding, n-gram assisted speculative decoding (NASD), and autoregressive baseline generation using the repository's CLI application.

Description

This workflow uses the infer.py CLI application to load a target model and a drafter model, initialize an n-gram storage, and run the three generation strategies back to back on the same prompt. The CLI provides slash commands for toggling generation modes, adjusting hyperparameters (gamma, generation length, sampling strategy), configuring the n-gram storage type, and enabling debug visualization. Results include throughput measurements and acceptance rates for direct comparison.

Goal: Interactively compare the speed and output quality of speculative decoding, NASD, and standard autoregressive generation on arbitrary prompts.

Scope: Covers launching the CLI, loading models, configuring generation parameters via slash commands, running comparison generation, and interpreting throughput and acceptance rate metrics.

Strategy: The CLI runs all enabled generation strategies sequentially on the same prompt with the same random seed, producing directly comparable results with timing measurements.

Usage

Execute this workflow when you want to benchmark and compare the different decoding strategies available in the repository. This is useful for evaluating acceptance rates and throughput gains of speculative decoding versus NASD versus baseline autoregressive decoding, tuning hyperparameters like gamma and sampling settings, or demonstrating the effect of n-gram storage configuration on NASD performance.

Execution Steps

Step 1: Install Dependencies

Set up the Python environment with all required libraries. The CLI application requires additional dependencies beyond the core generation functions, including the rich library for formatted console output and termcolor for colored text.

Key considerations:

All dependencies are listed in requirements.txt
A CUDA-capable GPU is strongly recommended for practical generation speeds
Sufficient GPU memory is needed for both target and drafter models simultaneously
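
Before launching the CLI, it can help to confirm that the libraries are importable. The following is a minimal sketch, not part of the repository: the helper name missing_packages is hypothetical, rich and termcolor are the packages named in this step, and torch and transformers are assumed to be among the entries in requirements.txt.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# rich and termcolor are named in this step; torch and transformers are assumptions.
REQUIRED = ["torch", "transformers", "rich", "termcolor"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported missing, reinstalling from requirements.txt is the likely fix.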

Step 2: Configure Model Selection

Set the target and drafter model names in the infer.py file. The default configuration uses Llama-3.2-3B-Instruct as the target and Llama-3.2-1B-Instruct as the drafter, both with int8 quantization. For encoder-decoder models, the generate method calls must also be changed to their encoder-decoder variants.

Key considerations:

Both models must share the same tokenizer
The drafter must have the same vocabulary size as the target
Quantization configuration can be adjusted or disabled per model
The end-of-turn token list may need updating for different model families
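
The tokenizer and vocabulary constraints above can be checked before loading the full models. This is a sketch under stated assumptions: check_vocab_compatible is a hypothetical helper, and it works on plain token-to-id dictionaries such as those returned by a Hugging Face tokenizer's get_vocab() method. It does not inspect the models' embedding sizes, which may also need to match.

```python
def check_vocab_compatible(target_vocab, drafter_vocab):
    """Compare two token-to-id mappings; return a list of problems (empty if none)."""
    problems = []
    if len(target_vocab) != len(drafter_vocab):
        problems.append(
            f"vocabulary size differs: {len(target_vocab)} vs {len(drafter_vocab)}"
        )
    # A token that is missing from the drafter or mapped to another id breaks
    # the assumption that draft token ids mean the same thing to the target.
    mismatched = [
        token for token, idx in target_vocab.items()
        if drafter_vocab.get(token) != idx
    ]
    if mismatched:
        problems.append(f"{len(mismatched)} tokens map to different ids")
    return problems
```

An empty result suggests that token ids proposed by the drafter can be scored directly by the target.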

Step 3: Launch CLI Application

Start the interactive CLI by running infer.py. The application loads both models and the tokenizer, initializes the default n-gram storage, and enters an interactive REPL loop. Model loading occurs at startup before the prompt appears.

Key considerations:

The --device argument controls which device to use (default: cuda)
Model loading may consume significant time and memory
The CLI displays the loaded model names at startup for confirmation
Chat mode is enabled by default for instruction-tuned models
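
The --device argument with its cuda default could be declared as follows. This is an illustrative sketch using argparse, not a copy of infer.py; the function name build_parser and the help text are assumptions.

```python
import argparse


def build_parser():
    """Build a parser exposing the --device option described in this step."""
    parser = argparse.ArgumentParser(description="Speculative decoding CLI (sketch)")
    parser.add_argument(
        "--device",
        type=str,
        default="cuda",  # default stated in this step
        help="Device on which to load the target and drafter models",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Using device: {args.device}")
```

With such a parser, passing --device cpu would override the default for machines without a GPU.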

Step 4: Configure Generation Parameters

Use slash commands to adjust generation hyperparameters before running prompts. Key commands include /gamma for draft count, /length for maximum generation length, /processor for sampling strategy selection, and toggle commands for enabling or disabling specific generation modes.

Available commands:

/gamma <value> sets the number of draft tokens proposed per speculative step
/length <value> sets the maximum token generation length
/processor <name> <args> selects sampling strategy (greedy, multinomial, topk, nucleus, topknucleus)
/speculative toggles speculative decoding on/off
/ngram toggles n-gram assisted generation on/off
/target toggles baseline autoregressive generation on/off
/drafter toggles drafter-only autoregressive generation on/off
/cache toggles KV-cache usage on/off
/chat toggles chat template application on/off
/debug toggles debug visualization of accepted/rejected drafts
/set_ngramstorage <basic/onelevel> <n> configures n-gram storage type and order
/top_k_filler <value> sets the top-k filler parameter for n-gram updates
/reset_in_between toggles whether n-gram storage resets between generations
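
To illustrate how slash commands of this kind can update generation settings, here is a minimal sketch covering the integer-valued and toggle commands listed above. The Settings class, the handle_command function, and all default values except chat mode being on are hypothetical; the actual infer.py implementation may be organized differently.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    # Placeholder defaults, not the CLI's real ones (chat=True follows Step 3).
    gamma: int = 4
    length: int = 50
    speculative: bool = True
    ngram: bool = True
    target: bool = True
    drafter: bool = False
    cache: bool = False
    chat: bool = True
    debug: bool = False


INT_COMMANDS = ("gamma", "length")
TOGGLE_COMMANDS = ("speculative", "ngram", "target", "drafter", "cache", "chat", "debug")


def handle_command(line, settings):
    """Apply one slash command to `settings`; return True if it was applied."""
    parts = line.strip().split()
    if not parts or not parts[0].startswith("/"):
        return False  # not a command, so the caller treats it as a prompt
    name, args = parts[0][1:], parts[1:]
    if name in INT_COMMANDS:
        if len(args) != 1:
            return False
        try:
            value = int(args[0])
        except ValueError:
            return False
        if value <= 0:
            return False
        setattr(settings, name, value)
        return True
    if name in TOGGLE_COMMANDS:
        # Toggle commands take no argument and flip the current state.
        setattr(settings, name, not getattr(settings, name))
        return True
    return False
```

In this sketch, any input that does not start with a slash is left for the caller to treat as a generation prompt.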

Step 5: Run Comparison Generation

Enter a text prompt at the CLI. The application applies the chat template (if enabled), tokenizes the input, and runs each enabled generation strategy sequentially with the same random seed (42). For each strategy, it displays the generated text, timing information, and throughput in tokens per second. Speculative decoding and NASD also report their acceptance rates.

What happens:

The prompt is tokenized and optionally chat-templated
If enabled, speculative decoding runs first and reports output, acceptance rate, and throughput
If enabled, NASD runs next with the same seed and reports its metrics
If enabled, target autoregressive baseline runs and reports throughput
If enabled, drafter autoregressive generation runs for reference
Throughput comparisons are displayed between methods
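
The sequence above can be sketched as a loop that reseeds the random number generator before each strategy and times the call. This is an illustration only: run_comparison and toy_sampler are hypothetical names, and Python's random module stands in for the framework-level seeding that the real CLI presumably performs.

```python
import random
import time


def run_comparison(strategies, prompt_tokens, seed=42):
    """Run each strategy on the same prompt with the same seed and time it."""
    results = {}
    for name, generate in strategies.items():
        random.seed(seed)  # reseed so every strategy sees the same random stream
        start = time.perf_counter()
        output = generate(prompt_tokens)
        elapsed = time.perf_counter() - start
        throughput = len(output) / elapsed if elapsed > 0 else float("inf")
        results[name] = {
            "output": output,
            "seconds": elapsed,
            "tokens_per_second": throughput,
        }
    return results


def toy_sampler(prompt_tokens, n=8):
    """Stand-in generator: returns n pseudo-random token ids."""
    return [random.randrange(100) for _ in range(n)]


if __name__ == "__main__":
    report = run_comparison({"first": toy_sampler, "second": toy_sampler}, [1, 2, 3])
    for name, stats in report.items():
        print(name, stats["output"], f"{stats['tokens_per_second']:.1f} tokens/s")
```

Because the seed is reset before every call, two strategies that consume randomness identically produce identical outputs, which is what makes the timings comparable.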

Step 6: Analyze Results

Interpret the comparison output to assess the effectiveness of each decoding strategy. Key metrics are throughput (tokens per second), acceptance rate (for speculative and NASD), and output quality (text should be similar or identical across methods under greedy decoding).

Key considerations:

Under greedy decoding, speculative decoding should produce output identical to baseline autoregressive generation
Under stochastic sampling such as nucleus sampling, outputs may differ between methods, but speculative decoding is designed to sample from the same distribution as the target model
Higher acceptance rates indicate better drafter-target alignment
Throughput gain depends on acceptance rate and the gamma hyperparameter
Debug mode (/debug) shows color-coded visualization of which draft tokens were accepted or rejected
Experiment with different gamma values to find the optimal setting for your model pair
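
The dependence of throughput on acceptance rate and gamma can be estimated with the closed-form expressions from Fast Inference from Transformers via Speculative Decoding. The sketch below assumes the paper's simplification that each draft token is accepted independently with probability alpha, and that cost_ratio is the time of one drafter forward pass divided by the time of one target forward pass. The function names are hypothetical, and the acceptance rate printed by the CLI may be defined differently from alpha.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target-model pass: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus one extra token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Expected wall-time improvement over plain autoregressive decoding."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


if __name__ == "__main__":
    # Sweep gamma for an assumed acceptance rate and drafter/target cost ratio.
    for gamma in range(1, 9):
        print(gamma, round(expected_speedup(alpha=0.8, gamma=gamma, cost_ratio=0.3), 3))
```

Sweeping gamma in this way gives a rough idea of where additional draft tokens stop paying for themselves, which can then be checked against measured throughput from the CLI.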
