Heuristic:Romsto Speculative Decoding Seed Fixing For Reproducibility

Knowledge Sources	Romsto Speculative-Decoding
Domains	LLMs, Debugging, Reproducibility
Last Updated	2026-02-14 04:30 GMT

Overview

Fix all random seeds (Python, NumPy, PyTorch CPU/CUDA) and enable cuDNN determinism before each generation call so that decoding strategies can be compared fairly.

Description

When comparing autoregressive, speculative, and N-gram assisted speculative decoding (NASD) generation strategies, stochastic sampling (multinomial, top-k, nucleus) introduces randomness that can cause outputs to diverge between methods. To make output differences attributable to the decoding algorithm rather than to random sampling, the CLI resets all random seeds to the same value (42) before each generation call. The same helper also switches cuDNN to deterministic mode, at a slight performance cost.

Usage

Use this heuristic when benchmarking or comparing generation strategies. If you are comparing throughput or output quality across autoregressive, speculative, and NASD decoding, always fix seeds before each call. This matters most with non-greedy sampling (multinomial, nucleus, top-k), where token choices depend on the random number generator state.

The Insight (Rule of Thumb)

Action: Call the seed-fixing function before each generation method invocation.
Seed sources to fix: Python `random`, NumPy `np.random`, PyTorch `torch.manual_seed`, CUDA `torch.cuda.manual_seed_all`, cuDNN determinism flags.
Default seed: 42 (hardcoded in the CLI).
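Trade-off: `torch.backends.cudnn.deterministic = True` and `torch.backends.cudnn.benchmark = False` may slightly reduce GPU throughput (up to ~5%). In exchange, cuDNN selects deterministic algorithms, which improves run-to-run reproducibility on the same hardware and software stack. These flags alone do not guarantee bit-exact results for every PyTorch operation, platform, or release.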
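
The rule above can be sketched as a standalone helper. This is a minimal illustration, not the repository's code: the name `set_seed` is hypothetical, and the optional imports are an assumption made so the sketch also runs where NumPy or PyTorch is not installed.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Reset every RNG this process may use to the same starting state.

    Hypothetical helper for illustration; NumPy and PyTorch are treated as
    optional so the sketch degrades gracefully when they are missing.
    """
    # Python's built-in generator.
    random.seed(seed)

    # NumPy's legacy global generator, if NumPy is available.
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    # PyTorch CPU and CUDA generators plus the cuDNN flags, if available.
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    # Silently ignored when CUDA is not available.
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def draw(n: int = 3) -> list:
    """Draw n floats from Python's global generator (stand-in for sampling)."""
    return [random.random() for _ in range(n)]
```

Calling `set_seed(42)` immediately before each `draw()` yields the same numbers every time, which is the property the comparison relies on.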

Reasoning

The CLI calls `_set_seed(42)` before every generation method in `infer.py:281`, `infer.py:305`, `infer.py:334`, and `infer.py:356`:

def _set_seed(self, seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

This ensures that when comparing speculative vs. autoregressive vs. NASD outputs for the same prompt, the random number generator state is identical at the start of each generation. Without this, multinomial sampling would produce different token choices in each method, making it impossible to attribute output differences to the algorithm itself.
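
A toy sketch of this "reseed before every method" pattern, using only Python's standard library. Everything here is an assumption for illustration: `toy_generate` stands in for a real generation method, `run_all` is a hypothetical harness, and the vocabulary and weights are made up.

```python
import random
from typing import Callable, Dict, List

VOCAB = ["the", "cat", "sat", "down"]  # made-up vocabulary
WEIGHTS = [0.4, 0.3, 0.2, 0.1]         # made-up sampling weights


def toy_generate(num_tokens: int = 8) -> List[str]:
    """Stand-in for a generation method that samples tokens multinomially."""
    return random.choices(VOCAB, weights=WEIGHTS, k=num_tokens)


def run_all(
    strategies: Dict[str, Callable[[], List[str]]],
    seed: int = 42,
    reseed: bool = True,
) -> Dict[str, List[str]]:
    """Run each strategy, optionally resetting the RNG before every call."""
    outputs = {}
    random.seed(seed)  # seed once at the start in either mode
    for name, generate in strategies.items():
        if reseed:
            # Every strategy starts from the identical RNG state.
            random.seed(seed)
        outputs[name] = generate()
    return outputs


if __name__ == "__main__":
    strategies = {"first": toy_generate, "second": toy_generate}
    print(run_all(strategies, reseed=True))   # both entries match
    print(run_all(strategies, reseed=False))  # second continues the stream
```

With `reseed=True`, two runs of the same sampler return identical tokens. With `reseed=False`, the second call continues from wherever the first left the generator, so its tokens will typically differ even though the algorithm is unchanged.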

Note from `README.md`:

"The difference of text is due to the pseudo-random estimation of computers."

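This acknowledges that, even with seed fixing, speculative decoding may produce different text than autoregressive decoding. Both methods start from the same random number generator state, but speculative decoding consumes random numbers differently (sampling draft tokens, running the accept/reject test, and resampling after a rejection), so the same seed yields a different sequence of sampled tokens.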
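
The effect can be illustrated with a deliberately simplified toy, which is not the repository's implementation. The assumptions: fixed, context-independent target and draft distributions (`P_TARGET`, `Q_DRAFT`), one draft token per step, and a hypothetical `CountingRNG` wrapper that records how many uniform draws each method uses.

```python
import random
from typing import List

P_TARGET = [0.5, 0.3, 0.2]  # toy target-model distribution (assumption)
Q_DRAFT = [0.2, 0.3, 0.5]   # toy draft-model distribution (assumption)


class CountingRNG:
    """Seeded generator that counts how many uniform draws were consumed."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)
        self.draws = 0

    def uniform(self) -> float:
        self.draws += 1
        return self._rng.random()


def sample(probs: List[float], rng: CountingRNG) -> int:
    """Inverse-CDF sampling of one token index using exactly one draw."""
    u = rng.uniform()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off


def autoregressive(n: int, rng: CountingRNG) -> List[int]:
    """Sample n tokens directly from the target: one draw per token."""
    return [sample(P_TARGET, rng) for _ in range(n)]


def speculative(n: int, rng: CountingRNG) -> List[int]:
    """Draft from Q_DRAFT, then accept or resample (rejection sampling)."""
    out = []
    while len(out) < n:
        x = sample(Q_DRAFT, rng)  # draw 1: draft token
        accept_prob = min(1.0, P_TARGET[x] / Q_DRAFT[x])
        if rng.uniform() < accept_prob:  # draw 2: acceptance test
            out.append(x)
            continue
        # Rejected: resample from the normalized residual max(0, p - q).
        residual = [max(0.0, p - q) for p, q in zip(P_TARGET, Q_DRAFT)]
        total = sum(residual)
        out.append(sample([r / total for r in residual], rng))  # draw 3
    return out


if __name__ == "__main__":
    ar_rng, sd_rng = CountingRNG(42), CountingRNG(42)
    print(autoregressive(10, ar_rng), ar_rng.draws)
    print(speculative(10, sd_rng), sd_rng.draws)
```

Both generators start from seed 42, yet the autoregressive sampler uses one draw per token while the speculative sampler uses two or three. The random stream is therefore mapped to tokens differently, so identical seeds do not imply identical text, even though both methods sample from the same target distribution.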

Related Pages

Implementation:Romsto_Speculative_Decoding_InferenceCLI
