Heuristic:Romsto Speculative Decoding Seed Fixing For Reproducibility
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Debugging, Reproducibility |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
Fix all random seeds (Python, NumPy, PyTorch CPU/CUDA) and set the cuDNN determinism flags before each generation call, so that output and throughput comparisons between decoding strategies start from the same random state.
Description
When comparing autoregressive decoding, speculative decoding, and N-gram assisted speculative decoding (NASD), stochastic sampling (multinomial, top-k, nucleus) introduces randomness that can make outputs diverge between methods. So that each method starts from the same random state, and output differences can be traced to the decoding algorithm rather than to sampling noise, the CLI resets all random seeds to the same value (42) before each generation call. The same function also puts cuDNN in deterministic mode, at a slight performance cost.
Usage
Use this heuristic when benchmarking or comparing generation strategies. If you are comparing throughput or output quality across autoregressive vs. speculative vs. NASD, always fix seeds before each call. This is especially important with non-greedy sampling (multinomial, nucleus, top-k).
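A minimal sketch of such a benchmark loop is shown here. The names `set_seed`, `compare`, and `toy_sampler` are hypothetical and not taken from the repository; `toy_sampler` stands in for a real generation call, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random
import time

# Optional dependencies: seed them only if they are installed.
try:
    import numpy as np
except ImportError:
    np = None
try:
    import torch
except ImportError:
    torch = None


def set_seed(seed: int = 42) -> None:
    """Hypothetical helper: reset every RNG available in this environment."""
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)


def compare(strategies, prompt, seed=42):
    """Run each strategy from the same RNG state; record tokens and tokens/s."""
    results = {}
    for name, generate in strategies.items():
        set_seed(seed)  # reseed before *each* call, not once up front
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rate = len(tokens) / elapsed if elapsed > 0 else float("inf")
        results[name] = {"tokens": tokens, "tok_per_s": rate}
    return results


def toy_sampler(prompt, n=8):
    """Stand-in for a real generate() call; samples from Python's global RNG."""
    return [random.choice("abcd") for _ in range(n)]


results = compare({"run_a": toy_sampler, "run_b": toy_sampler}, "hello")
print(results["run_a"]["tokens"] == results["run_b"]["tokens"])
```

Because the two entries use the same sampler and the seed is reset before each one, their token lists match. Seeding only once before the loop would let the second run start from an advanced RNG state.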
The Insight (Rule of Thumb)
- Action: Call the seed-fixing function before each generation method invocation.
- Seed sources to fix: Python `random`, NumPy `np.random`, PyTorch `torch.manual_seed`, CUDA `torch.cuda.manual_seed_all`, cuDNN determinism flags.
- Default seed: 42 (hardcoded in the CLI).
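- Trade-off: `torch.backends.cudnn.deterministic = True` and `torch.backends.cudnn.benchmark = False` may slightly reduce GPU throughput (roughly up to ~5%, depending on model and hardware) in exchange for deterministic cuDNN algorithm selection. They do not by themselves guarantee bit-exact results, because other GPU operations can still be nondeterministic.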
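The cuDNN flags in the trade-off above only cover cuDNN kernels. If stricter run-to-run determinism is wanted, PyTorch also provides `torch.use_deterministic_algorithms`. The sketch below is an optional extension, not something the CLI does; `enable_strict_determinism` is a hypothetical helper name, and the code assumes PyTorch 1.11 or later for the `warn_only` argument.

```python
import os

# cuBLAS needs this workspace setting for deterministic behavior on CUDA;
# it should be set before the first CUDA call in the process.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

try:
    import torch
except ImportError:
    torch = None  # allows the sketch to run where PyTorch is not installed


def enable_strict_determinism() -> bool:
    """Hypothetical helper: returns True if strict mode could be enabled."""
    if torch is None:
        return False
    # Same two flags the CLI sets.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warn (rather than raise) when an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
    return True


strict = enable_strict_determinism()
```

Strict mode can cost more throughput than the cuDNN flags alone, so it is probably best reserved for debugging runs rather than throughput measurements.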
Reasoning
The CLI calls `_set_seed(42)` immediately before each generation method, at `infer.py:281`, `infer.py:305`, `infer.py:334`, and `infer.py:356`:
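    def _set_seed(self, seed: int):
        random.seed(seed)                          # Python's built-in RNG
        np.random.seed(seed)                       # NumPy global RNG
        torch.manual_seed(seed)                    # PyTorch default generator
        torch.cuda.manual_seed(seed)               # current CUDA device
        torch.cuda.manual_seed_all(seed)           # all CUDA devices
        torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels
        torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning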
This ensures that when comparing speculative vs. autoregressive vs. NASD outputs for the same prompt, the random number generator state is identical at the start of each generation. Without this, multinomial sampling would produce different token choices in each method, making it impossible to attribute output differences to the algorithm itself.
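The effect on the generator state can be checked directly. The sketch below uses Python's `random` module as a stand-in for PyTorch's multinomial sampling; the vocabulary and weights are made up for illustration.

```python
import random


def sample_tokens(n=5):
    """Stand-in for multinomial sampling over a made-up 4-token vocabulary."""
    return random.choices(["a", "b", "c", "d"], weights=[0.4, 0.3, 0.2, 0.1], k=n)


random.seed(42)
state_before_first = random.getstate()
first = sample_tokens()

# Without reseeding, the next call continues from an advanced RNG state.
state_without_reseed = random.getstate()
second_without_reseed = sample_tokens()

# With reseeding, the next call starts from exactly the same state as `first`.
random.seed(42)
state_after_reseed = random.getstate()
second_reseeded = sample_tokens()

print(state_after_reseed == state_before_first)    # same starting state
print(state_without_reseed == state_before_first)  # state has moved on
print(first == second_reseeded)
```

The same reasoning applies to `np.random.seed` and `torch.manual_seed`: each call resets its generator to a state that depends only on the seed.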
Note from `README.md`:
"The difference of text is due to the pseudo-random estimation of computers."
This acknowledges that even with seed fixing, speculative decoding may produce different outputs than autoregressive decoding. Drafting and the accept/reject step of rejection sampling draw additional random numbers, so the two methods consume the same seeded random stream in a different order and can select different tokens.
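A toy illustration of this effect follows. It is a sketch under strong simplifying assumptions, not the repository's implementation: the target and draft "models" are fixed, context-free distributions (`P_TARGET` and `Q_DRAFT`, both invented here), and the speculative loop follows the standard accept/reject rule with a residual distribution.

```python
import random

P_TARGET = [0.4, 0.3, 0.2, 0.1]     # toy target-model distribution (assumed)
Q_DRAFT = [0.25, 0.25, 0.25, 0.25]  # toy draft-model distribution (assumed)


def sample(rng, weights):
    """Draw one token index; weights need not be normalized."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def autoregressive(n_tokens, seed=42):
    """One random draw per token, all from the target distribution."""
    rng = random.Random(seed)
    return [sample(rng, P_TARGET) for _ in range(n_tokens)]


def speculative(n_tokens, gamma=3, seed=42):
    """Draft gamma tokens, accept/reject each, resample on rejection."""
    rng = random.Random(seed)
    # Residual distribution max(0, p - q), used to resample after a rejection.
    residual = [max(0.0, p - q) for p, q in zip(P_TARGET, Q_DRAFT)]
    out = []
    while len(out) < n_tokens:
        rejected = False
        for _ in range(gamma):
            x = sample(rng, Q_DRAFT)  # draft proposal (extra draw)
            # Acceptance test (another extra draw).
            if rng.random() < min(1.0, P_TARGET[x] / Q_DRAFT[x]):
                out.append(x)
            else:
                out.append(sample(rng, residual))
                rejected = True
                break
        if not rejected:
            out.append(sample(rng, P_TARGET))  # bonus token after full acceptance
    return out[:n_tokens]


print(autoregressive(10))
print(speculative(10))
```

Both functions start from seed 42 and both sample from the same target distribution, but the speculative loop spends random draws on proposals and acceptance tests, so the two sequences are not expected to match token for token. Each function on its own remains reproducible across runs.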