Heuristic:Romsto Speculative Decoding Filler Top K Tuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization, Speculative_Decoding |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
The `filler_top_k` parameter sets how many of the target model's most probable tokens are written to the N-gram storage at each position during NASD generation, trading memory for better draft coverage.
Description
During N-gram Assisted Speculative Decoding (NASD), the N-gram storage is updated with newly observed tokens after each verification step. The `filler_top_k` parameter determines whether only the single accepted token is stored (`filler_top_k=1`) or whether the top-k most probable tokens from the target model's output distribution are stored as well (`filler_top_k > 1`). Setting `filler_top_k=1` reproduces the NAPD approach (Ou et al., 2024), while higher values enrich the storage with alternative continuations.
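To make the difference between the two settings concrete, here is a minimal pure-Python sketch of the idea. `ToyNGramStorage`, `top_k_indices`, `fill`, and the toy distributions are hypothetical stand-ins invented for illustration; they are not the repository's classes, which operate on tensors.

```python
from collections import defaultdict


class ToyNGramStorage:
    """Hypothetical stand-in: maps a context tuple to candidate next tokens."""

    def __init__(self, n=2):
        self.n = n  # number of context tokens used as the lookup key
        self.table = defaultdict(set)

    def update(self, prefix, tokens):
        # Key on the last n tokens of the prefix; skip prefixes that are too short.
        if len(prefix) < self.n:
            return
        self.table[tuple(prefix[-self.n:])].update(tokens)


def top_k_indices(probs, k):
    """Indices of the k largest probabilities, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def fill(storage, prefix, accepted, dists, filler_top_k):
    """Store each accepted token and, if filler_top_k > 1, the target's top-k too."""
    seq = list(prefix)
    for token, probs in zip(accepted, dists):
        storage.update(seq, [token])
        if filler_top_k > 1:
            storage.update(seq, top_k_indices(probs, filler_top_k))
        seq.append(token)


# Toy vocabulary of 5 tokens; one target distribution per accepted position.
prefix = [0, 1]
accepted = [2, 3]
dists = [
    [0.05, 0.10, 0.50, 0.30, 0.05],  # distribution at the step where token 2 was accepted
    [0.10, 0.05, 0.05, 0.45, 0.35],  # distribution at the step where token 3 was accepted
]

minimal = ToyNGramStorage()
fill(minimal, prefix, accepted, dists, filler_top_k=1)

enriched = ToyNGramStorage()
fill(enriched, prefix, accepted, dists, filler_top_k=3)

print(dict(minimal.table))   # one candidate per context
print(dict(enriched.table))  # up to three candidates per context
```

With `filler_top_k=1` each context holds only the token that was actually accepted; with `filler_top_k=3` the same contexts also hold the runner-up tokens, which is the extra coverage the parameter buys.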
Usage
Use this heuristic when tuning NASD performance. Increase `filler_top_k` if the N-gram storage has poor coverage (many unknown contexts) and you want to speculatively populate it with likely alternatives. Decrease it to 1 to minimize memory usage or to reproduce NAPD results.
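One way to quantify "poor coverage" is the fraction of lookups whose context has no entry in the storage. The sketch below assumes the storage can be viewed as a mapping from context tuples to candidate tokens; `unknown_context_rate` and that dictionary layout are illustrative assumptions, not part of the repository.

```python
def unknown_context_rate(table, tokens, n):
    """Fraction of length-n contexts in `tokens` that have no entry in `table`."""
    contexts = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not contexts:
        return 0.0
    misses = sum(1 for ctx in contexts if ctx not in table)
    return misses / len(contexts)


# Hypothetical storage contents keyed by 2-token contexts.
table = {(0, 1): {2}, (1, 2): {3}}
rate = unknown_context_rate(table, [0, 1, 2, 3, 4], n=2)
print(f"unknown contexts: {rate:.0%}")
```

A rate that stays high over a representative prompt set suggests trying a larger `filler_top_k`; a low rate suggests the extra memory is buying little.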
The Insight (Rule of Thumb)
- Action: Set `filler_top_k` via the `/top_k_filler <value>` CLI command or the `filler_top_k` parameter in `ngram_assisted_speculative_generate()`.
- Default Value: `filler_top_k=3` in `infer.py:38`.
- Trade-off: Higher values populate the storage faster (more n-gram entries) but consume more memory. Lower values (especially 1) keep memory minimal but may result in more "unknown" contexts.
- NAPD reproduction: Set `filler_top_k=1` (`/top_k_filler 1` in the CLI) to match the approach of the NAPD paper. The README refers to this setting as `top_k_filler`.
Reasoning
From `README.md`:
"To reproduce their results, you can use the NASD implementation by simply setting `top_k_filler=1`."
The update logic in `ngram_assisted/ngram_assisted.py:149-155` shows how `filler_top_k` enriches the storage:
```python
# Accepted draft positions: store the accepted token, then optionally the target's top-k.
for i in range(n):
    ngramstorage.update(input_ids[..., :current_position + i], input_ids[..., current_position + i].unsqueeze(0))
    if filler_top_k > 1:
        ngramstorage.update(input_ids[..., :current_position + i], p[..., i, :].topk(filler_top_k).indices)
# Final token x that follows the n accepted drafts: same treatment.
ngramstorage.update(input_ids[..., :current_position + n], x)
if filler_top_k > 1:
    ngramstorage.update(input_ids[..., :current_position + n], p_p.topk(filler_top_k).indices)
```
When `filler_top_k > 1`, the storage receives at each accepted position not only the accepted token but also the top-k most probable tokens from the target's distribution. The storage thus learns about plausible alternatives that were not taken, which can improve future draft quality. The cost is additional dictionary entries in memory, growing roughly in proportion to `filler_top_k * tokens_generated * n_levels`.
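The proportionality above can be turned into a quick back-of-the-envelope comparison between settings. The helper below is a hypothetical sketch that evaluates the product exactly as written; it assumes every stored token creates a distinct entry, so it is an upper-bound style estimate, and real counts will usually be lower because repeated contexts and tokens collapse into existing entries.

```python
def estimate_entries(filler_top_k, tokens_generated, n_levels):
    """Scale estimate for storage entries: filler_top_k * tokens_generated * n_levels.

    Hypothetical helper; assumes no duplicate entries, so it overestimates.
    """
    if filler_top_k < 1 or tokens_generated < 0 or n_levels < 1:
        raise ValueError("expected filler_top_k >= 1, tokens_generated >= 0, n_levels >= 1")
    return filler_top_k * tokens_generated * n_levels


# Illustrative numbers only: 100 generated tokens, 4 n-gram levels.
napd_like = estimate_entries(1, 100, 4)
default_like = estimate_entries(3, 100, 4)
print(napd_like, default_like, default_like / napd_like)
```

Under these assumptions, moving from `filler_top_k=1` to the default of 3 roughly triples the estimate, which is a useful ceiling to keep in mind when memory is tight.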