Heuristic:Romsto Speculative Decoding Shared Tokenizer Requirement

Knowledge Sources	Romsto Speculative-Decoding; Fast Inference from Transformers via Speculative Decoding
Domains	LLMs, Speculative_Decoding
Last Updated	2026-02-14 04:30 GMT

Overview

Target and drafter models must share the same tokenizer and produce logits over the same vocabulary size for speculative decoding to produce correct results.

Description

Speculative decoding compares the probability distributions of the drafter and target models token-by-token. For this comparison to be valid, both models must operate over the same vocabulary. If the drafter uses a different tokenizer (different vocab, different subword splits), the token IDs will not align, and the rejection sampling step will compare unrelated probabilities, producing garbage output. Additionally, both models must output logits of the same shape (vocabulary size).
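The misalignment described above can be illustrated with a minimal sketch. The two vocabularies below are hypothetical toy examples, not taken from any real tokenizer; they only show how the same token IDs decode to different text when the ID-to-token mappings differ.

```python
# Two hypothetical toy vocabularies (assumption: real tokenizers are far larger,
# but the failure mode is the same).
target_vocab = {"<bos>": 0, "the": 1, "cat": 2, "sat": 3}
drafter_vocab = {"<bos>": 0, "cat": 1, "sat": 2, "the": 3}


def decode(ids, vocab):
    """Map token IDs back to token strings using the given vocabulary."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    return [id_to_token[i] for i in ids]


# The drafter proposes "the cat sat", expressed as IDs in its own vocabulary.
draft_ids = [drafter_vocab[tok] for tok in ["the", "cat", "sat"]]

# The same IDs mean something else once the target interprets them.
as_seen_by_drafter = decode(draft_ids, drafter_vocab)
as_seen_by_target = decode(draft_ids, target_vocab)

print("drafter meant:", as_seen_by_drafter)
print("target reads: ", as_seen_by_target)
```

Both vocabularies here have the same size, so no shape error would occur; the mismatch is purely semantic, which is why a size check alone is not enough.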

Usage

Use this heuristic when selecting a drafter model for speculative decoding. Always verify that the drafter shares the exact same tokenizer as the target. Model families that share tokenizers (e.g., Llama-3.2-1B and Llama-3.2-3B) are natural pairings. Cross-family pairings (e.g., GPT-2 drafter with Llama target) will not work.

The Insight (Rule of Thumb)

Action: Ensure the drafter model uses the same tokenizer as the target model. Both must have identical `vocab_size`.
Best practice: Use models from the same family (e.g., Llama-3.2-1B + Llama-3.2-3B, or Gemma-2B + Gemma-7B).
Validation: Load the tokenizer from the target model name only, and use it for both models.
Trade-off: This constraint limits drafter selection to models that share the target's tokenizer, in practice usually the same model family, which may not always have a sufficiently small variant available.
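The rules above could be turned into an explicit pre-flight check. The following is a sketch only: `check_drafter_compatibility` is a hypothetical helper, not part of the repository. It assumes the caller supplies each tokenizer's token-to-ID mapping (with Hugging Face tokenizers this would typically come from `tokenizer.get_vocab()`) and each model's `config.vocab_size`.

```python
def check_drafter_compatibility(target_vocab, drafter_vocab,
                                target_vocab_size, drafter_vocab_size):
    """Return a list of problems; an empty list means the pair looks compatible.

    Hypothetical helper. target_vocab / drafter_vocab are dicts mapping
    token string -> token ID; the *_vocab_size values are the logits widths.
    """
    problems = []

    # Rule 1: both models must emit logits of the same width.
    if target_vocab_size != drafter_vocab_size:
        problems.append(
            f"vocab_size mismatch: target={target_vocab_size}, "
            f"drafter={drafter_vocab_size}"
        )

    # Rule 2: every token must map to the same ID in both tokenizers.
    if target_vocab != drafter_vocab:
        differing = sum(
            1 for tok, idx in target_vocab.items()
            if drafter_vocab.get(tok) != idx
        )
        extra = len(set(drafter_vocab) - set(target_vocab))
        problems.append(
            f"tokenizer mismatch: {differing} target tokens missing or remapped "
            f"in drafter, {extra} drafter-only tokens"
        )

    return problems


# Toy usage with hypothetical vocabularies.
same = {"<bos>": 0, "a": 1, "b": 2}
other = {"<bos>": 0, "b": 1, "c": 2, "d": 3}
print(check_drafter_compatibility(same, dict(same), 3, 3))
print(check_drafter_compatibility(same, other, 3, 4))
```

Returning a list of problems rather than raising on the first one lets a caller report both the size mismatch and the mapping mismatch in a single message.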

Reasoning

From `README.md` (requirements for speculative decoding):

"The drafter model must share the same tokenizer as the target model." "The target model and the drafter model should output same shape logits."

The code loads a single tokenizer from the target model in `infer.py:97-100`:

tokenizer_name = target_model
if tokenizer_name != target_model:
    print(colored("Warning: Tokenizer is different from target model. Use with caution.", "red"))
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)

The `vocabulary_size` is taken from the target model config and used to allocate the draft probability tensor in `sampling/speculative_decoding.py:73-107`:

vocabulary_size = target.config.vocab_size
q = torch.zeros((1, corrected_gamma, vocabulary_size), device=target.device)

If the drafter's `vocab_size` differs, the probability tensors `p` and `q` will have mismatched dimensions, causing either a runtime error or silent corruption when the rejection sampling step computes the ratio `p / q`.
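To make the role of the ratio concrete, here is a simplified pure-Python sketch of the standard acceptance rule from speculative sampling, where a drafted token `x` is accepted with probability `min(1, p[x] / q[x])`. It is not the repository's implementation: `acceptance_probability` and `accept_draft_token` are hypothetical names, and plain lists stand in for the tensors. Following the text above, `q` is assumed to hold the drafter's probabilities and `p` the target's.

```python
import random


def acceptance_probability(p, q, token_id):
    """Probability of accepting a drafted token: min(1, p[x] / q[x]).

    p: target probabilities over the vocabulary (list of floats).
    q: drafter probabilities over the vocabulary (list of floats).
    """
    # Explicit guard: the ratio is only meaningful if both distributions
    # are defined over the same vocabulary.
    if len(p) != len(q):
        raise ValueError(
            f"vocabulary size mismatch: len(p)={len(p)}, len(q)={len(q)}"
        )
    if q[token_id] <= 0.0:
        raise ValueError("drafter assigned zero probability to its own draft")
    return min(1.0, p[token_id] / q[token_id])


def accept_draft_token(p, q, token_id, rng=random):
    """Sample the accept/reject decision for one drafted token."""
    return rng.random() < acceptance_probability(p, q, token_id)


# Toy distributions over a hypothetical 3-token vocabulary.
p = [0.5, 0.3, 0.2]    # target
q = [0.25, 0.25, 0.5]  # drafter

print(acceptance_probability(p, q, 0))  # target likes token 0 more than the drafter
print(acceptance_probability(p, q, 2))  # drafter over-proposes token 2
```

The length check catches a size mismatch early, but it cannot detect two equally sized vocabularies with different token mappings; that case still requires comparing the tokenizers themselves.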
