Principle:Ggml org Llama cpp Speculative Generation
| Field | Value |
|---|---|
| Principle Name | Speculative Generation |
| Workflow | Speculative_Decoding |
| Step | 5 of 5 (CORE) |
| Domain | Draft-Then-Verify Token Generation |
| Scope | Theory of speculative decoding: draft-then-verify parallel token generation |
Overview
Description
Speculative generation is the core runtime loop of speculative decoding. It implements the draft-then-verify paradigm: in each step, the draft mechanism proposes multiple candidate tokens, the target model evaluates them all in a single batched forward pass, and accepted tokens are emitted while rejected tokens trigger a fallback to the target model's own prediction. This process repeats until the generation is complete.
This is the step where the theoretical speedup of speculative decoding is realized. By converting what would be k sequential target model forward passes into 1 draft step + 1 verification step, the system generates multiple tokens per target model evaluation.
Usage
Speculative generation is the inner loop of any speculative decoding application. It consists of three phases per iteration:
- Draft phase: Generate k candidate tokens using the configured draft mechanism
- Verify phase: Evaluate all k candidates plus one additional position with the target model
- Accept phase: Compare draft tokens against target model samples, accept matching prefix, inform draft mechanism of acceptance count
Theoretical Basis
The Draft-Then-Verify Algorithm:
repeat:
1. DRAFT: Generate k candidate tokens [t_1, t_2, ..., t_k] using draft mechanism
2. VERIFY: Run target model on [t_1, ..., t_k] in a single forward pass
to get target distributions p_target(x | context, t_1, ..., t_i) for i = 0..k
3. ACCEPT: For i = 1 to k:
- Sample token s_i from target distribution at position i
- If s_i == t_i: accept t_i, continue
- If s_i != t_i: reject t_i, emit s_i as the correct token, break
4. If all k tokens accepted, sample one more token from the target distribution
at position k+1
5. Inform draft mechanism of n_accepted tokens
6. Update context with accepted tokens
Acceptance Probability:
Under the standard speculative decoding formulation (Leviathan et al., 2023), for each draft token t at position i, the acceptance probability is:
P(accept t_i) = min(1, p_target(t_i) / p_draft(t_i))
In the greedy variant used by llama.cpp's common_sampler_sample_and_accept_n, acceptance is simpler: a draft token is accepted if and only if the target model's sampled token at that position matches the draft token:
accept t_i if and only if sample(p_target(x | context, t_1, ..., t_{i-1})) == t_i
Expected Tokens Per Step:
If the acceptance rate per token is p (assumed independent), the expected number of accepted tokens from k draft tokens is:
E[accepted] = sum_{j=0}^{k-1} p^j * (1-p) * j + p^k * k
= (1 - p^k) / (1 - p) (approximate, for p < 1)
Plus one additional token from the target model's own sampling, giving:
E[tokens_per_step] = (1 - p^(k+1)) / (1 - p)
Expected Speedup:
The speedup factor compared to standard autoregressive generation is:
speedup = E[tokens_per_step] / (1 + k * T_draft / T_target)
Where T_draft is the time for one draft step (generating k tokens) and T_target is the time for one target model forward pass (verifying k+1 positions).
Practical Considerations:
- The p_min parameter (default 0.75) sets a threshold below which speculative decoding may be skipped
- The p_split parameter (default 0.1) controls tree-branching in more advanced speculation schemes
- N-gram strategies have effectively zero draft cost (T_draft ~= 0) but lower acceptance rates
- Draft model strategies have higher acceptance rates but non-trivial T_draft
Implementation flow in llama.cpp:
- common_speculative_draft() iterates through the strategy chain and returns draft tokens from the first successful implementation
- The target model evaluates the draft tokens and the caller compares via common_sampler_sample_and_accept_n()
- common_speculative_accept() informs the winning implementation of how many tokens were accepted, allowing it to update internal state (n-gram maps, KV cache, statistics)