Principle:Ggml org Llama cpp Speculative Generation

Field	Value
Principle Name	Speculative Generation
Workflow	Speculative_Decoding
Step	5 of 5 (CORE)
Domain	Draft-Then-Verify Token Generation
Scope	Theory of speculative decoding: draft-then-verify parallel token generation

Overview

Description

Speculative generation is the core runtime loop of speculative decoding. It implements the draft-then-verify paradigm: in each step, the draft mechanism proposes multiple candidate tokens, the target model evaluates them all in a single batched forward pass, and accepted tokens are emitted while rejected tokens trigger a fallback to the target model's own prediction. This process repeats until the generation is complete.

This is the step where the theoretical speedup of speculative decoding is realized. By converting what would be k sequential target model forward passes into 1 draft step + 1 verification step, the system generates multiple tokens per target model evaluation.

Usage

Speculative generation is the inner loop of any speculative decoding application. It consists of three phases per iteration:

Draft phase: Generate k candidate tokens using the configured draft mechanism
Verify phase: Evaluate all k candidates plus one additional position with the target model
Accept phase: Compare draft tokens against target model samples, accept matching prefix, inform draft mechanism of acceptance count

Theoretical Basis

The Draft-Then-Verify Algorithm:

repeat:
    1. DRAFT: Generate k candidate tokens [t_1, t_2, ..., t_k] using draft mechanism
    2. VERIFY: Run target model on [t_1, ..., t_k] in a single forward pass
       to get target distributions p_target(x | context, t_1, ..., t_i) for i = 0..k
    3. ACCEPT: For i = 1 to k:
       - Sample token s_i from target distribution at position i
       - If s_i == t_i: accept t_i, continue
       - If s_i != t_i: reject t_i, emit s_i as the correct token, break
    4. If all k tokens accepted, sample one more token from the target distribution
       at position k+1
    5. Inform draft mechanism of n_accepted tokens
    6. Update context with accepted tokens

Acceptance Probability:

Under the standard speculative decoding formulation (Leviathan et al., 2023), for each draft token t at position i, the acceptance probability is:

P(accept t_i) = min(1, p_target(t_i) / p_draft(t_i))

In the greedy variant used by llama.cpp's common_sampler_sample_and_accept_n, acceptance is simpler: a draft token is accepted if and only if the target model's sampled token at that position matches the draft token:

accept t_i if and only if sample(p_target(x | context, t_1, ..., t_{i-1})) == t_i

Expected Tokens Per Step:

If the acceptance rate per token is p (assumed independent), the expected number of accepted tokens from k draft tokens is:

E[accepted] = sum_{j=0}^{k-1} p^j * (1-p) * j + p^k * k
            = (1 - p^k) / (1 - p)  (approximate, for p < 1)

Plus one additional token from the target model's own sampling, giving:

E[tokens_per_step] = (1 - p^(k+1)) / (1 - p)

Expected Speedup:

The speedup factor compared to standard autoregressive generation is:

speedup = E[tokens_per_step] / (1 + k * T_draft / T_target)

Where T_draft is the time for one draft step (generating k tokens) and T_target is the time for one target model forward pass (verifying k+1 positions).

Practical Considerations:

The p_min parameter (default 0.75) sets a threshold below which speculative decoding may be skipped
The p_split parameter (default 0.1) controls tree-branching in more advanced speculation schemes
N-gram strategies have effectively zero draft cost (T_draft ~= 0) but lower acceptance rates
Draft model strategies have higher acceptance rates but non-trivial T_draft

Implementation flow in llama.cpp:

common_speculative_draft() iterates through the strategy chain and returns draft tokens from the first successful implementation
The target model evaluates the draft tokens and the caller compares via common_sampler_sample_and_accept_n()
common_speculative_accept() informs the winning implementation of how many tokens were accepted, allowing it to update internal state (n-gram maps, KV cache, statistics)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment