Workflow:Ggml org Llama cpp Speculative Decoding

Knowledge Sources	llama.cpp Speculative Decoding Docs, Speculative Simple Example
Domains	LLMs, Inference, Performance_Optimization
Last Updated	2026-02-14 22:00 GMT

Overview

End-to-end process for accelerating language model inference using speculative decoding, where a smaller draft model proposes candidate tokens that are verified in parallel by the larger target model.

Description

This workflow implements speculative decoding to speed up text generation without changing the output distribution. A small, fast "draft" model predicts several tokens ahead, and the larger "target" model then verifies all of them in a single forward pass. When the drafted tokens agree with what the target model would have produced (which happens frequently for predictable text), several tokens are accepted per decode step instead of one, so the cost of each target forward pass is spread over more output tokens. llama.cpp supports multiple speculation strategies: draft model, self-speculation with n-gram caching, n-gram map lookup, and prompt lookup decoding.

Usage

Execute this workflow when you need faster token generation from a large model and have access to a compatible smaller draft model or are willing to use n-gram-based self-speculation. Speculative decoding is most effective when generating long sequences, when the draft model is architecturally compatible with the target, and when the text being generated has predictable patterns.

Execution Steps

Step 1: Select Speculation Strategy

Choose the appropriate speculative decoding strategy based on available resources and the use case. The main options are draft-model speculation (requires a small compatible model), n-gram self-speculation (no extra model needed), and prompt lookup (uses patterns from the prompt itself).

Strategy options:

Draft model: Best speedup, requires separate small model from same family
N-gram cache: No extra model needed, builds cache during generation
N-gram map: Pre-built n-gram lookup from corpus
Prompt lookup: Uses n-gram patterns found in the input prompt
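
As an illustration of the prompt lookup idea, the sketch below proposes draft tokens by finding an earlier occurrence of the most recent n-gram and copying the tokens that followed it. The function name, parameters, and defaults are hypothetical and are not llama.cpp's implementation; it is only meant to show why this strategy needs no second model.

```python
from typing import List, Sequence


def prompt_lookup_draft(tokens: Sequence[int],
                        ngram_size: int = 3,
                        n_draft: int = 8) -> List[int]:
    """Propose up to n_draft tokens by matching the trailing n-gram
    against earlier positions in the same token sequence.

    Hypothetical helper for illustration only.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan backwards so the most recent earlier match wins; the starting
    # index excludes the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + n_draft])
    return []  # no match: nothing to speculate on this step
```

When the returned list is empty, the step simply falls back to ordinary one-token decoding, which is why this strategy tends to help most on inputs that repeat themselves (code edits, summarization, retrieval-augmented prompts).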

Step 2: Load Target Model

Load the main (large) target model that produces the final output. This is the model whose output quality you want to preserve while accelerating generation speed.

Key considerations:

The target model determines output quality and distribution
GPU layer offloading maximizes the speed benefit of batch verification
Context size must accommodate the prompt, the generated content, and the draft tokens held in context while they are being verified

Step 3: Load Draft Model

Load the smaller draft model that will propose candidate tokens. For n-gram strategies, initialize the n-gram cache or map instead of loading a second model.

Key considerations:

Draft model should be from the same model family for high acceptance rates
Draft model should be significantly smaller (e.g., 1B draft for 70B target)
Both models must share the same (or a compatible) vocabulary, because draft token IDs are passed to the target model for verification as-is
N-gram approaches require no additional model but have lower acceptance rates
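
The vocabulary requirement can be checked before any generation starts. The following is a minimal sketch, assuming each vocabulary is available as a list of token strings indexed by token ID. The function name and both thresholds are illustrative assumptions, not values taken from the source.

```python
from typing import Sequence


def vocabs_compatible(target_vocab: Sequence[str],
                      draft_vocab: Sequence[str],
                      max_size_diff: int = 128,
                      check_start: int = 5) -> bool:
    """Return True if the draft vocabulary looks usable with the target.

    Hypothetical check for illustration:
    - max_size_diff tolerates a few extra padding/special tokens
    - check_start skips the first IDs, where special tokens often differ
    """
    if abs(len(target_vocab) - len(draft_vocab)) > max_size_diff:
        return False
    shared = min(len(target_vocab), len(draft_vocab))
    # Every shared ID past the skipped prefix must map to the same text.
    return all(target_vocab[i] == draft_vocab[i]
               for i in range(check_start, shared))
```

Failing this kind of check early is cheaper than discovering a mismatch through a near-zero acceptance rate at runtime.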

Step 4: Configure Speculation Parameters

Set the speculation parameters including the number of draft tokens per step, minimum acceptance probability threshold, and any strategy-specific settings.

Key considerations:

Draft count (n_draft) controls how many tokens to speculate ahead (typical: 8-16)
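Higher draft counts raise the potential speedup, but tokens later in a draft are less likely to be accepted, so more draft work is wasted
Minimum probability threshold (p_min) cuts a draft short once the draft model's confidence in its next token falls below the threshold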
The optimal settings depend on the model pair and text domain
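
How n_draft and p_min interact can be sketched as a drafting loop that stops either when the draft budget is used up or when the draft model becomes unsure. The draft-model interface here (a function from context tokens to a token-to-probability mapping) and the toy model are assumptions made for illustration, not llama.cpp's API.

```python
from typing import Callable, Dict, List

# Hypothetical interface: context tokens -> {token_id: probability}
DraftProbs = Callable[[List[int]], Dict[int, float]]


def generate_draft(draft_probs: DraftProbs,
                   context: List[int],
                   n_draft: int = 8,
                   p_min: float = 0.75) -> List[int]:
    """Greedily draft up to n_draft tokens, stopping early at low confidence."""
    draft: List[int] = []
    ctx = list(context)
    for _ in range(n_draft):
        probs = draft_probs(ctx)
        token, p = max(probs.items(), key=lambda kv: kv[1])
        if p < p_min:
            break  # not confident enough; hand over to the target model
        draft.append(token)
        ctx.append(token)
    return draft


def toy_draft(ctx: List[int]) -> Dict[int, float]:
    """Toy model: predicts last+1, confidently only while last token < 3."""
    confidence = 0.9 if ctx[-1] < 3 else 0.5
    return {ctx[-1] + 1: confidence, 0: 1.0 - confidence}


if __name__ == "__main__":
    print(generate_draft(toy_draft, [1], n_draft=8, p_min=0.75))
```

In this toy setup, lowering p_min lets the loop run to the full n_draft budget, which trades more wasted draft work for a chance at longer accepted runs.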

Step 5: Run Speculative Generation

Execute the speculative generation loop: the draft model generates a sequence of candidate tokens, the target model evaluates all candidates in a single batched forward pass, and accepted tokens are emitted while rejected tokens cause a rollback to the last accepted position. The process repeats until generation is complete.

What happens per iteration:

Draft model generates n_draft candidate tokens autoregressively
All candidates plus the last accepted token form a verification batch
Target model processes the entire batch in one forward pass
Token-by-token comparison determines how many drafts to accept
Accepted tokens are output; generation continues from last accepted position
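
The iteration above can be sketched for the greedy-sampling case as follows. Both models are represented by a hypothetical function that maps a token sequence to its next token; the names are illustrative and not llama.cpp's API. A real implementation verifies all positions in one batched forward pass, whereas this sketch calls the target once per position for clarity.

```python
from typing import Callable, List

# Hypothetical interface: token sequence -> greedy next token
NextToken = Callable[[List[int]], int]


def speculative_step(target_next: NextToken,
                     draft_next: NextToken,
                     tokens: List[int],
                     n_draft: int) -> List[int]:
    """Run one draft/verify iteration and return the tokens to emit."""
    # Draft model proposes n_draft tokens autoregressively.
    draft: List[int] = []
    for _ in range(n_draft):
        draft.append(draft_next(tokens + draft))

    # Target checks each position (batched in a real implementation).
    emitted: List[int] = []
    for i in range(n_draft + 1):
        expected = target_next(tokens + draft[:i])
        emitted.append(expected)
        # Stop at the first mismatch: the target's own token replaces the
        # rejected draft token. If every draft matched, the final pass
        # through the loop contributes one extra token.
        if i == n_draft or draft[i] != expected:
            break
    return emitted


def generate(target_next: NextToken,
             draft_next: NextToken,
             prompt: List[int],
             n_predict: int,
             n_draft: int = 4) -> List[int]:
    """Repeat speculative steps until n_predict tokens have been produced."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_predict:
        tokens += speculative_step(target_next, draft_next, tokens, n_draft)
    return tokens[len(prompt):len(prompt) + n_predict]
```

Each step emits between 1 and n_draft + 1 tokens, and every emitted token is one the target model itself chose, which is why the greedy output is identical to decoding with the target alone.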

Key considerations:

Acceptance rate varies by text difficulty (higher for predictable text)
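Throughput per iteration is roughly: accepted_tokens / (draft_time + verify_time); the speedup is that figure divided by the target model's tokens/second without speculation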
No quality degradation: output distribution matches non-speculative generation
Performance tracking reports tokens/second and acceptance rate
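
A rough way to reason about the expected gain is sketched below. It makes a simplifying assumption that is not in the source: each draft token is accepted independently with the same probability, so the expected number of tokens emitted per iteration is a geometric sum. The function and parameter names are illustrative.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per iteration under an independent,
    constant per-token acceptance probability (simplifying assumption).

    Equals 1 + a + a^2 + ... + a^n_draft: the run of accepted drafts plus
    the one token the target always contributes.
    """
    if accept_rate >= 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float,
                      n_draft: int,
                      draft_token_time: float,
                      verify_time: float,
                      target_token_time: float) -> float:
    """Speculative tokens/second divided by baseline tokens/second."""
    step_time = n_draft * draft_token_time + verify_time
    spec_tps = expected_tokens_per_step(accept_rate, n_draft) / step_time
    base_tps = 1.0 / target_token_time
    return spec_tps / base_tps
```

A result below 1.0 means the draft overhead outweighs the accepted tokens for that configuration, which is a signal to lower n_draft or choose a better-matched draft model.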
