Principle:Ggml org Llama cpp Draft Model Loading
| Field | Value |
|---|---|
| Principle Name | Draft Model Loading |
| Workflow | Speculative_Decoding |
| Step | 3 of 5 |
| Domain | Draft Model Initialization |
| Scope | Loading draft models for speculative generation: smaller model or n-gram lookup |
Overview
Description
The draft model (or draft mechanism) is responsible for generating candidate tokens cheaply and quickly during speculative decoding. When using a model-based draft strategy, a smaller model of the same architecture family is loaded as a separate llama_model instance with its own parameters. When using n-gram based strategies, no separate model is needed; the draft mechanism operates entirely on the generation history.
The choice and configuration of the draft model directly determine speculative decoding performance. The draft model must be fast enough relative to the target model to provide a net speedup, and its draft tokens must be accepted at a sufficiently high rate under the target distribution.
Usage
Draft model loading is required when using the COMMON_SPECULATIVE_TYPE_DRAFT or COMMON_SPECULATIVE_TYPE_EAGLE3 strategies. For n-gram strategies, no draft model is loaded, and this step configures only the n-gram parameters.
Key considerations:
- The draft model should be significantly smaller than the target model (e.g., 1B draft for a 70B target)
- Draft model context size is typically smaller than the target context
- Draft model batch size is set to the target model's per-sequence context length
- The draft model can have different GPU layer and thread configurations than the target
Theoretical Basis
The draft model's role is to approximate the target model's output distribution at much lower computational cost. The theoretical trade-off is:
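Draft quality vs. speed: A draft model that closely approximates the target will have high acceptance rates, but if it is too close to the target in size, its own inference cost cancels the speedup. The optimal draft model minimizes the cost of one speculation step, where T_draft is the time of one draft forward pass, T_target is the time of one target verification pass, and k is the number of drafted tokens: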
cost_total = T_draft * k + T_target
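while maximizing the expected number of accepted draft tokens, where p_target(x_i) and p_draft(x_i) are the probabilities the two models assign to the i-th drafted token: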
E[accepted] = sum_{j=1}^{k}( product_{i=1}^{j}(min(1, p_target(x_i) / p_draft(x_i))) )
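As a rough illustration, the expected-acceptance sum can be evaluated directly for one drafted sequence. The sketch below is not llama.cpp code; the function name and the probability lists are hypothetical, and it assumes every drafted token has a nonzero draft probability (which holds for any token the draft model actually sampled).

```python
from math import prod


def expected_accepted(p_target, p_draft):
    """Expected number of accepted draft tokens for one drafted sequence.

    p_target[i] and p_draft[i] are the probabilities the target and draft
    models assign to the i-th drafted token (hypothetical example inputs).
    """
    if len(p_target) != len(p_draft):
        raise ValueError("need one probability pair per drafted token")
    # Per-token acceptance probability: min(1, p_target / p_draft).
    accept = [min(1.0, pt / pd) for pt, pd in zip(p_target, p_draft)]
    # Token j is kept only if tokens 1..j are all accepted, so the
    # expectation is the sum of the running products.
    return sum(prod(accept[:j]) for j in range(1, len(accept) + 1))


# The second token is over-predicted by the draft (0.8 vs. 0.4), so it is
# accepted with probability 0.5, which also halves the weight of token 3.
print(expected_accepted([0.9, 0.4, 0.5], [0.9, 0.8, 0.5]))  # 2.0
```

A single low-probability position early in the draft discounts every later position, which is why acceptance at the first few tokens matters most.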
Key relationships:
- T_draft / T_target ratio: Should be as small as possible. A ratio of 1:10 or better is typical.
- Acceptance rate p: The probability that a draft token is accepted. Higher is better but depends on domain and model quality.
- Draft length k: More drafts increase potential tokens per step but also increase draft cost. The optimal k depends on the acceptance rate.
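The interaction between these three quantities can be sketched numerically. The example below is an illustration only, under simplifying assumptions that are not stated in the text above: the acceptance rate is constant across positions, T_target does not grow with k, and each step also yields one token from the target's own verification pass. All function names are hypothetical.

```python
def expected_accepted(accept_rate, k):
    # Constant-rate special case of the expected-acceptance sum:
    # sum_{j=1}^{k} accept_rate ** j
    return sum(accept_rate ** j for j in range(1, k + 1))


def tokens_per_unit_time(accept_rate, k, t_draft, t_target):
    # Assumption: the target pass always contributes one token of its own,
    # so k = 0 reduces to plain decoding with the target model.
    produced = expected_accepted(accept_rate, k) + 1.0
    cost = t_draft * k + t_target  # cost_total from the formula above
    return produced / cost


def best_draft_length(accept_rate, t_draft, t_target, k_max=16):
    # Exhaustive search over draft lengths 0..k_max.
    return max(
        range(k_max + 1),
        key=lambda k: tokens_per_unit_time(accept_rate, k, t_draft, t_target),
    )


# Example: draft pass costs 1/10 of a target pass, 80% acceptance rate.
k = best_draft_length(accept_rate=0.8, t_draft=0.1, t_target=1.0)
speedup = tokens_per_unit_time(0.8, k, 0.1, 1.0) / tokens_per_unit_time(0.8, 0, 0.1, 1.0)
print(k, round(speedup, 2))
```

Under this model, lowering the acceptance rate or raising the T_draft / T_target ratio pushes the best k toward zero, at which point speculation no longer pays off.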
Draft model configuration principles:
- Context size: The draft model context is typically set to the minimum needed, often derived from the target context's per-sequence length (llama_n_ctx_seq(ctx_tgt)).
- Parallelism: Draft model runs with n_parallel = 1 since it generates a single token sequence.
- Batch size: Set to the target's per-sequence context size for efficient processing.
- Memory: Draft model KV cache uses F16 by default for K and V.
- Vocabulary alignment: The draft model must be vocabulary-compatible with the target model for token-level verification to work correctly. Token replacements can be configured for models whose vocabularies represent the same text with different token strings.
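The configuration principles above can be summarized as a small derivation from the target's settings. The sketch below is a plain-Python model of that derivation, not the llama.cpp API: ContextConfig, n_ctx_seq, and derive_draft_config are hypothetical names, and the per-sequence length is assumed to be the total context divided evenly across parallel sequences.

```python
from dataclasses import dataclass


@dataclass
class ContextConfig:
    # Hypothetical stand-in for context parameters; not the llama.cpp struct.
    n_ctx: int          # total context size in tokens
    n_batch: int        # maximum tokens submitted per decode call
    n_parallel: int     # number of parallel sequences
    type_k: str = "f16"  # KV cache type for K (F16 default, per the text)
    type_v: str = "f16"  # KV cache type for V


def n_ctx_seq(cfg):
    # Assumption: the context is split evenly across parallel sequences.
    return cfg.n_ctx // cfg.n_parallel


def derive_draft_config(target):
    per_seq = n_ctx_seq(target)
    return ContextConfig(
        n_ctx=per_seq,    # minimum needed: one target sequence
        n_batch=per_seq,  # batch sized to the per-sequence context
        n_parallel=1,     # the draft generates a single token sequence
    )


target = ContextConfig(n_ctx=8192, n_batch=2048, n_parallel=4)
draft = derive_draft_config(target)
print(draft)
```

Sizing the draft batch to the per-sequence context lets an entire prompt for one sequence be processed by the draft model in a single call.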