Principle:Ggml org Llama cpp Target Model Loading
| Field | Value |
|---|---|
| Principle Name | Target Model Loading |
| Workflow | Speculative_Decoding |
| Step | 2 of 5 |
| Domain | Model Loading for Speculative Verification |
| Scope | Loading the target (large) model that verifies speculative draft tokens |
Overview
Description
In speculative decoding, the target model is the large, high-quality model whose output distribution defines the correct generation. It serves as the verifier: after the draft mechanism produces candidate tokens, the target model evaluates all candidates in a single batched forward pass, accepting those whose probability under the target distribution meets the acceptance criteria and rejecting the rest.
Loading the target model is the foundational step of the speculative decoding pipeline. The target model must be loaded with a context that supports batch evaluation of multiple token positions in one forward pass. This is the key mechanism behind the speedup: when decoding is limited by memory bandwidth and k is small, verifying k draft tokens in parallel costs approximately the same as generating a single token autoregressively.
Usage
Target model loading is required in every speculative decoding configuration. The target model:
- Must have sufficient context size to accommodate the prompt plus generated tokens
- Should be loaded with a batch size large enough to verify all draft tokens of one step in a single forward pass
- Provides the vocabulary used by the entire speculative decoding pipeline
- Determines the authoritative output distribution for token acceptance
Theoretical Basis
The target model in speculative decoding plays the role of the oracle in the draft-then-verify framework. The theoretical basis for why this approach yields speedup is:
Parallel verification property: A Transformer model can evaluate multiple token positions in a single forward pass, because all positions in a batch are computed in parallel. For small batches, decoding is typically limited by the memory bandwidth needed to read the model weights rather than by arithmetic, so evaluating a handful of positions costs about as much as evaluating one. If the draft mechanism provides k candidate tokens, the target model is fed the last accepted token followed by the k drafts and returns probability distributions for k+1 positions (one to check each draft, plus one for the token after the final draft) in roughly the same wall-clock time as generating one token.
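To make the k+1 positions concrete, here is a minimal sketch of one verification step. It assumes greedy exact-match acceptance (a draft token is accepted only if it equals the target model's top choice), which is only one possible acceptance criterion; sampling-based schemes differ. The function name `verify_greedy` and its inputs are hypothetical and not part of llama.cpp.

```python
from typing import List, Sequence


def verify_greedy(draft: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one verification step (greedy assumption).

    draft:         the k token ids proposed by the draft mechanism.
    target_argmax: k + 1 token ids from one batched target pass; entry i is
                   the target's top choice for the position draft[i] tries
                   to fill, and the last entry is its choice after all k drafts.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("expected k + 1 target predictions for k draft tokens")

    emitted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_argmax[i]:
            break  # first mismatch: this draft and all later ones are discarded
        emitted.append(tok)

    # The target's own prediction at the first unverified position is always
    # emitted, so every step yields at least one token.
    emitted.append(target_argmax[len(emitted)])
    return emitted


if __name__ == "__main__":
    # Two of three drafts match; the target's token replaces the third.
    print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]
```

When every draft matches, the last target prediction is appended as an extra token, so a step with k drafts can emit up to k+1 tokens.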
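Expected speedup: Let p be the probability that a given draft token is accepted (assumed independent and identical across positions) and let k be the number of draft tokens generated per step. Every verification step yields one token from the target model itself (the correction at the first rejection, or an extra token if all drafts are accepted), so the expected number of tokens produced per verification step is:
E[tokens] = 1 + p + p^2 + ... + p^k = (1 - p^(k+1)) / (1 - p) (truncated geometric series, p < 1)
For a target model that takes time T_target per forward pass and a draft model that takes time T_draft per token, the speedup factor over plain autoregressive decoding with the target model is approximately:
speedup = E[tokens] / (1 + k * T_draft / T_target)
The denominator is the cost of one speculative step (k draft passes plus one target pass) in units of T_target. For speculative decoding to provide meaningful speedup, the draft mechanism must therefore be much faster than the target model, and the acceptance rate must be high enough to offset the drafting cost.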
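The per-step token estimate and the speedup estimate above can be evaluated numerically to see how the acceptance rate and the draft cost interact. The sketch below is illustrative only: the function names are hypothetical, and it inherits the simplifying assumption that each draft token is accepted independently with the same probability p.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens per verification step: (1 - p^(k+1)) / (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if p == 1.0:
        # Limit of the closed form as p -> 1: all k drafts plus one extra token.
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


def estimated_speedup(p: float, k: int, t_draft: float, t_target: float) -> float:
    """Tokens per step divided by the step cost in units of t_target."""
    if t_draft < 0 or t_target <= 0:
        raise ValueError("t_draft must be >= 0 and t_target must be > 0")
    return expected_tokens_per_step(p, k) / (1.0 + k * t_draft / t_target)


if __name__ == "__main__":
    # Assumed example values: 80% acceptance, 4 drafts per step,
    # draft token costs 5% of a target forward pass.
    print(round(expected_tokens_per_step(0.8, 4), 4))      # 3.3616
    print(round(estimated_speedup(0.8, 4, 0.05, 1.0), 2))  # 2.8
```

With a draft that is as slow as the target (t_draft equal to t_target), the same settings give 3.3616 / 5, which is below 1, so speculation would slow generation down.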
Context requirements: The target model context must be initialized with parameters that support the speculative decoding workflow:
- Batch size must accommodate at least n_max + 1 tokens for verification, where n_max is the maximum number of draft tokens proposed per step (the k above)
- The context must keep its KV cache across successive verification steps, so that each step evaluates only the newly proposed tokens
- The vocabulary from the target model is the canonical vocabulary for the entire pipeline
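The sizing requirements above can be turned into a small pre-load check. This is a sketch under stated assumptions: the helper `min_target_context_sizes` and the `TargetContextSizes` type are hypothetical, the field names `n_ctx` and `n_batch` simply mirror the usual names of llama.cpp context parameters, and the extra n_max slots of context headroom (for draft tokens that sit in the cache while awaiting verification) are a conservative assumption rather than a documented rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetContextSizes:
    n_ctx: int    # minimum context length in tokens
    n_batch: int  # minimum number of tokens per forward pass


def min_target_context_sizes(n_prompt: int, n_predict: int, n_max: int) -> TargetContextSizes:
    """Lower bounds for the target context in a speculative decoding run.

    n_prompt:  number of prompt tokens.
    n_predict: maximum number of tokens to generate.
    n_max:     maximum number of draft tokens proposed per step.
    """
    if n_prompt < 0 or n_predict < 0 or n_max < 0:
        raise ValueError("all sizes must be non-negative")

    # Prompt plus generated tokens, plus headroom for drafts that are in the
    # cache during verification (assumption: up to n_max unverified tokens).
    n_ctx = n_prompt + n_predict + n_max

    # One verification pass evaluates the last accepted token plus n_max drafts.
    n_batch = n_max + 1

    return TargetContextSizes(n_ctx=n_ctx, n_batch=n_batch)


if __name__ == "__main__":
    print(min_target_context_sizes(n_prompt=512, n_predict=256, n_max=16))
    # TargetContextSizes(n_ctx=784, n_batch=17)
```

These are lower bounds only; a real configuration would usually pick a larger batch size so the prompt itself can be processed in fewer passes.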