Principle:Ggml org Llama cpp Target Model Loading

Field	Value
Principle Name	Target Model Loading
Workflow	Speculative_Decoding
Step	2 of 5
Domain	Model Loading for Speculative Verification
Scope	Loading the target (large) model that verifies speculative draft tokens

Overview

Description

In speculative decoding, the target model is the large, high-quality model whose output distribution defines the correct generation. It serves as the verifier: after the draft mechanism produces candidate tokens, the target model evaluates all candidates in a single batched forward pass, accepting those whose probability under the target distribution meets the acceptance criteria and rejecting the rest.

Loading the target model is the foundational step of the speculative decoding pipeline. The target model must be loaded with a context that supports batch evaluation of multiple token positions simultaneously, as this is the key mechanism that enables the speedup -- verifying k draft tokens in parallel costs approximately the same as generating a single token autoregressively.
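The accept/reject step described above can be illustrated with a minimal sketch. It assumes the simplest acceptance criterion, greedy matching: a draft token is accepted only if it equals the target model's top choice at that position. The function name `verify_greedy` and its arguments are hypothetical and are not part of llama.cpp; sampling-based acceptance rules differ.

```python
from typing import List, Tuple


def verify_greedy(draft: List[int], target_next: List[int]) -> Tuple[List[int], int]:
    """Accept the longest draft prefix that matches the target model.

    draft:       k candidate token ids from the draft mechanism.
    target_next: k + 1 token ids; target_next[i] is the target model's
                 greedy choice given the prompt plus draft[:i]. All k + 1
                 entries are assumed to come from one batched forward pass.
    Returns (emitted tokens, number of accepted draft tokens).
    """
    if len(target_next) != len(draft) + 1:
        raise ValueError("target_next must hold one more entry than draft")

    n_accepted = 0
    for d, t in zip(draft, target_next):
        if d != t:
            break  # first mismatch: this draft token and all later ones are rejected
        n_accepted += 1

    # The target's own token at the first unverified position is always emitted,
    # so every verification step yields at least one token.
    emitted = draft[:n_accepted] + [target_next[n_accepted]]
    return emitted, n_accepted


if __name__ == "__main__":
    print(verify_greedy([5, 9, 2, 7], [5, 9, 4, 1, 8]))
```

In this example the first two draft tokens match, the third does not, and the step emits the two accepted tokens followed by the target model's own choice at the mismatch position.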

Usage

Target model loading is required in every speculative decoding configuration. The target model:

Must have sufficient context size to accommodate the prompt plus generated tokens
Should be loaded with the maximum batch size needed for parallel verification
Provides the vocabulary used by the entire speculative decoding pipeline
Determines the authoritative output distribution for token acceptance

Theoretical Basis

The target model in speculative decoding plays the role of the oracle in the draft-then-verify framework. The theoretical basis for why this approach yields speedup is:

Parallel verification property: A Transformer model can evaluate multiple token positions in a single forward pass. For small batches, the cost is comparable to evaluating a single position, because single-token decoding is typically limited by reading the model weights from memory rather than by arithmetic, and the weights are read once for all positions in the batch. If the draft mechanism provides k candidate tokens, the target model can compute the probability distribution for all k+1 positions (the k drafts plus the next token after the last accepted draft) in roughly the same wall-clock time as generating one token.

Expected speedup: Let p be the probability that a draft token is accepted by the target model, assumed independent and identical across positions, and let k be the number of draft tokens generated per step. Each verification step emits the accepted draft prefix plus one token taken from the target model itself, so the expected number of tokens produced per verification step is:

E[tokens] = 1 + p + p^2 + ... + p^k = (1 - p^(k+1)) / (1 - p)    (truncated geometric series, p < 1)

For a target model that takes time T_target per forward pass and a draft model that takes time T_draft per token, the speedup factor is approximately:

speedup = E[tokens] / (1 + k * T_draft / T_target)

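This means the draft mechanism must be much cheaper than the target model (T_draft / T_target well below 1) for speculative decoding to provide meaningful speedup; otherwise the drafting cost in the denominator cancels the gain from the additional tokens produced per step.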
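As a worked illustration, the two formulas above can be evaluated directly. This is a sketch under the same simplifying assumption of a constant, independent acceptance rate; the function names and the parameter `cost_ratio` (standing for T_draft / T_target) are chosen here for illustration.

```python
def expected_tokens(p: float, k: int) -> float:
    """Expected tokens produced per verification step: (1 - p^(k+1)) / (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if p == 1.0:
        # Limit of the geometric sum: all k drafts plus one target token.
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


def speedup(p: float, k: int, cost_ratio: float) -> float:
    """Approximate speedup; cost_ratio stands for T_draft / T_target."""
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")
    return expected_tokens(p, k) / (1.0 + k * cost_ratio)


if __name__ == "__main__":
    for k in (2, 4, 8):
        print(k, round(expected_tokens(0.8, k), 4), round(speedup(0.8, k, 0.1), 4))
```

With p = 0.8, k = 4, and a draft model ten times cheaper than the target, a step produces about 3.36 tokens on average for an estimated speedup of about 2.4x. Raising the cost ratio toward 1 pushes the estimate below 1, meaning speculation would be slower than plain decoding.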

Context requirements: The target model context must be initialized with parameters that support the speculative decoding workflow:

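Batch size must accommodate at least n_max + 1 tokens for verification, where n_max is the maximum number of draft tokens per step (k above)
The context must keep a KV cache that persists across evaluations, so each verification pass extends the cached prefix instead of recomputing it, and entries for rejected draft tokens must be discarded before the next pass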
The vocabulary from the target model is the canonical vocabulary for the entire pipeline
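The sizing requirements above can be expressed as a small pre-flight check. This sketch is not llama.cpp code: the class `TargetContextPlan` and its fields are hypothetical, and the extra n_max of context headroom for draft tokens that are evaluated but later rejected is a conservative assumption rather than something the requirements state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TargetContextPlan:
    """Hypothetical helper for checking target-context sizes."""
    n_prompt: int   # tokens in the prompt
    n_predict: int  # tokens to generate
    n_max: int      # maximum draft tokens per verification step

    def min_batch(self) -> int:
        # One verification pass evaluates up to n_max drafts plus one more token.
        return self.n_max + 1

    def min_ctx(self) -> int:
        # Prompt plus generated tokens, plus assumed headroom for
        # speculative tokens that occupy the cache until they are rejected.
        return self.n_prompt + self.n_predict + self.n_max

    def problems(self, n_ctx: int, n_batch: int) -> List[str]:
        """Return human-readable issues; an empty list means the sizes fit."""
        issues = []
        if n_ctx < self.min_ctx():
            issues.append(f"n_ctx={n_ctx} is below required {self.min_ctx()}")
        if n_batch < self.min_batch():
            issues.append(f"n_batch={n_batch} is below required {self.min_batch()}")
        return issues


if __name__ == "__main__":
    plan = TargetContextPlan(n_prompt=512, n_predict=256, n_max=16)
    print(plan.problems(n_ctx=4096, n_batch=8))
```

Running such a check before creating the target context surfaces an undersized batch early, instead of at the first verification pass.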
