Implementation:Ggml org Llama cpp Draft Model Init
| Field | Value |
|---|---|
| Implementation Name | Draft Model Init |
| Doc Type | Pattern Doc |
| Workflow | Speculative_Decoding |
| Step | 3 of 5 |
| Source File | examples/speculative-simple/speculative-simple.cpp |
Overview
Description
This implementation documents the draft model loading pattern used in the speculative-simple example. The draft model is loaded as a separate llama_model instance using llama_model_load_from_file with parameters derived from the target model's configuration. The draft model configuration overrides several parameters from the target: it runs with a single parallel sequence, uses the target's per-sequence context length as its batch size, and applies draft-specific GPU layer and threading settings.
After loading, the draft model pointer is stored in params.speculative.model_dft so it can be shared across multiple speculative decoding contexts, and the context parameters are stored in params.speculative.cparams_dft.
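The ownership split described above (a smart pointer owns the model, while a raw pointer is published for sharing) can be sketched in isolation. This is a minimal illustration, not llama.cpp code: `fake_model`, `fake_model_ptr`, `fake_model_load`, `fake_spec_params`, and `load_draft` are hypothetical stand-ins, and the sketch assumes `llama_model_ptr` behaves like a `std::unique_ptr` with a custom deleter.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in types for illustration only; these are NOT the llama.cpp API.
struct fake_model {
    std::string path;
};

// Counts frees so the ownership behaviour can be observed.
static int g_models_freed = 0;

struct fake_model_deleter {
    void operator()(fake_model * m) const {
        ++g_models_freed;
        delete m;
    }
};

// Assumed to mirror how llama_model_ptr owns a model.
using fake_model_ptr = std::unique_ptr<fake_model, fake_model_deleter>;

// Hypothetical loader: returns nullptr on failure, like a C-style load call.
fake_model * fake_model_load(const std::string & path) {
    if (path.empty()) {
        return nullptr;
    }
    return new fake_model{path};
}

// Hypothetical counterpart of the speculative params: holds a non-owning pointer.
struct fake_spec_params {
    fake_model * model_dft = nullptr;
};

// Loads the model into `owner` and publishes a raw, non-owning pointer in `spec`.
// Returns false (and leaves spec untouched) if loading fails.
bool load_draft(const std::string & path, fake_model_ptr & owner, fake_spec_params & spec) {
    owner.reset(fake_model_load(path));
    if (owner == nullptr) {
        return false;
    }
    spec.model_dft = owner.get();
    return true;
}
```

The point of the sketch is the lifetime rule it implies: the raw pointer stored in the params is only valid while the owning smart pointer is alive, so the owner has to outlive every context created from that pointer.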
Usage
auto mparams_dft = common_model_params_to_llama(params_dft);
model_dft.reset(llama_model_load_from_file(params_dft.model.path.c_str(), mparams_dft));
params.speculative.model_dft = model_dft.get();
params.speculative.cparams_dft = common_context_params_to_llama(params_dft);
Code Reference
| Field | Value |
|---|---|
| Source Location | examples/speculative-simple/speculative-simple.cpp:49-81 |
| Signature | Pattern using llama_model_load_from_file(path, mparams) |
| Import | #include "llama.h", #include "common.h" |
Draft model loading pattern:
// load the draft model
llama_model_ptr model_dft;
{
const auto & params_spec = params.speculative;
auto params_dft = params;
params_dft.n_parallel = 1;
params_dft.n_ctx = params_spec.n_ctx;
params_dft.n_batch = llama_n_ctx_seq(ctx_tgt);
params_dft.devices = params_spec.devices;
params_dft.model = params_spec.mparams_dft;
params_dft.n_gpu_layers = params_spec.n_gpu_layers;
if (params_spec.cpuparams.n_threads > 0) {
params_dft.cpuparams.n_threads = params.speculative.cpuparams.n_threads;
params_dft.cpuparams_batch.n_threads = params.speculative.cpuparams_batch.n_threads;
}
params_dft.tensor_buft_overrides = params.speculative.tensor_buft_overrides;
auto mparams_dft = common_model_params_to_llama(params_dft);
model_dft.reset(llama_model_load_from_file(params_dft.model.path.c_str(), mparams_dft));
if (model_dft == nullptr) {
LOG_ERR("failed to load draft model, '%s'\n", params_dft.model.path.c_str());
return 1;
}
params.speculative.model_dft = model_dft.get();
params.speculative.cparams_dft = common_context_params_to_llama(params_dft);
}
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | params.speculative.mparams_dft.path | std::string | Path to the draft model GGUF file (--model-draft CLI flag) |
| Input | params.speculative.n_ctx | int32_t | Draft model context size |
| Input | params.speculative.n_gpu_layers | int32_t | GPU layers for draft model (-1 for default) |
| Input | ctx_tgt | llama_context * | Target context (used for llama_n_ctx_seq() to set draft batch size) |
| Output | model_dft | llama_model_ptr | Loaded draft model (smart pointer) |
| Output | params.speculative.model_dft | llama_model * | Raw pointer stored for sharing across speculative contexts |
| Output | params.speculative.cparams_dft | llama_context_params | Context parameters for creating draft inference contexts |
Parameter overrides applied to draft model:
- n_parallel = 1 (single sequence generation)
- n_batch = llama_n_ctx_seq(ctx_tgt) (batch size from target per-sequence context)
- Model path from params.speculative.mparams_dft
- GPU layers from params.speculative.n_gpu_layers
- Thread counts from params.speculative.cpuparams (if specified)
- Device and tensor buffer overrides from speculative params
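The override rules listed above can be expressed as a small pure function that copies the target settings and then rewrites the draft-specific fields. This is a sketch under stated assumptions: `sketch_params`, `sketch_spec`, and `derive_draft_params` are hypothetical stand-ins whose field names loosely mirror the ones in this document, not the real `common_params` types, and the default values are arbitrary. Device and tensor buffer overrides are left out for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in for the speculative sub-parameters (hypothetical, for illustration).
struct sketch_spec {
    std::string model_path;
    int32_t n_ctx           = 0;
    int32_t n_gpu_layers    = -1;
    int32_t n_threads       = 0;  // <= 0 means "not specified"
    int32_t n_threads_batch = 0;
};

// Stand-in for the top-level parameters (hypothetical, arbitrary defaults).
struct sketch_params {
    std::string model_path;
    int32_t n_parallel      = 4;
    int32_t n_ctx           = 4096;
    int32_t n_batch         = 2048;
    int32_t n_gpu_layers    = -1;
    int32_t n_threads       = 8;
    int32_t n_threads_batch = 8;
    sketch_spec speculative;
};

// Derives draft parameters from the target parameters.
// n_ctx_seq_tgt plays the role of llama_n_ctx_seq(ctx_tgt) in the real code.
sketch_params derive_draft_params(const sketch_params & tgt, int32_t n_ctx_seq_tgt) {
    const sketch_spec & spec = tgt.speculative;

    sketch_params dft = tgt;            // start from a copy of the target settings
    dft.n_parallel   = 1;               // single sequence
    dft.n_ctx        = spec.n_ctx;      // draft-specific context size
    dft.n_batch      = n_ctx_seq_tgt;   // batch size from target per-sequence context
    dft.model_path   = spec.model_path; // draft model file
    dft.n_gpu_layers = spec.n_gpu_layers;

    // Thread counts are overridden only when explicitly specified.
    if (spec.n_threads > 0) {
        dft.n_threads       = spec.n_threads;
        dft.n_threads_batch = spec.n_threads_batch;
    }
    return dft;
}
```

Working on a copy leaves the target parameters untouched, which matches the pattern of `auto params_dft = params;` in the source: any field that is not explicitly overridden is inherited from the target configuration.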
Usage Examples
Condensed draft model loading in speculative-simple (device, thread, and tensor buffer overrides omitted):
#include "common.h"
#include "llama.h"
common_params params;
common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SPECULATIVE);
// Verify draft model is specified
if (params.speculative.mparams_dft.path.empty()) {
fprintf(stderr, "--model-draft is required\n");
return 1;
}
// Load target model first
auto llama_init_tgt = common_init_from_params(params);
llama_model * model_tgt = llama_init_tgt->model();
llama_context * ctx_tgt = llama_init_tgt->context();
// Load draft model with derived parameters
llama_model_ptr model_dft;
{
auto params_dft = params;
params_dft.n_parallel = 1;
params_dft.n_ctx = params.speculative.n_ctx;
params_dft.n_batch = llama_n_ctx_seq(ctx_tgt);
params_dft.model = params.speculative.mparams_dft;
params_dft.n_gpu_layers = params.speculative.n_gpu_layers;
auto mparams_dft = common_model_params_to_llama(params_dft);
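model_dft.reset(llama_model_load_from_file(
params_dft.model.path.c_str(), mparams_dft));
// Bail out if the draft model could not be loaded
if (model_dft == nullptr) {
fprintf(stderr, "failed to load draft model, '%s'\n", params_dft.model.path.c_str());
return 1;
}
params.speculative.model_dft = model_dft.get();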
params.speculative.cparams_dft = common_context_params_to_llama(params_dft);
}