Implementation:Ggml org Llama cpp Draft Model Init

| Field | Value |
|---|---|
| Implementation Name | Draft Model Init |
| Doc Type | Pattern Doc |
| Workflow | Speculative_Decoding |
| Step | 3 of 5 |
| Source File | examples/speculative-simple/speculative-simple.cpp |

Overview

Description

This page documents the draft model loading pattern used in the speculative-simple example. The draft model is loaded as a separate llama_model instance using llama_model_load_from_file. Its configuration starts as a copy of the target's parameters and then overrides several fields: it runs a single parallel sequence, takes its context size from params.speculative.n_ctx, uses the target's per-sequence context length (llama_n_ctx_seq(ctx_tgt)) as its batch size, and applies the draft-specific device, GPU layer, threading, and tensor buffer override settings.
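The copy-then-override step described above can be sketched without linking against llama.cpp. The sketch_params and sketch_spec_params structs and the derive_draft_params function below are hypothetical stand-ins for the relevant fields of common_params, not part of the library, and n_ctx_seq_tgt stands in for the value returned by llama_n_ctx_seq(ctx_tgt).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-ins for a few fields of common_params; the real
// structs in common.h carry many more members.
struct sketch_spec_params {
    int32_t     n_ctx        = 0;
    int32_t     n_gpu_layers = -1;
    std::string model_path;
};

struct sketch_params {
    int32_t            n_parallel   = 4;
    int32_t            n_ctx        = 4096;
    int32_t            n_batch      = 2048;
    int32_t            n_gpu_layers = -1;
    std::string        model_path;
    sketch_spec_params speculative;
};

// Copy the target parameters, then override the draft-specific fields.
// n_ctx_seq_tgt plays the role of llama_n_ctx_seq(ctx_tgt).
sketch_params derive_draft_params(const sketch_params & tgt, int32_t n_ctx_seq_tgt) {
    assert(n_ctx_seq_tgt > 0);

    sketch_params dft = tgt;                          // start from the target's settings
    dft.n_parallel   = 1;                             // single sequence
    dft.n_ctx        = tgt.speculative.n_ctx;         // draft context size
    dft.n_batch      = n_ctx_seq_tgt;                 // target per-sequence context length
    dft.model_path   = tgt.speculative.model_path;    // draft model file
    dft.n_gpu_layers = tgt.speculative.n_gpu_layers;  // draft GPU layers
    return dft;
}
```

Copying first means any field that is not explicitly overridden keeps the target's value, and the target's own parameters are left unmodified.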

After loading, the raw draft model pointer is stored in params.speculative.model_dft so it can be shared across multiple speculative decoding contexts, while the model_dft smart pointer retains ownership of the model. The derived context parameters are stored in params.speculative.cparams_dft.
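The split between an owning smart pointer and a shared raw pointer can be illustrated in isolation. This is a minimal sketch: sketch_model, sketch_model_free, sketch_model_ptr, sketch_shared_params, and share_model are hypothetical names that mimic the roles of llama_model, its free function, llama_model_ptr, and params.speculative. The sketch assumes the owning pointer is a std::unique_ptr with a custom deleter.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for llama_model (the real type is opaque).
struct sketch_model { int id; };

// Counts frees so the ownership behavior can be observed.
static int g_models_freed = 0;

// Plays the role of the library's model free function in this sketch.
void sketch_model_free(sketch_model * m) {
    delete m;
    ++g_models_freed;
}

struct sketch_model_deleter {
    void operator()(sketch_model * m) const { sketch_model_free(m); }
};

// Owning handle, analogous in role to llama_model_ptr.
using sketch_model_ptr = std::unique_ptr<sketch_model, sketch_model_deleter>;

// Non-owning view, analogous in role to params.speculative.model_dft.
struct sketch_shared_params {
    sketch_model * model_dft = nullptr;
};

// Publish a non-owning pointer; the owner stays responsible for freeing.
void share_model(const sketch_model_ptr & owner, sketch_shared_params & shared) {
    assert(owner != nullptr);
    shared.model_dft = owner.get();
}
```

The practical consequence is a lifetime constraint: anything that reads the shared raw pointer must finish before the owning smart pointer goes out of scope, because the model is freed at that point.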

Usage

auto mparams_dft = common_model_params_to_llama(params_dft);
model_dft.reset(llama_model_load_from_file(params_dft.model.path.c_str(), mparams_dft));
params.speculative.model_dft = model_dft.get();
params.speculative.cparams_dft = common_context_params_to_llama(params_dft);

Code Reference

| Field | Value |
|---|---|
| Source Location | examples/speculative-simple/speculative-simple.cpp:49-81 |
| Signature | Pattern using llama_model_load_from_file(path, mparams) |
| Import | #include "llama.h", #include "common.h" |

Draft model loading pattern:

// load the draft model
llama_model_ptr model_dft;

{
    const auto & params_spec = params.speculative;

    auto params_dft = params;

    params_dft.n_parallel   = 1;
    params_dft.n_ctx        = params_spec.n_ctx;
    params_dft.n_batch      = llama_n_ctx_seq(ctx_tgt);
    params_dft.devices      = params_spec.devices;
    params_dft.model        = params_spec.mparams_dft;
    params_dft.n_gpu_layers = params_spec.n_gpu_layers;

    if (params_spec.cpuparams.n_threads > 0) {
        params_dft.cpuparams.n_threads       = params.speculative.cpuparams.n_threads;
        params_dft.cpuparams_batch.n_threads = params.speculative.cpuparams_batch.n_threads;
    }

    params_dft.tensor_buft_overrides = params.speculative.tensor_buft_overrides;

    auto mparams_dft = common_model_params_to_llama(params_dft);

    model_dft.reset(llama_model_load_from_file(params_dft.model.path.c_str(), mparams_dft));
    if (model_dft == nullptr) {
        LOG_ERR("failed to load draft model, '%s'\n", params_dft.model.path.c_str());
        return 1;
    }

    params.speculative.model_dft = model_dft.get();
    params.speculative.cparams_dft = common_context_params_to_llama(params_dft);
}

I/O Contract

| Direction | Name | Type | Description |
|---|---|---|---|
| Input | params.speculative.mparams_dft.path | std::string | Path to the draft model GGUF file (--model-draft CLI flag) |
| Input | params.speculative.n_ctx | int32_t | Draft model context size |
| Input | params.speculative.n_gpu_layers | int32_t | GPU layers for draft model (-1 for default) |
| Input | ctx_tgt | llama_context * | Target context (used for llama_n_ctx_seq() to set draft batch size) |
| Output | model_dft | llama_model_ptr | Loaded draft model (smart pointer) |
| Output | params.speculative.model_dft | llama_model * | Raw pointer stored for sharing across speculative contexts |
| Output | params.speculative.cparams_dft | llama_context_params | Context parameters for creating draft inference contexts |

Parameter overrides applied to draft model:

  • n_parallel = 1 (single sequence generation)
  • n_ctx = params.speculative.n_ctx (draft context size)
  • n_batch = llama_n_ctx_seq(ctx_tgt) (batch size from target per-sequence context)
  • Model path from params.speculative.mparams_dft
  • GPU layers from params.speculative.n_gpu_layers
  • Thread counts from params.speculative.cpuparams and cpuparams_batch (only if cpuparams.n_threads > 0)
  • Device and tensor buffer overrides from speculative params
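The conditional thread override in the list above can be sketched on its own. The sketch_cpu_params and sketch_thread_config types and the apply_draft_threads function are hypothetical stand-ins for the cpuparams and cpuparams_batch fields, and the default value of 0 is an assumption of this sketch rather than the library's default. The point is that the main thread count acts as the gate: when it is positive, both the main and the batch thread counts are copied.

```cpp
#include <cassert>

// Hypothetical stand-in for the thread-count part of the CPU parameters.
struct sketch_cpu_params {
    int n_threads = 0;  // assumption for this sketch: 0 means "not specified"
};

struct sketch_thread_config {
    sketch_cpu_params cpuparams;        // threads for single-token decoding
    sketch_cpu_params cpuparams_batch;  // threads for batch processing
};

// Apply draft-specific thread counts only when the draft's main thread
// count was set; otherwise keep whatever was inherited from the target.
void apply_draft_threads(sketch_thread_config & dft, const sketch_thread_config & spec) {
    if (spec.cpuparams.n_threads > 0) {
        dft.cpuparams.n_threads       = spec.cpuparams.n_threads;
        dft.cpuparams_batch.n_threads = spec.cpuparams_batch.n_threads;
        assert(dft.cpuparams.n_threads > 0);  // postcondition of the override
    }
}
```

Note that the batch thread count is copied whenever the main count is positive, even if the batch count itself was left unspecified, which mirrors the single condition used in the documented pattern.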

Usage Examples

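Simplified draft model loading in speculative-simple (device, thread, and tensor buffer overrides omitted for brevity):

#include "arg.h"
#include "common.h"
#include "llama.h"

#include <cstdio>

int main(int argc, char ** argv) {
    common_params params;
    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SPECULATIVE)) {
        return 1;
    }

    // Verify draft model is specified
    if (params.speculative.mparams_dft.path.empty()) {
        fprintf(stderr, "--model-draft is required\n");
        return 1;
    }

    llama_backend_init();

    // Load target model first; the draft batch size depends on its context
    auto llama_init_tgt = common_init_from_params(params);
    llama_model   * model_tgt = llama_init_tgt->model();
    llama_context * ctx_tgt   = llama_init_tgt->context();
    if (model_tgt == nullptr || ctx_tgt == nullptr) {
        fprintf(stderr, "failed to load target model\n");
        return 1;
    }

    // Load draft model with parameters derived from the target's
    llama_model_ptr model_dft;
    {
        auto params_dft = params;  // start from a copy of the target parameters
        params_dft.n_parallel   = 1;
        params_dft.n_ctx        = params.speculative.n_ctx;
        params_dft.n_batch      = llama_n_ctx_seq(ctx_tgt);
        params_dft.model        = params.speculative.mparams_dft;
        params_dft.n_gpu_layers = params.speculative.n_gpu_layers;

        auto mparams_dft = common_model_params_to_llama(params_dft);
        model_dft.reset(llama_model_load_from_file(
            params_dft.model.path.c_str(), mparams_dft));
        if (model_dft == nullptr) {
            fprintf(stderr, "failed to load draft model, '%s'\n", params_dft.model.path.c_str());
            return 1;
        }

        // Share the raw pointer; model_dft keeps ownership
        params.speculative.model_dft   = model_dft.get();
        params.speculative.cparams_dft = common_context_params_to_llama(params_dft);
    }

    // ... create the speculative contexts and run decoding here ...

    llama_backend_free();
    return 0;
}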
