
Implementation: ggml-org/llama.cpp llama_init_from_model

From Leeroopedia
Knowledge Sources: ggml-org/llama.cpp
Domains: Inference Context Initialization, KV Cache Allocation, Compute Graph Scheduling
Last Updated: 2026-02-14

Overview

Description

llama_init_from_model creates an inference context from a previously loaded model. The context encapsulates all mutable runtime state needed for inference: the KV cache, batch processing buffers, backend scheduler, thread configuration, and performance counters. The returned context handle is used for all subsequent decode, encode, and sampling operations.

Multiple contexts can be created from the same model, allowing concurrent inference sessions that share model weights but maintain independent KV caches and generation state.

Usage

#include "llama.h"

// Assume model is already loaded
llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_ctx   = 4096;     // context window of 4096 tokens
ctx_params.n_batch = 2048;     // process up to 2048 tokens per decode call
ctx_params.no_perf = false;    // enable performance counters

llama_context * ctx = llama_init_from_model(model, ctx_params);
if (ctx == NULL) {
    fprintf(stderr, "Failed to create context\n");
    return 1;
}

// Use ctx for decode, sampling, etc.

// Free when done (does not free the model)
llama_free(ctx);

Code Reference

Source Location

File Line(s) Type
include/llama.h 470-472 Declaration
src/llama-context.cpp 2973-3022 Implementation

Signature

LLAMA_API struct llama_context * llama_init_from_model(
                 struct llama_model * model,
        struct llama_context_params   params);

Import

#include "llama.h"

I/O Contract

Inputs

Parameter Type Description
model struct llama_model * Pointer to a loaded model (from llama_model_load_from_file). Must not be NULL.
params struct llama_context_params Configuration struct controlling context behavior. Key fields documented below.

llama_context_params fields (defined at include/llama.h:327-379):

Field Type Default Description
n_ctx uint32_t 0 (from model) Text context size (maximum sequence length). 0 uses the model's training context size.
n_batch uint32_t 2048 Logical maximum batch size that can be submitted to llama_decode.
n_ubatch uint32_t 512 Physical maximum micro-batch size for internal processing.
n_seq_max uint32_t 1 Maximum number of concurrent sequences.
n_threads int32_t GGML_DEFAULT_N_THREADS Threads for single-token generation.
n_threads_batch int32_t GGML_DEFAULT_N_THREADS Threads for batch (prompt) processing.
rope_scaling_type enum llama_rope_scaling_type from model RoPE scaling type.
pooling_type enum llama_pooling_type from model Pooling type for embedding results.
attention_type enum llama_attention_type from model Attention type for embeddings.
flash_attn_type enum llama_flash_attn_type AUTO When to enable Flash Attention.
rope_freq_base float 0 (from model) RoPE base frequency.
rope_freq_scale float 0 (from model) RoPE frequency scaling factor.
yarn_ext_factor float negative (from model) YaRN extrapolation mix factor.
yarn_attn_factor float 0 YaRN magnitude scaling factor.
yarn_beta_fast float 0 YaRN low correction dim.
yarn_beta_slow float 0 YaRN high correction dim.
yarn_orig_ctx uint32_t 0 YaRN original context size.
defrag_thold float 0 [DEPRECATED] KV cache defrag threshold.
type_k enum ggml_type GGML_TYPE_F16 Data type for K cache. [EXPERIMENTAL]
type_v enum ggml_type GGML_TYPE_F16 Data type for V cache. [EXPERIMENTAL]
embeddings bool false Extract embeddings together with logits.
offload_kqv bool true Offload KQV ops (including KV cache) to GPU.
no_perf bool true Disable performance timing measurements.
op_offload bool true Offload host tensor operations to device.
swa_full bool true Use full-size sliding window attention cache.
kv_unified bool true Use a unified buffer across input sequences for attention.
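To make the n_batch/n_ubatch distinction concrete: a single llama_decode call may submit up to n_batch tokens, and the context then processes them internally in chunks of at most n_ubatch tokens. The chunk count can be sketched as follows (n_ubatches is a hypothetical helper for illustration, not a llama.cpp function):

```c
#include <assert.h>
#include <stdint.h>

// Number of physical micro-batches needed to process a logical batch
// of n_tokens, given the configured n_ubatch size (ceiling division).
static uint32_t n_ubatches(uint32_t n_tokens, uint32_t n_ubatch) {
    if (n_ubatch == 0) {
        return 0; // invalid configuration; caught by context validation
    }
    return (n_tokens + n_ubatch - 1) / n_ubatch;
}
```

With the defaults (n_batch = 2048, n_ubatch = 512), a full 2048-token prompt submission is processed as four micro-batches.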

Outputs

Return Type Description
context handle struct llama_context * Opaque pointer to the inference context. Returns NULL if: model is NULL, both n_batch and n_ubatch are zero, both n_ctx and model training context are zero, V cache quantization is used without Flash Attention, or an internal allocation failure occurs.

Validation Rules

The function performs the following validation before creating the context:

  • model must not be NULL
  • n_batch and n_ubatch cannot both be zero
  • n_ctx and the model's n_ctx_train cannot both be zero
  • Quantized V cache types require Flash Attention to be enabled
  • Quantized K/V cache block sizes must evenly divide the head embedding dimensions
  • Flash Attention is forcibly disabled for the Grok architecture
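The first four checks above can be summarized as a standalone predicate. This is a sketch for illustration only; the actual checks live in the context constructor in src/llama-context.cpp and report specific error messages rather than a single boolean:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch of the documented validation rules; not part of
// the llama.cpp API.
static bool ctx_params_valid(const void * model,
                             uint32_t n_batch, uint32_t n_ubatch,
                             uint32_t n_ctx,  uint32_t n_ctx_train,
                             bool v_cache_quantized, bool flash_attn) {
    if (model == NULL)                    return false; // need a loaded model
    if (n_batch == 0 && n_ubatch == 0)    return false; // no usable batch size
    if (n_ctx == 0 && n_ctx_train == 0)   return false; // no usable context size
    if (v_cache_quantized && !flash_attn) return false; // quantized V needs FA
    return true;
}
```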

Usage Examples

From examples/simple/simple.cpp

// Initialize the context
llama_context_params ctx_params = llama_context_default_params();

// Set context size to fit the prompt plus predicted tokens
ctx_params.n_ctx = n_prompt + n_predict - 1;

// Set batch size to the prompt length for efficient prompt processing
ctx_params.n_batch = n_prompt;

// Enable performance counters
ctx_params.no_perf = false;

llama_context * ctx = llama_init_from_model(model, ctx_params);
if (ctx == NULL) {
    fprintf(stderr, "error: failed to create the llama_context\n");
    return 1;
}

Context with Quantized KV Cache

llama_context_params params = llama_context_default_params();
params.n_ctx = 8192;
params.type_k = GGML_TYPE_Q8_0;   // quantize K cache to Q8_0
params.type_v = GGML_TYPE_Q8_0;   // quantize V cache to Q8_0
// Note: a quantized V cache requires flash attention; flash_attn_type
// defaults to AUTO, which enables it when the backend supports it
// (otherwise context creation fails)

llama_context * ctx = llama_init_from_model(model, params);
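The motivation for quantizing the KV cache is memory: a back-of-envelope estimate is 2 (K and V) × n_ctx × n_layer × n_embd_kv × bytes per element, where GGML block-quantized types have fractional per-element cost (Q8_0 stores 32 values in 34 bytes, versus 2 bytes per value for F16). The helper and model dimensions below are illustrative assumptions, not llama.cpp API or any specific model:

```c
#include <assert.h>
#include <stdint.h>

// Rough KV cache size estimate (hypothetical helper, for illustration).
// block_bytes/block_elems expresses the per-element cost of the cache
// type: F16 is 2/1, GGML's Q8_0 is 34/32.
static uint64_t kv_cache_bytes(uint64_t n_ctx, uint64_t n_layer,
                               uint64_t n_embd_kv,
                               uint64_t block_bytes, uint64_t block_elems) {
    return 2 * n_ctx * n_layer * n_embd_kv * block_bytes / block_elems;
}
```

For an assumed 32-layer model with n_embd_kv = 4096 at n_ctx = 8192, this gives 4 GiB for an F16 cache versus about 2.1 GiB for Q8_0.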

Multiple Contexts Sharing a Model

// Create two independent inference contexts from the same model
llama_context_params params = llama_context_default_params();
params.n_ctx = 2048;

llama_context * ctx1 = llama_init_from_model(model, params);
llama_context * ctx2 = llama_init_from_model(model, params);

// ctx1 and ctx2 share model weights but have independent KV caches

llama_free(ctx1);
llama_free(ctx2);
// Model is still valid and can create more contexts
