| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| ggml-org/llama.cpp | Inference Context Initialization, KV Cache Allocation, Compute Graph Scheduling | 2026-02-14 |
## Overview

### Description

`llama_init_from_model` creates an inference context from a previously loaded model. The context encapsulates all mutable runtime state needed for inference: the KV cache, batch processing buffers, backend scheduler, thread configuration, and performance counters. The returned context handle is used for all subsequent decode, encode, and sampling operations.

Multiple contexts can be created from the same model, allowing concurrent inference sessions that share model weights but maintain independent KV caches and generation state.
### Usage

```c
#include "llama.h"

// Assume model is already loaded
llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_ctx   = 4096;  // context window of 4096 tokens
ctx_params.n_batch = 2048;  // process up to 2048 tokens per decode call
ctx_params.no_perf = false; // enable performance counters

llama_context * ctx = llama_init_from_model(model, ctx_params);
if (ctx == NULL) {
    fprintf(stderr, "Failed to create context\n");
    return 1;
}

// Use ctx for decode, sampling, etc.

// Free when done (does not free the model)
llama_free(ctx);
```
## Code Reference

### Source Location

| File | Line(s) | Type |
|---|---|---|
| include/llama.h | 470-472 | Declaration |
| src/llama-context.cpp | 2973-3022 | Implementation |
### Signature

```c
LLAMA_API struct llama_context * llama_init_from_model(
             struct llama_model * model,
      struct llama_context_params params);
```
### Import
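The entire public API, including `llama_init_from_model` and `llama_context_params`, is declared in a single header:

```c
#include "llama.h"
```

The program must also be linked against the llama library (e.g. `-lllama` when using the CMake-installed build; the exact link flags depend on how llama.cpp was built).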
## I/O Contract

### Inputs

| Parameter | Type | Description |
|---|---|---|
| `model` | `struct llama_model *` | Pointer to a loaded model (from `llama_model_load_from_file`). Must not be NULL. |
| `params` | `struct llama_context_params` | Configuration struct controlling context behavior. Key fields documented below. |
`llama_context_params` fields (defined at include/llama.h:327-379):

| Field | Type | Default | Description |
|---|---|---|---|
| `n_ctx` | `uint32_t` | 0 (from model) | Text context size (maximum sequence length). 0 uses the model's training context size. |
| `n_batch` | `uint32_t` | 2048 | Logical maximum batch size that can be submitted to `llama_decode`. |
| `n_ubatch` | `uint32_t` | 512 | Physical maximum micro-batch size for internal processing. |
| `n_seq_max` | `uint32_t` | 1 | Maximum number of concurrent sequences. |
| `n_threads` | `int32_t` | `GGML_DEFAULT_N_THREADS` | Threads for single-token generation. |
| `n_threads_batch` | `int32_t` | `GGML_DEFAULT_N_THREADS` | Threads for batch (prompt) processing. |
| `rope_scaling_type` | `enum llama_rope_scaling_type` | from model | RoPE scaling type. |
| `pooling_type` | `enum llama_pooling_type` | from model | Pooling type for embedding results. |
| `attention_type` | `enum llama_attention_type` | from model | Attention type for embeddings. |
| `flash_attn_type` | `enum llama_flash_attn_type` | AUTO | When to enable Flash Attention. |
| `rope_freq_base` | `float` | 0 (from model) | RoPE base frequency. |
| `rope_freq_scale` | `float` | 0 (from model) | RoPE frequency scaling factor. |
| `yarn_ext_factor` | `float` | negative (from model) | YaRN extrapolation mix factor. |
| `yarn_attn_factor` | `float` | 0 | YaRN magnitude scaling factor. |
| `yarn_beta_fast` | `float` | 0 | YaRN low correction dim. |
| `yarn_beta_slow` | `float` | 0 | YaRN high correction dim. |
| `yarn_orig_ctx` | `uint32_t` | 0 | YaRN original context size. |
| `defrag_thold` | `float` | 0 | [DEPRECATED] KV cache defragmentation threshold. |
| `type_k` | `enum ggml_type` | `GGML_TYPE_F16` | Data type for the K cache. [EXPERIMENTAL] |
| `type_v` | `enum ggml_type` | `GGML_TYPE_F16` | Data type for the V cache. [EXPERIMENTAL] |
| `embeddings` | `bool` | false | Extract embeddings together with logits. |
| `offload_kqv` | `bool` | true | Offload KQV ops (including the KV cache) to GPU. |
| `no_perf` | `bool` | true | Disable performance timing measurements. |
| `op_offload` | `bool` | true | Offload host tensor operations to device. |
| `swa_full` | `bool` | true | Use full-size sliding window attention (SWA) cache. |
| `kv_unified` | `bool` | true | Use a unified buffer across input sequences for attention. |
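A sketch putting several of these fields together (the specific values below are illustrative, not recommendations; tune them to your workload and hardware):

```c
llama_context_params params = llama_context_default_params();

// Sequence and batching limits
params.n_ctx     = 8192; // max sequence length (0 would fall back to the model's training context)
params.n_batch   = 1024; // logical limit per llama_decode call
params.n_ubatch  = 256;  // physical micro-batch actually run through the compute graph
params.n_seq_max = 4;    // allow up to 4 concurrent sequences

// Threading: batch (prompt) processing typically benefits from more threads
// than single-token generation
params.n_threads       = 8;
params.n_threads_batch = 16;

// Keep perf counters enabled for profiling (no_perf defaults to true)
params.no_perf = false;

llama_context * ctx = llama_init_from_model(model, params);
```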
### Outputs

| Return | Type | Description |
|---|---|---|
| context handle | `struct llama_context *` | Opaque pointer to the inference context. Returns NULL if: `model` is NULL, both `n_batch` and `n_ubatch` are zero, both `n_ctx` and the model's training context are zero, V cache quantization is used without Flash Attention, or an internal allocation failure occurs. |
### Validation Rules

The function performs the following validation before creating the context:

- `model` must not be NULL
- `n_batch` and `n_ubatch` cannot both be zero
- `n_ctx` and the model's `n_ctx_train` cannot both be zero
- Quantized V cache types require Flash Attention to be enabled
- Quantized K/V cache block sizes must evenly divide the head embedding dimensions
- Flash Attention is forcibly disabled for the Grok architecture
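The quantized-V-cache rule can be exercised deliberately; a sketch (the `LLAMA_FLASH_ATTN_TYPE_DISABLED` constant is assumed from recent llama.h revisions and may differ in your version):

```c
// Violate the rule above: a quantized V cache without Flash Attention
// should make initialization fail and return NULL.
llama_context_params bad = llama_context_default_params();
bad.type_v          = GGML_TYPE_Q4_0;                 // quantized V cache...
bad.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_DISABLED; // ...but Flash Attention off

llama_context * ctx = llama_init_from_model(model, bad);
if (ctx == NULL) {
    // Expected: validation rejects this combination
    fprintf(stderr, "context creation failed as expected\n");
}
```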
## Usage Examples

### From examples/simple/simple.cpp

```c
// Initialize the context
llama_context_params ctx_params = llama_context_default_params();

// Set context size to fit the prompt plus predicted tokens
ctx_params.n_ctx = n_prompt + n_predict - 1;
// Set batch size to the prompt length for efficient prompt processing
ctx_params.n_batch = n_prompt;
// Enable performance counters
ctx_params.no_perf = false;

llama_context * ctx = llama_init_from_model(model, ctx_params);
if (ctx == NULL) {
    fprintf(stderr, "error: failed to create the llama_context\n");
    return 1;
}
```
### Context with Quantized KV Cache

```c
llama_context_params params = llama_context_default_params();
params.n_ctx  = 8192;
params.type_k = GGML_TYPE_Q8_0; // quantize K cache to Q8_0
params.type_v = GGML_TYPE_Q8_0; // quantize V cache to Q8_0
// Note: quantized V cache requires flash attention
// flash_attn_type defaults to AUTO, which will enable it

llama_context * ctx = llama_init_from_model(model, params);
```
### Multiple Contexts Sharing a Model

```c
// Create two independent inference contexts from the same model
llama_context_params params = llama_context_default_params();
params.n_ctx = 2048;

llama_context * ctx1 = llama_init_from_model(model, params);
llama_context * ctx2 = llama_init_from_model(model, params);

// ctx1 and ctx2 share model weights but have independent KV caches

llama_free(ctx1);
llama_free(ctx2);
// Model is still valid and can create more contexts
```
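Since contexts reference the model's weights, teardown order matters: free every context before freeing the model. A minimal sketch, assuming `llama_model_free` as the model destructor:

```c
// Correct teardown order: all contexts first, then the model
llama_free(ctx1);
llama_free(ctx2);
llama_model_free(model); // only after no contexts remain
```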
## Related Pages