| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| ggml-org/llama.cpp | Inference Context Initialization, KV Cache Allocation, Compute Graph Scheduling | 2026-02-14 |
## Overview

### Description

`llama_init_from_model` creates an inference context from a previously loaded model. The context encapsulates all mutable runtime state needed for inference: the KV cache, batch processing buffers, backend scheduler, thread configuration, and performance counters. The returned context handle is used for all subsequent decode, encode, and sampling operations.

Multiple contexts can be created from the same model, allowing concurrent inference sessions that share model weights but maintain independent KV caches and generation state.
### Usage

```c
#include "llama.h"

// Assume model is already loaded
llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_ctx   = 4096;  // context window of 4096 tokens
ctx_params.n_batch = 2048;  // process up to 2048 tokens per decode call
ctx_params.no_perf = false; // enable performance counters

llama_context * ctx = llama_init_from_model(model, ctx_params);
if (ctx == NULL) {
    fprintf(stderr, "Failed to create context\n");
    return 1;
}

// Use ctx for decode, sampling, etc.

// Free when done (does not free the model)
llama_free(ctx);
```
## Code Reference

### Source Location

| File | Line(s) | Type |
|---|---|---|
| include/llama.h | 470-472 | Declaration |
| src/llama-context.cpp | 2973-3022 | Implementation |
### Signature

```c
LLAMA_API struct llama_context * llama_init_from_model(
             struct llama_model * model,
      struct llama_context_params params);
```
### Import
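The entire public API, including `llama_init_from_model` and `llama_context_params`, is declared in a single header:

```c
#include "llama.h"
```

The program must also be linked against the llama library (e.g. `-lllama` when using the CMake-installed build; the exact link flags depend on how llama.cpp was built).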
## I/O Contract

### Inputs

| Parameter | Type | Description |
|---|---|---|
| `model` | `struct llama_model *` | Pointer to a loaded model (from `llama_model_load_from_file`). Must not be NULL. |
| `params` | `struct llama_context_params` | Configuration struct controlling context behavior. Key fields documented below. |
`llama_context_params` fields (defined at include/llama.h:327-379):

| Field | Type | Default | Description |
|---|---|---|---|
| `n_ctx` | `uint32_t` | 0 (from model) | Text context size (maximum sequence length). 0 uses the model's training context size. |
| `n_batch` | `uint32_t` | 2048 | Logical maximum batch size that can be submitted to `llama_decode`. |
| `n_ubatch` | `uint32_t` | 512 | Physical maximum micro-batch size for internal processing. |
| `n_seq_max` | `uint32_t` | 1 | Maximum number of concurrent sequences. |
| `n_threads` | `int32_t` | `GGML_DEFAULT_N_THREADS` | Threads for single-token generation. |
| `n_threads_batch` | `int32_t` | `GGML_DEFAULT_N_THREADS` | Threads for batch (prompt) processing. |
| `rope_scaling_type` | `enum llama_rope_scaling_type` | from model | RoPE scaling type. |
| `pooling_type` | `enum llama_pooling_type` | from model | Pooling type for embedding results. |
| `attention_type` | `enum llama_attention_type` | from model | Attention type for embeddings. |
| `flash_attn_type` | `enum llama_flash_attn_type` | AUTO | When to enable Flash Attention. |
| `rope_freq_base` | `float` | 0 (from model) | RoPE base frequency. |
| `rope_freq_scale` | `float` | 0 (from model) | RoPE frequency scaling factor. |
| `yarn_ext_factor` | `float` | negative (from model) | YaRN extrapolation mix factor. |
| `yarn_attn_factor` | `float` | 0 | YaRN magnitude scaling factor. |
| `yarn_beta_fast` | `float` | 0 | YaRN low correction dim. |
| `yarn_beta_slow` | `float` | 0 | YaRN high correction dim. |
| `yarn_orig_ctx` | `uint32_t` | 0 | YaRN original context size. |
| `defrag_thold` | `float` | 0 | [DEPRECATED] KV cache defragmentation threshold. |
| `type_k` | `enum ggml_type` | `GGML_TYPE_F16` | Data type for the K cache. [EXPERIMENTAL] |
| `type_v` | `enum ggml_type` | `GGML_TYPE_F16` | Data type for the V cache. [EXPERIMENTAL] |
| `embeddings` | `bool` | false | Extract embeddings together with logits. |
| `offload_kqv` | `bool` | true | Offload KQV ops (including the KV cache) to GPU. |
| `no_perf` | `bool` | true | Disable performance timing measurements. |
| `op_offload` | `bool` | true | Offload host tensor operations to device. |
| `swa_full` | `bool` | true | Use full-size sliding window attention (SWA) cache. |
| `kv_unified` | `bool` | true | Use a unified buffer across input sequences for attention. |
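A sketch putting several of these fields together (the specific values below are illustrative, not recommendations; tune them to your workload and hardware):

```c
llama_context_params params = llama_context_default_params();

// Sequence and batching limits
params.n_ctx     = 8192; // max sequence length (0 would fall back to the model's training context)
params.n_batch   = 1024; // logical limit per llama_decode call
params.n_ubatch  = 256;  // physical micro-batch actually run through the compute graph
params.n_seq_max = 4;    // allow up to 4 concurrent sequences

// Threading: batch (prompt) processing typically benefits from more threads
// than single-token generation
params.n_threads       = 8;
params.n_threads_batch = 16;

// Keep perf counters enabled for profiling (no_perf defaults to true)
params.no_perf = false;

llama_context * ctx = llama_init_from_model(model, params);
```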
### Outputs

| Return | Type | Description |
|---|---|---|
| context handle | `struct llama_context *` | Opaque pointer to the inference context. Returns NULL if: `model` is NULL, both `n_batch` and `n_ubatch` are zero, both `n_ctx` and the model's training context are zero, V cache quantization is used without Flash Attention, or an internal allocation failure occurs. |
### Validation Rules

The function performs the following validation before creating the context:

- `model` must not be NULL
- `n_batch` and `n_ubatch` cannot both be zero
- `n_ctx` and the model's `n_ctx_train` cannot both be zero
- Quantized V cache types require Flash Attention to be enabled
- Quantized K/V cache block sizes must evenly divide the head embedding dimensions
- Flash Attention is forcibly disabled for the Grok architecture
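The quantized-V-cache rule can be exercised deliberately; a sketch (the `LLAMA_FLASH_ATTN_TYPE_DISABLED` constant is assumed from recent llama.h revisions and may differ in your version):

```c
// Violate the rule above: a quantized V cache without Flash Attention
// should make initialization fail and return NULL.
llama_context_params bad = llama_context_default_params();
bad.type_v          = GGML_TYPE_Q4_0;                 // quantized V cache...
bad.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_DISABLED; // ...but Flash Attention off

llama_context * ctx = llama_init_from_model(model, bad);
if (ctx == NULL) {
    // Expected: validation rejects this combination
    fprintf(stderr, "context creation failed as expected\n");
}
```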
## Usage Examples

### From examples/simple/simple.cpp

```c
// Initialize the context
llama_context_params ctx_params = llama_context_default_params();

// Set context size to fit the prompt plus predicted tokens
ctx_params.n_ctx = n_prompt + n_predict - 1;
// Set batch size to the prompt length for efficient prompt processing
ctx_params.n_batch = n_prompt;
// Enable performance counters
ctx_params.no_perf = false;

llama_context * ctx = llama_init_from_model(model, ctx_params);
if (ctx == NULL) {
    fprintf(stderr, "error: failed to create the llama_context\n");
    return 1;
}
```
### Context with Quantized KV Cache

```c
llama_context_params params = llama_context_default_params();
params.n_ctx  = 8192;
params.type_k = GGML_TYPE_Q8_0; // quantize K cache to Q8_0
params.type_v = GGML_TYPE_Q8_0; // quantize V cache to Q8_0
// Note: quantized V cache requires flash attention
// flash_attn_type defaults to AUTO, which will enable it

llama_context * ctx = llama_init_from_model(model, params);
```
### Multiple Contexts Sharing a Model

```c
// Create two independent inference contexts from the same model
llama_context_params params = llama_context_default_params();
params.n_ctx = 2048;

llama_context * ctx1 = llama_init_from_model(model, params);
llama_context * ctx2 = llama_init_from_model(model, params);

// ctx1 and ctx2 share model weights but have independent KV caches

llama_free(ctx1);
llama_free(ctx2);
// Model is still valid and can create more contexts
```
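Since contexts reference the model's weights, teardown order matters: free every context before freeing the model. A minimal sketch, assuming `llama_model_free` as the model destructor:

```c
// Correct teardown order: all contexts first, then the model
llama_free(ctx1);
llama_free(ctx2);
llama_model_free(model); // only after no contexts remain
```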
## Related Pages