Implementation:Ollama Ollama Llama Context Header
| Knowledge Sources | |
|---|---|
| Domains | Inference, Runtime |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Header declaring the llama_context struct, the primary runtime state container for inference operations in llama.cpp.
Description
Declares the llama_context struct, which groups the following methods:
- Construction from a model and context parameters, plus synchronization
- Accessors for the model, cparams, scheduler, and dimensions (n_ctx, n_batch, n_ubatch, n_seq_max)
- Memory management (get_memory, memory_update)
- Batch processing (encode, decode) and extraction of logits and embeddings
- Threadpool management and LoRA adapter control
- State save/load, performance tracking, and training support
The header also defines llama_memory_breakdown_data, which tracks memory usage across model, context, and compute buffers. Internal members include the batch allocator, compute graph results, the backend scheduler, output buffers, and timing statistics.
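The three counters in llama_memory_breakdown_data can be illustrated with a small stand-in type. This is a minimal sketch, not the header's code: memory_breakdown_sketch and sum_breakdown are hypothetical names, the std::string key stands in for whatever buffer-type handle the real breakdown uses, and total() is assumed to be the plain sum of the three fields.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in mirroring the three counters of llama_memory_breakdown_data.
struct memory_breakdown_sketch {
    std::size_t model   = 0; // bytes attributed to model weights
    std::size_t context = 0; // bytes attributed to context state
    std::size_t compute = 0; // bytes attributed to compute buffers

    // Assumption: the total is simply the sum of the three counters.
    std::size_t total() const { return model + context + compute; }
};

// Fold per-buffer-type entries into one overall breakdown.
// The std::string key is illustrative only.
memory_breakdown_sketch sum_breakdown(
        const std::map<std::string, memory_breakdown_sketch> & per_buffer_type) {
    memory_breakdown_sketch sum;
    for (const auto & entry : per_buffer_type) {
        sum.model   += entry.second.model;
        sum.context += entry.second.context;
        sum.compute += entry.second.compute;
    }
    return sum;
}
```

Keeping the three categories separate until the final total() call makes it possible to report which category dominates on each device.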
Usage
Include this header when working with the llama_context internals. All public llama API functions that take a llama_context* parameter operate on the struct defined here.
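The relationship between pointer-taking API functions and the struct can be sketched as thin wrappers that forward to member functions. Everything here is hypothetical: context_sketch, sketch_n_ctx, and sketch_n_batch are illustrative names, and the real wrappers may do more than forward.

```cpp
#include <cstdint>

// Hypothetical stand-in for the real struct; only two accessors are modeled.
struct context_sketch {
    uint32_t ctx_size;
    uint32_t batch_size;

    uint32_t n_ctx()   const { return ctx_size; }
    uint32_t n_batch() const { return batch_size; }
};

// Hypothetical C-style wrappers: callers hold only a pointer,
// and each function forwards to the matching member function.
uint32_t sketch_n_ctx(const context_sketch * ctx) {
    return ctx->n_ctx();
}

uint32_t sketch_n_batch(const context_sketch * ctx) {
    return ctx->n_batch();
}
```

With this split, callers of the public functions never need the struct definition, while internal code that includes the header can call the methods directly.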
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/llama-context.h
- Lines: 1-318
Signature
struct llama_memory_breakdown_data {
    size_t model = 0;
    size_t context = 0;
    size_t compute = 0;
    size_t total() const;
};

struct llama_context {
    llama_context(const llama_model & model, llama_context_params params);
    ~llama_context();

    void synchronize();

    const llama_model & get_model() const;
    const llama_cparams & get_cparams() const;

    uint32_t n_ctx() const;
    uint32_t n_batch() const;
    uint32_t n_ubatch() const;
    uint32_t n_seq_max() const;

    llama_memory_t get_memory() const;
    bool memory_update(bool optimize);

    float * get_logits();
    float * get_logits_ith(int32_t i);
    float * get_embeddings();

    int encode(const llama_batch & batch_inp);
    int decode(const llama_batch & batch_inp);

    void set_adapter_lora(llama_adapter_lora * adapter, float scale);
    bool rm_adapter_lora(llama_adapter_lora * adapter);

    llama_perf_context_data perf_get_data() const;
    void perf_reset();
};
Import
#include "llama-context.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | The loaded model |
| params | llama_context_params | Yes | Context configuration (n_ctx, threads, etc.) |
| adapter | llama_adapter_lora * | No | LoRA adapter to attach |
| scale | float | No | Scale factor for LoRA adapter |
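The adapter and scale inputs suggest that the context keeps a per-adapter scale. The sketch below models that with a map. It is an assumption-laden illustration: adapter_sketch and lora_registry_sketch are hypothetical types, and the reading that rm_adapter_lora returns true only when the adapter was attached is inferred from its bool return type, not stated in this page.

```cpp
#include <unordered_map>

// Hypothetical stand-in for llama_adapter_lora.
struct adapter_sketch {};

// Hypothetical registry modeling how a context might track attached adapters.
struct lora_registry_sketch {
    std::unordered_map<adapter_sketch *, float> loras;

    // Attach the adapter, or update its scale if it is already attached.
    void set_adapter_lora(adapter_sketch * adapter, float scale) {
        loras[adapter] = scale;
    }

    // Detach the adapter; assumed to report whether it was attached.
    bool rm_adapter_lora(adapter_sketch * adapter) {
        return loras.erase(adapter) > 0;
    }
};
```

Keying the map by pointer means attaching the same adapter twice overwrites the scale instead of stacking the adapter.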
Outputs
| Name | Type | Description |
|---|---|---|
| logits | float * | Logits for the batch positions marked for output |
| embeddings | float * | Embeddings for the batch positions marked for output |
| memory | llama_memory_t | Memory handle (KV cache or recurrent state) |
| perf_data | llama_perf_context_data | Performance timing data |
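The difference between get_logits and get_logits_ith can be pictured as whole-buffer access versus row lookup. This is a simplified sketch under stated assumptions: logits_buffer_sketch is a hypothetical type, the buffer is assumed to hold one n_vocab-sized row per output, negative indices are assumed to count from the last row, and the index is treated as an output row directly, whereas the real method may first map a batch position to a row.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the context's logits output buffer.
struct logits_buffer_sketch {
    int32_t n_outputs;       // number of rows (one per output position)
    int32_t n_vocab;         // row width (one value per vocabulary entry)
    std::vector<float> data; // row-major, n_outputs * n_vocab values

    // Pointer to the start of the whole buffer.
    float * get_logits() { return data.data(); }

    // Pointer to row i; negative i counts from the end (assumption).
    float * get_logits_ith(int32_t i) {
        const int32_t row = i < 0 ? n_outputs + i : i;
        if (row < 0 || row >= n_outputs) {
            throw std::out_of_range("logits row index out of range");
        }
        return data.data() + static_cast<std::size_t>(row) * n_vocab;
    }
};
```

Under these assumptions, get_logits_ith(-1) yields the row for the last output, which is the row a sampler typically reads after a decode step.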
Usage Examples
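#include <cstdio>

#include "llama-context.h"

// report_context is an illustrative helper name; ctx must point to an initialized llama_context.
static void report_context(llama_context * ctx) {
    // Access context properties
    const uint32_t ctx_size   = ctx->n_ctx();
    const uint32_t batch_size = ctx->n_batch();
    printf("n_ctx: %u, n_batch: %u\n", ctx_size, batch_size);

    // Get memory breakdown (one entry per key of the returned map)
    auto breakdown = ctx->memory_breakdown();
    for (const auto & [buft, data] : breakdown) {
        (void) buft; // key is unused in this example
        printf("model: %zu, context: %zu, compute: %zu\n",
               data.model, data.context, data.compute);
    }

    // Performance tracking: read the counters, then reset them
    auto perf = ctx->perf_get_data();
    (void) perf; // inspect the fields as needed
    ctx->perf_reset();
}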
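Timing data such as that returned by perf_get_data is usually turned into a throughput figure. The following sketch shows that arithmetic on a stand-in struct. perf_data_sketch and its field names are assumptions made for illustration, so check the actual llama_perf_context_data definition before relying on them.

```cpp
#include <cstdint>

// Hypothetical stand-in for llama_perf_context_data; field names are assumed.
struct perf_data_sketch {
    double  t_p_eval_ms = 0.0; // time spent processing the prompt, in ms
    double  t_eval_ms   = 0.0; // time spent generating tokens, in ms
    int32_t n_p_eval    = 0;   // number of prompt tokens processed
    int32_t n_eval      = 0;   // number of tokens generated
};

// Convert a token count and an elapsed time in milliseconds to tokens per second.
// Returns 0 when no time has been recorded, to avoid dividing by zero.
double tokens_per_second(double elapsed_ms, int32_t n_tokens) {
    if (elapsed_ms <= 0.0) {
        return 0.0;
    }
    return 1000.0 * n_tokens / elapsed_ms;
}

// Generation throughput derived from the stand-in counters.
double generation_tps(const perf_data_sketch & perf) {
    return tokens_per_second(perf.t_eval_ms, perf.n_eval);
}
```

Reading the counters before calling perf_reset, as the example above does, keeps each measurement window separate.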