Implementation:Ollama Ollama Llama Context Header

Knowledge Sources	Ollama
Domains	Inference, Runtime
Last Updated	2025-02-15 00:00 GMT

Overview

Header declaring the llama_context struct, the primary runtime state container for inference operations in llama.cpp.

Description

Declares the llama_context struct. Its methods cover construction from a model and context parameters, synchronization, accessors (model, cparams, scheduler, dimensions), memory management (get_memory, memory_update), encoding and decoding (encode, decode), logits and embeddings extraction, threadpool management, LoRA adapter control, state save/load, performance tracking, and training support. The header also defines llama_memory_breakdown_data, which tracks memory usage across model, context, and compute buffers. Internal members hold the batch allocator, compute graph results, backend scheduler, output buffers, and timing statistics.
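
As a rough illustration of the model/context/compute accounting described above, the sketch below mirrors llama_memory_breakdown_data with a stand-in struct. The body of total() (a plain sum of the three fields), the std::string map key, and the helper sum_breakdown are assumptions made for this example; none of them is taken from the header itself.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Stand-in for llama_memory_breakdown_data (illustration only).
struct breakdown_sketch {
    std::size_t model   = 0; // bytes attributed to the model
    std::size_t context = 0; // bytes attributed to the context
    std::size_t compute = 0; // bytes attributed to compute buffers

    // Assumption: the total is the plain sum of the three categories.
    std::size_t total() const { return model + context + compute; }
};

// Hypothetical helper: add up the totals of every entry in a breakdown map.
// The std::string key stands in for whatever buffer identifier the real map uses.
inline std::size_t sum_breakdown(const std::map<std::string, breakdown_sketch> & breakdown) {
    std::size_t sum = 0;
    for (const auto & entry : breakdown) {
        sum += entry.second.total();
    }
    return sum;
}
```

Keeping the three categories separate until the final sum makes it possible to report both a per-buffer view and an overall figure from the same data.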

Usage

Include this header when working with the llama_context internals. All public llama API functions that take a llama_context* parameter operate on the struct defined here.
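
The relationship between pointer-taking API functions and the struct can be pictured as thin forwarding wrappers. The sketch below uses stand-in names (context_sketch, sketch_n_ctx, sketch_n_batch) rather than real llama.cpp symbols, and assumes the wrappers simply delegate to member functions; it illustrates the pattern and is not the actual implementation.

```cpp
#include <cstdint>

// Stand-in for llama_context (illustration only; not the real struct).
struct context_sketch {
    context_sketch(uint32_t n_ctx, uint32_t n_batch) : n_ctx_(n_ctx), n_batch_(n_batch) {}

    uint32_t n_ctx()   const { return n_ctx_; }
    uint32_t n_batch() const { return n_batch_; }

private:
    uint32_t n_ctx_;
    uint32_t n_batch_;
};

// Hypothetical free functions taking a pointer, in the style of a C API.
// Each one forwards to the corresponding member function.
inline uint32_t sketch_n_ctx(const context_sketch * ctx)   { return ctx->n_ctx(); }
inline uint32_t sketch_n_batch(const context_sketch * ctx) { return ctx->n_batch(); }
```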

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/src/llama-context.h
Lines: 1-318

Signature

struct llama_memory_breakdown_data {
    size_t model   = 0;
    size_t context = 0;
    size_t compute = 0;
    size_t total() const;
};

struct llama_context {
    llama_context(const llama_model & model, llama_context_params params);
    ~llama_context();

    void synchronize();
    const llama_model & get_model() const;
    const llama_cparams & get_cparams() const;

    uint32_t n_ctx()     const;
    uint32_t n_batch()   const;
    uint32_t n_ubatch()  const;
    uint32_t n_seq_max() const;

    llama_memory_t get_memory() const;
    bool memory_update(bool optimize);

    float * get_logits();
    float * get_logits_ith(int32_t i);
    float * get_embeddings();

    int encode(const llama_batch & batch_inp);
    int decode(const llama_batch & batch_inp);

    void set_adapter_lora(llama_adapter_lora * adapter, float scale);
    bool rm_adapter_lora(llama_adapter_lora * adapter);

    llama_perf_context_data perf_get_data() const;
    void perf_reset();
};

Import

#include "llama-context.h"

I/O Contract

Inputs

Name	Type	Required	Description
model	const llama_model &	Yes	The loaded model
params	llama_context_params	Yes	Context configuration (n_ctx, threads, etc.)
adapter	llama_adapter_lora *	No	LoRA adapter to attach or remove (set_adapter_lora, rm_adapter_lora)
scale	float	No	Scale factor applied to the LoRA adapter (set_adapter_lora)

Outputs

Name	Type	Description
logits	float *	Logits for the batch positions that produced output (get_logits, get_logits_ith)
embeddings	float *	Embeddings for the batch positions that produced output (get_embeddings)
memory	llama_memory_t	Memory handle (KV cache or recurrent state)
perf_data	llama_perf_context_data	Performance timing data
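
One common way to consume a float * logits buffer is to treat it as a flat row of scores and pick the highest one. The sketch below shows such a greedy argmax; the function name argmax_logits and the n_vocab parameter are assumptions for this example, since the vocabulary size comes from the model and is not part of the signature excerpt.

```cpp
#include <cstdint>

// Hypothetical helper: index of the largest value in a logits row.
// logits must point to at least n_vocab floats.
// Returns -1 when the pointer is null or n_vocab is not positive.
inline int32_t argmax_logits(const float * logits, int32_t n_vocab) {
    if (logits == nullptr || n_vocab <= 0) {
        return -1;
    }
    int32_t best = 0;
    for (int32_t i = 1; i < n_vocab; ++i) {
        // Strict comparison keeps the first index when values tie.
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}
```

With a real context, the input would be a pointer such as the one returned by get_logits_ith(i), checked for null before use.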

Usage Examples

#include <cstdio>

#include "llama-context.h"

// ctx is assumed to be a valid llama_context * created elsewhere

// Access context properties
uint32_t ctx_size   = ctx->n_ctx();
uint32_t batch_size = ctx->n_batch();

// Get the memory breakdown; each entry pairs a buffer type (buft) with its byte counts
auto breakdown = ctx->memory_breakdown();
for (auto & [buft, data] : breakdown) {
    printf("model: %zu, context: %zu, compute: %zu\n",
           data.model, data.context, data.compute);
}

// Read the performance counters, then reset them
auto perf = ctx->perf_get_data();
ctx->perf_reset();

Related Pages

Principle:Ollama_Ollama_Inference_Pipeline
