Implementation:Ollama Ollama Llama Context Header

From Leeroopedia
Knowledge Sources
Domains Inference, Runtime
Last Updated 2025-02-15 00:00 GMT

Overview

Header declaring the llama_context struct, the primary runtime state container for all inference operations in llama.cpp.

Description

Declares the llama_context struct with methods for initialization from model and parameters, synchronization, accessor methods (model, cparams, scheduler, dimensions), memory management (get_memory, memory_update), decoding (decode, encode), logits/embeddings extraction, threadpool management, LoRA adapter control, state save/load, performance tracking, and training support. Also defines llama_memory_breakdown_data for tracking memory usage across model, context, and compute buffers. Contains internal members for the batch allocator, compute graph results, backend scheduler, output buffers, and timing statistics.
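To make the memory accounting above concrete, here is a minimal, self-contained sketch of how per-buffer breakdown entries could be summed. It does not include the real header: breakdown_sketch is a stand-in that mirrors the three fields of llama_memory_breakdown_data, total() is assumed to be the plain sum of those fields, and sum_breakdown and the string keys are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Stand-in mirroring the fields of llama_memory_breakdown_data (not the real type).
struct breakdown_sketch {
    std::size_t model   = 0; // bytes used by model weights
    std::size_t context = 0; // bytes used by context state (e.g. KV cache)
    std::size_t compute = 0; // bytes used by compute buffers

    // Assumption: the total is the sum of the three categories.
    std::size_t total() const { return model + context + compute; }
};

// Hypothetical helper: accumulate usage across all buffer types.
// The string key stands in for a backend buffer type handle.
static breakdown_sketch sum_breakdown(const std::map<std::string, breakdown_sketch> & per_buffer) {
    breakdown_sketch sum;
    for (const auto & entry : per_buffer) {
        sum.model   += entry.second.model;
        sum.context += entry.second.context;
        sum.compute += entry.second.compute;
    }
    return sum;
}
```

Keeping the three categories separate until the final total() call makes it possible to report, for example, how much of a device's memory is taken by weights versus context state.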

Usage

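Include this header when working with llama_context internals. It lives under src/ rather than alongside the public API, so application code normally goes through the public llama API functions instead. Every public function that takes a llama_context * parameter operates on the struct defined here.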
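The relationship between the public C-style functions and this struct can be sketched as a thin forwarding layer. The example below is self-contained and uses stand-in names: context_sketch and sketch_n_ctx are hypothetical and only illustrate the pattern of a free function that takes a context pointer and delegates to a member function.

```cpp
#include <cstdint>

// Hypothetical stand-in for llama_context; the real struct holds far more state.
struct context_sketch {
    uint32_t ctx_len; // configured context length

    uint32_t n_ctx() const { return ctx_len; }
};

// Hypothetical C-style wrapper: takes a pointer to the context and
// forwards to the member accessor, which is the pattern described above.
uint32_t sketch_n_ctx(const context_sketch * ctx) {
    return ctx->n_ctx();
}
```

With this layering, callers of the C-style function never need the struct definition, only an opaque pointer to it.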

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/src/llama-context.h
  • Lines: 1-318

Signature

struct llama_memory_breakdown_data {
    size_t model   = 0;
    size_t context = 0;
    size_t compute = 0;
    size_t total() const;
};

struct llama_context {
    llama_context(const llama_model & model, llama_context_params params);
    ~llama_context();

    void synchronize();
    const llama_model & get_model() const;
    const llama_cparams & get_cparams() const;

    uint32_t n_ctx()     const;
    uint32_t n_batch()   const;
    uint32_t n_ubatch()  const;
    uint32_t n_seq_max() const;

    llama_memory_t get_memory() const;
    bool memory_update(bool optimize);

    float * get_logits();
    float * get_logits_ith(int32_t i);
    float * get_embeddings();

    int encode(const llama_batch & batch_inp);
    int decode(const llama_batch & batch_inp);

    void set_adapter_lora(llama_adapter_lora * adapter, float scale);
    bool rm_adapter_lora(llama_adapter_lora * adapter);

    llama_perf_context_data perf_get_data() const;
    void perf_reset();
};

Import

#include "llama-context.h"

I/O Contract

Inputs

Name     Type                   Required  Description
model    const llama_model &    Yes       The loaded model
params   llama_context_params   Yes       Context configuration (n_ctx, threads, etc.)
adapter  llama_adapter_lora *   No        LoRA adapter to attach
scale    float                  No        Scale factor for LoRA adapter

Outputs

Name        Type                     Description
logits      float *                  Logits for the batch positions that requested output
embeddings  float *                  Embeddings for the batch positions that requested output
memory      llama_memory_t           Memory handle (KV cache or recurrent state)
perf_data   llama_perf_context_data  Performance timing data
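The logits pointer is easiest to reason about as a flat, row-major buffer with one row of vocabulary scores per output position. The sketch below is a simplified, self-contained model of that layout and is not the real implementation: logits_sketch and row are hypothetical names, and the real get_logits_ith also maps batch indices to output rows, which is omitted here. The negative-index convention (-1 is the last row) is treated as an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical model of a logits buffer: n_outputs rows of n_vocab floats each.
struct logits_sketch {
    int32_t n_vocab;
    std::vector<float> data; // row-major, size = n_outputs * n_vocab

    int32_t n_outputs() const {
        return n_vocab > 0 ? static_cast<int32_t>(data.size() / n_vocab) : 0;
    }

    // Return a pointer to row i. Negative i counts from the end (assumption).
    float * row(int32_t i) {
        const int32_t n = n_outputs();
        const int32_t j = i < 0 ? n + i : i;
        if (j < 0 || j >= n) {
            throw std::out_of_range("logits row index out of range");
        }
        return data.data() + static_cast<std::size_t>(j) * n_vocab;
    }
};
```

Bounds checking before computing the offset keeps an invalid index from turning into an out-of-bounds pointer.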

Usage Examples

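#include <cstdio>

#include "llama-context.h"

// inspect_context is an illustrative wrapper name, not part of the header.
// It assumes ctx points to an already constructed llama_context.
static void inspect_context(llama_context * ctx) {
    // Access context properties
    const uint32_t ctx_size   = ctx->n_ctx();
    const uint32_t batch_size = ctx->n_batch();
    printf("n_ctx: %u, n_batch: %u\n", ctx_size, batch_size);

    // Get memory breakdown, keyed by buffer type
    const auto breakdown = ctx->memory_breakdown();
    for (const auto & [buft, data] : breakdown) {
        (void) buft; // key unused in this example
        printf("model: %zu, context: %zu, compute: %zu\n",
               data.model, data.context, data.compute);
    }

    // Performance tracking: read the counters, then reset them
    const auto perf = ctx->perf_get_data();
    (void) perf; // inspect the fields of llama_perf_context_data as needed
    ctx->perf_reset();
}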
