Implementation: Ollama Llama Context

From Leeroopedia
Domains: Inference, Runtime
Last Updated: 2025-02-15 00:00 GMT

Overview

Implements the llama_context class, which manages the complete inference lifecycle: memory allocation, compute graph execution, batch decoding, state serialization, and embedding extraction.

Description

The constructor initializes context parameters (n_ctx, n_batch, n_ubatch, RoPE settings, pooling type, flash attention), creates the memory system (KV cache or recurrent state, depending on the architecture), reserves compute buffers, and sets up the backend scheduler. The decode method processes token batches by splitting them into micro-batches, building compute graphs via process_ubatch, running them through the backend scheduler, and extracting logits and embeddings. The class provides state save/load via I/O adapter classes for session persistence, and it manages LoRA adapter application, threadpool attachment, memory updates (defragmentation and optimization), and output extraction. It also implements encode for encoder-decoder models.
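
As a sketch of how a caller drives this decode lifecycle through the public wrappers in llama.h (the chunked feeding loop and the decode_prompt helper below are illustrative, not part of the API):

#include "llama.h"
#include <algorithm>
#include <vector>

// Illustrative helper: feed a prompt in chunks of at most n_batch tokens.
// llama_context further splits each chunk into micro-batches of n_ubatch
// internally before building and running the compute graphs.
static int decode_prompt(llama_context * ctx, std::vector<llama_token> & tokens, int n_batch) {
    for (size_t i = 0; i < tokens.size(); i += n_batch) {
        const int n_eval = (int) std::min(tokens.size() - i, (size_t) n_batch);
        llama_batch batch = llama_batch_get_one(tokens.data() + i, n_eval);
        if (llama_decode(ctx, batch) != 0) {
            return -1; // negative status: decode failed
        }
    }
    return 0;
}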

Usage

This is the central runtime component of llama.cpp. Every inference request flows through llama_context, making it the hub that connects the model, memory, compute backend, and user-facing API.

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/src/llama-context.cpp
  • Lines: 1-3056

Signature

llama_context::llama_context(
    const llama_model & model,
    llama_context_params params);

~llama_context();

int encode(const llama_batch & batch_inp);
int decode(const llama_batch & batch_inp);

void synchronize();

float * get_logits();
float * get_logits_ith(int32_t i);
float * get_embeddings();
float * get_embeddings_ith(int32_t i);

llm_graph_result * process_ubatch(
    const llama_ubatch & ubatch,
    llm_graph_type gtype,
    llama_memory_context_i * mctx,
    ggml_status & ret);

size_t state_get_size();
size_t state_get_data(uint8_t * dst, size_t size);
size_t state_set_data(const uint8_t * src, size_t size);
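
For encoder-decoder architectures, the encode entry point runs the encoder pass before any decoding. A minimal sketch using the public wrappers (the branch structure is illustrative):

// Encoder-decoder models (e.g. T5) run the input through encode first,
// which populates the cross-attention state, then generate via decode.
if (llama_model_has_encoder(model)) {
    if (llama_encode(ctx, batch) != 0) {
        // handle encoder failure
    }
    // decoding starts from the model's decoder start token
    llama_token dec_start = llama_model_decoder_start_token(model);
    llama_batch dec_batch = llama_batch_get_one(&dec_start, 1);
    llama_decode(ctx, dec_batch);
}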

Import

#include "llama-context.h"

I/O Contract

Inputs

Name        Type                   Required   Description
model       const llama_model &    Yes        Loaded model to create the context for
params      llama_context_params   Yes        Context parameters (n_ctx, n_batch, threads, etc.)
batch_inp   const llama_batch &    Yes        Input batch of tokens for encode/decode
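
A sketch of constructing a multi-token batch by hand with llama_batch_init (the values shown are illustrative; llama_batch_get_one is the simpler path for a single contiguous sequence):

// Allocate a batch with room for up to 512 tokens, token IDs (embd = 0),
// and one sequence ID per token.
llama_batch batch = llama_batch_init(512, 0, 1);
for (int i = 0; i < n_tokens; ++i) {
    batch.token[i]     = tokens[i];
    batch.pos[i]       = i;    // position within the sequence
    batch.n_seq_id[i]  = 1;
    batch.seq_id[i][0] = 0;    // all tokens belong to sequence 0
    batch.logits[i]    = (i == n_tokens - 1); // request logits only for the last token
}
batch.n_tokens = n_tokens;

llama_decode(ctx, batch);
llama_batch_free(batch);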

Outputs

Name         Type      Description
logits       float *   Logits array [n_vocab] for each output position
embeddings   float *   Embeddings array [n_embd] for each output position
status       int       0 on success, negative on failure
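
When the context is configured for embeddings, decode produces pooled embeddings rather than only logits. A sketch, assuming an embedding-capable model and mean pooling:

// Configure the context for embedding extraction before creation.
llama_context_params params = llama_context_default_params();
params.embeddings   = true;                    // produce embeddings on decode
params.pooling_type = LLAMA_POOLING_TYPE_MEAN; // one pooled vector per sequence

// ... create ctx, build and decode a batch for sequence 0 ...

// Pooled embedding for sequence 0: n_embd floats.
const int n_embd = llama_model_n_embd(model);
float * emb = llama_get_embeddings_seq(ctx, 0);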

Usage Examples

#include "llama-context.h"

// Create context (normally done via llama_init_from_model)
llama_context_params params = llama_context_default_params();
params.n_ctx = 4096;
params.n_batch = 512;
auto * ctx = llama_init_from_model(model, params);

// Decode tokens
llama_batch batch = llama_batch_get_one(tokens.data(), tokens.size());
int status = llama_decode(ctx, batch);

// Extract logits
float * logits = llama_get_logits_ith(ctx, -1);

// Save/restore state
size_t state_size = llama_state_get_size(ctx);
std::vector<uint8_t> state(state_size);
llama_state_get_data(ctx, state.data(), state_size);
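
Restoring the saved blob into a context created from the same model is the mirror operation (a sketch; the returned byte count should match the saved size):

// Restore previously saved state into a compatible context.
const size_t n_read = llama_state_set_data(ctx, state.data(), state.size());
if (n_read != state.size()) {
    // restore failed or the state blob was truncated
}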
