Implementation: Ollama Llama Context
| Field | Value |
|---|---|
| Domains | Inference, Runtime |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the llama_context class, which manages the complete inference lifecycle including memory allocation, compute graph execution, batch decoding, state serialization, and embeddings extraction.
Description
The constructor initializes context parameters (n_ctx, n_batch, n_ubatch, RoPE settings, pooling type, flash attention), creates the memory system (KV cache or recurrent state, depending on architecture), reserves compute buffers, and sets up the backend scheduler. The decode method processes token batches by splitting them into micro-batches, building compute graphs via process_ubatch, running them through the backend scheduler, and extracting logits and embeddings. The class also provides state save/load through I/O adapter classes for session persistence; manages LoRA adapter application, threadpool attachment, memory updates (defragmentation and optimization), and output extraction; and implements encode for encoder-decoder models.
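As a sketch of how decode fits into a typical generation loop, the following uses the public wrappers from llama.h rather than llama_context directly. The sampler-chain functions (llama_sampler_chain_init, llama_sampler_sample) and llama_vocab_is_eog are from the current llama.cpp public API and may differ across versions; treat this as an illustrative assumption, not the documented implementation.

```cpp
#include "llama.h"
#include <vector>

// Sketch: each llama_decode call drives the context's decode path
// (micro-batch split -> graph build via process_ubatch -> backend scheduler).
void generate(llama_model * model, llama_context * ctx,
              std::vector<llama_token> prompt, int n_predict) {
    // Greedy sampler chain; llama.cpp also provides top-k/top-p/temperature samplers.
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    llama_batch batch = llama_batch_get_one(prompt.data(), (int32_t) prompt.size());
    for (int i = 0; i < n_predict; i++) {
        if (llama_decode(ctx, batch) != 0) break;              // negative status on failure
        llama_token tok = llama_sampler_sample(smpl, ctx, -1); // sample from last logits
        if (llama_vocab_is_eog(llama_model_get_vocab(model), tok)) break;
        batch = llama_batch_get_one(&tok, 1);                  // feed sampled token back in
    }
    llama_sampler_free(smpl);
}
```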
Usage
This is the central runtime component of llama.cpp. Every inference request flows through llama_context, making it the hub that connects the model, memory, compute backend, and user-facing API.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/llama-context.cpp
- Lines: 1-3056
Signature
llama_context::llama_context(
const llama_model & model,
llama_context_params params);
~llama_context();
int encode(const llama_batch & batch_inp);
int decode(const llama_batch & batch_inp);
void synchronize();
float * get_logits();
float * get_logits_ith(int32_t i);
float * get_embeddings();
float * get_embeddings_ith(int32_t i);
llm_graph_result * process_ubatch(
const llama_ubatch & ubatch,
llm_graph_type gtype,
llama_memory_context_i * mctx,
ggml_status & ret);
size_t state_get_size();
size_t state_get_data(uint8_t * dst, size_t size);
size_t state_set_data(const uint8_t * src, size_t size);
Import
#include "llama-context.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Loaded model to create context for |
| params | llama_context_params | Yes | Context parameters (n_ctx, n_batch, threads, etc.) |
| batch_inp | const llama_batch & | Yes | Input batch of tokens for encode/decode |
Outputs
| Name | Type | Description |
|---|---|---|
| logits | float * | Output logits array [n_vocab] for each output position |
| embeddings | float * | Output embeddings array [n_embd] for each output position |
| status | int | 0 on success, negative on failure |
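Beyond logits, the context can return embeddings when configured for it. A minimal sketch, assuming the public wrappers in llama.h; llama_get_embeddings_seq, LLAMA_POOLING_TYPE_MEAN, and llama_model_n_embd are taken from the current public API and may differ by version:

```cpp
// Sketch: extracting pooled embeddings instead of logits.
llama_context_params params = llama_context_default_params();
params.embeddings   = true;                     // enable embeddings output
params.pooling_type = LLAMA_POOLING_TYPE_MEAN;  // one pooled vector per sequence
llama_context * ctx = llama_init_from_model(model, params);

llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
if (llama_decode(ctx, batch) == 0) {
    // Pooled embedding for sequence 0; vector length is the model's n_embd
    float * emb  = llama_get_embeddings_seq(ctx, 0);
    int   n_embd = llama_model_n_embd(model);
}
```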
Usage Examples
// Note: the public entry points used below (llama_init_from_model,
// llama_decode, ...) are declared in llama.h; llama-context.h is the
// internal header implementing them.
#include "llama.h"
#include <vector>

// Create a context (normally done via llama_init_from_model)
llama_context_params params = llama_context_default_params();
params.n_ctx   = 4096;  // context window in tokens
params.n_batch = 512;   // max tokens per llama_decode call
llama_context * ctx = llama_init_from_model(model, params);

// Decode a batch of tokens (obtained earlier, e.g. via llama_tokenize)
llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
int status = llama_decode(ctx, batch);  // 0 on success, negative on failure

// Extract logits for the last output position
float * logits = llama_get_logits_ith(ctx, -1);

// Save state for session persistence
size_t state_size = llama_state_get_size(ctx);
std::vector<uint8_t> state(state_size);
llama_state_get_data(ctx, state.data(), state_size);
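The saved buffer can later be restored into a context created from the same model with the same parameters. A short sketch using the matching public API call:

```cpp
// Restore previously captured state; returns the number of bytes read
// from the buffer (should equal the buffer size on success).
size_t n_read = llama_state_set_data(ctx, state.data(), state.size());
if (n_read != state.size()) {
    // restore failed or the buffer does not match this context/model
}
```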