Implementation: Ollama Llama Context
| Field | Value |
|---|---|
| Domains | Inference, Runtime |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the llama_context class, which manages the complete inference lifecycle including memory allocation, compute graph execution, batch decoding, state serialization, and embeddings extraction.
Description
The constructor initializes context parameters (n_ctx, n_batch, n_ubatch, RoPE settings, pooling type, flash attention), creates the memory system (KV cache or recurrent state, depending on architecture), reserves compute buffers, and sets up the backend scheduler. The decode method processes token batches by splitting them into micro-batches, building compute graphs via process_ubatch, running them through the backend scheduler, and extracting logits and embeddings. The class also provides state save/load through I/O adapter classes for session persistence; manages LoRA adapter application, threadpool attachment, memory updates (defragmentation and optimization), and output extraction; and implements encode for encoder-decoder models.
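As a sketch of how decode fits into a typical generation loop, the following uses the public wrappers from llama.h rather than llama_context directly. The sampler-chain functions (llama_sampler_chain_init, llama_sampler_sample) and llama_vocab_is_eog are from the current llama.cpp public API and may differ across versions; treat this as an illustrative assumption, not the documented implementation.

```cpp
#include "llama.h"
#include <vector>

// Sketch: each llama_decode call drives the context's decode path
// (micro-batch split -> graph build via process_ubatch -> backend scheduler).
void generate(llama_model * model, llama_context * ctx,
              std::vector<llama_token> prompt, int n_predict) {
    // Greedy sampler chain; llama.cpp also provides top-k/top-p/temperature samplers.
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    llama_batch batch = llama_batch_get_one(prompt.data(), (int32_t) prompt.size());
    for (int i = 0; i < n_predict; i++) {
        if (llama_decode(ctx, batch) != 0) break;              // negative status on failure
        llama_token tok = llama_sampler_sample(smpl, ctx, -1); // sample from last logits
        if (llama_vocab_is_eog(llama_model_get_vocab(model), tok)) break;
        batch = llama_batch_get_one(&tok, 1);                  // feed sampled token back in
    }
    llama_sampler_free(smpl);
}
```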
Usage
This is the central runtime component of llama.cpp. Every inference request flows through llama_context, making it the hub that connects the model, memory, compute backend, and user-facing API.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/llama-context.cpp
- Lines: 1-3056
Signature
llama_context::llama_context(
const llama_model & model,
llama_context_params params);
~llama_context();
int encode(const llama_batch & batch_inp);
int decode(const llama_batch & batch_inp);
void synchronize();
float * get_logits();
float * get_logits_ith(int32_t i);
float * get_embeddings();
float * get_embeddings_ith(int32_t i);
llm_graph_result * process_ubatch(
const llama_ubatch & ubatch,
llm_graph_type gtype,
llama_memory_context_i * mctx,
ggml_status & ret);
size_t state_get_size();
size_t state_get_data(uint8_t * dst, size_t size);
size_t state_set_data(const uint8_t * src, size_t size);
Import
#include "llama-context.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Loaded model to create context for |
| params | llama_context_params | Yes | Context parameters (n_ctx, n_batch, threads, etc.) |
| batch_inp | const llama_batch & | Yes | Input batch of tokens for encode/decode |
Outputs
| Name | Type | Description |
|---|---|---|
| logits | float * | Output logits array [n_vocab] for each output position |
| embeddings | float * | Output embeddings array [n_embd] for each output position |
| status | int | 0 on success, negative on failure |
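Beyond logits, the context can return embeddings when configured for it. A minimal sketch, assuming the public wrappers in llama.h; llama_get_embeddings_seq, LLAMA_POOLING_TYPE_MEAN, and llama_model_n_embd are taken from the current public API and may differ by version:

```cpp
// Sketch: extracting pooled embeddings instead of logits.
llama_context_params params = llama_context_default_params();
params.embeddings   = true;                     // enable embeddings output
params.pooling_type = LLAMA_POOLING_TYPE_MEAN;  // one pooled vector per sequence
llama_context * ctx = llama_init_from_model(model, params);

llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
if (llama_decode(ctx, batch) == 0) {
    // Pooled embedding for sequence 0; vector length is the model's n_embd
    float * emb  = llama_get_embeddings_seq(ctx, 0);
    int   n_embd = llama_model_n_embd(model);
}
```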
Usage Examples
// Note: the public entry points used below (llama_init_from_model,
// llama_decode, ...) are declared in llama.h; llama-context.h is the
// internal header implementing them.
#include "llama.h"
#include <vector>

// Create a context (normally done via llama_init_from_model)
llama_context_params params = llama_context_default_params();
params.n_ctx   = 4096;  // context window in tokens
params.n_batch = 512;   // max tokens per llama_decode call
llama_context * ctx = llama_init_from_model(model, params);

// Decode a batch of tokens (obtained earlier, e.g. via llama_tokenize)
llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
int status = llama_decode(ctx, batch);  // 0 on success, negative on failure

// Extract logits for the last output position
float * logits = llama_get_logits_ith(ctx, -1);

// Save state for session persistence
size_t state_size = llama_state_get_size(ctx);
std::vector<uint8_t> state(state_size);
llama_state_get_data(ctx, state.data(), state_size);
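The saved buffer can later be restored into a context created from the same model with the same parameters. A short sketch using the matching public API call:

```cpp
// Restore previously captured state; returns the number of bytes read
// from the buffer (should equal the buffer size on success).
size_t n_read = llama_state_set_data(ctx, state.data(), state.size());
if (n_read != state.size()) {
    // restore failed or the buffer does not match this context/model
}
```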