Implementation:Ollama Ollama Llama Public API

Knowledge Sources	Ollama
Domains	Inference, API
Last Updated	2025-02-15 00:00 GMT

Overview

llama.h is the primary public C API header of the llama.cpp library. It defines the data types and enumerations, and declares the functions, used for model loading, inference, tokenization, and sampling.

Description

Declares the core opaque types (llama_model, llama_context, llama_vocab, llama_sampler), type aliases (llama_token, llama_pos, llama_seq_id), and comprehensive enumerations for vocabulary types (SPM, BPE, WPM, UGM, RWKV), RoPE types, token types/attributes, model file types, split modes, and pooling types. Defines parameter structs (llama_model_params, llama_context_params, llama_batch). Declares the full API surface: backend init/free, model load/free, context creation, batch decode, tokenize/detokenize, sampler chain construction, state save/load, LoRA adapters, embeddings, and performance counters.
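
Because the handles are opaque and each is released through a paired free function, C++ callers often wrap them in an owning smart pointer. The following is a minimal sketch of that pattern, not code from the header: demo_handle, demo_create, demo_free, and live_handles_while_held are hypothetical stand-ins so the example compiles without the library. With the real API, the deleter would presumably forward to llama_model_free or llama_free instead.

```cpp
#include <memory>

// Hypothetical stand-in for an opaque C handle such as llama_model.
struct demo_handle { int id; };

static int g_live_handles = 0; // counts handles that have not been freed yet

// Hypothetical C-style create/free pair, mirroring the load/free shape of the API.
demo_handle * demo_create(int id) { ++g_live_handles; return new demo_handle{id}; }
void demo_free(demo_handle * h)   { if (h) { --g_live_handles; delete h; } }

// Deleter that forwards to the C-style free function.
struct demo_deleter {
    void operator()(demo_handle * h) const { demo_free(h); }
};

// Owning pointer: the handle is freed when the unique_ptr goes out of scope.
using demo_ptr = std::unique_ptr<demo_handle, demo_deleter>;

// Returns the number of live handles observed while one is held in scope.
int live_handles_while_held() {
    demo_ptr h(demo_create(1));
    return g_live_handles;
}
```

The design choice here is that ownership lives in the type, so early returns and exceptions cannot skip the free call.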

Usage

Every consumer of the llama.cpp library, including Ollama's Go bindings, example programs, and third-party integrations, includes this header. It defines the C API contract that those consumers compile and link against.

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/include/llama.h
Lines: 1-1449

Signature

// Core types
typedef int32_t llama_pos;
typedef int32_t llama_token;
typedef int32_t llama_seq_id;

// Model lifecycle
LLAMA_API struct llama_model * llama_model_load_from_file(const char * path_model,
                                                          struct llama_model_params params);
LLAMA_API void llama_model_free(struct llama_model * model);

// Context lifecycle
LLAMA_API struct llama_context * llama_init_from_model(struct llama_model * model,
                                                       struct llama_context_params params);
LLAMA_API void llama_free(struct llama_context * ctx);

// Inference
LLAMA_API int32_t llama_encode(struct llama_context * ctx, struct llama_batch batch);
LLAMA_API int32_t llama_decode(struct llama_context * ctx, struct llama_batch batch);

// Accessors
LLAMA_API uint32_t llama_n_ctx(const struct llama_context * ctx);
LLAMA_API float * llama_get_logits(struct llama_context * ctx);
LLAMA_API float * llama_get_embeddings(struct llama_context * ctx);

// Default parameters
LLAMA_API struct llama_model_params   llama_model_default_params(void);
LLAMA_API struct llama_context_params llama_context_default_params(void);
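
llama_encode and llama_decode report their outcome through the int32_t return value, so callers usually branch on it before reading logits. The sketch below assumes the convention described in the header comments at the time of writing (0 for success, a positive value for a recoverable warning such as no free KV cache slot, a negative value for an error). The decode_status enum and classify_decode_result are hypothetical helpers, not part of llama.h.

```cpp
#include <cstdint>

// Hypothetical status enum; not part of llama.h.
enum class decode_status { ok, retryable, fatal };

// Maps the int32_t return code of llama_decode to a coarse status.
// Assumed convention: 0 = success, positive = warning (for example,
// no free KV cache slot for the batch), negative = error.
decode_status classify_decode_result(int32_t rc) {
    if (rc == 0) return decode_status::ok;
    if (rc > 0)  return decode_status::retryable;
    return decode_status::fatal;
}
```

A caller might retry a retryable result with a smaller batch or a larger context, and treat a fatal result as a reason to tear the context down.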

Import

#include "llama.h"

I/O Contract

Inputs

Name	Type	Required	Description
path_model	const char *	Yes	Path to the GGUF model file
params	llama_model_params	Yes	Model loading parameters (GPU layers, split mode, etc.)
batch	llama_batch	Yes	Input batch of tokens/embeddings for encode/decode

Outputs

Name	Type	Description
model	llama_model *	Loaded model handle
ctx	llama_context *	Inference context handle
logits	float *	Pointer to the output logits: n_vocab floats for each token in the batch that requested output
embeddings	float *	Pointer to the output embeddings: n_embd floats for each output token
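
To illustrate how the logits output is consumed, the sketch below picks the highest-scoring token id from one row of n_vocab floats (greedy selection). It is a hand-written illustration, and argmax_token is a hypothetical helper name. Real callers would more likely use the sampler API that the header declares.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the index of the largest value in logits[0..n_vocab), i.e. the
// greedy next-token id; returns -1 if the input is null or empty.
// On ties, the lowest index wins.
int32_t argmax_token(const float * logits, size_t n_vocab) {
    if (logits == nullptr || n_vocab == 0) return -1;
    size_t best = 0;
    for (size_t i = 1; i < n_vocab; ++i) {
        if (logits[i] > logits[best]) best = i;
    }
    return static_cast<int32_t>(best);
}
```

The function only reads the array, so it can be applied directly to the pointer returned by llama_get_logits while the context is still alive.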

Usage Examples

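#include "llama.h"

// tokens / n_tokens: prompt token ids prepared by the caller (for example with llama_tokenize)
int run_once(llama_token * tokens, int32_t n_tokens) {
    // Initialize backend (once per process)
    llama_backend_init();

    // Load model; a null return means the load failed
    auto mparams = llama_model_default_params();
    mparams.n_gpu_layers = 35; // number of layers to offload to the GPU
    auto * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == nullptr) {
        llama_backend_free();
        return 1;
    }

    // Create context; a null return means creation failed
    auto cparams = llama_context_default_params();
    cparams.n_ctx = 4096; // context window, in tokens
    auto * ctx = llama_init_from_model(model, cparams);
    if (ctx == nullptr) {
        llama_model_free(model);
        llama_backend_free();
        return 1;
    }

    // Decode a batch; llama_decode returns 0 on success
    llama_batch batch = llama_batch_get_one(tokens, n_tokens);
    int rc = 1;
    if (llama_decode(ctx, batch) == 0) {
        // Get logits; the pointer is owned by ctx and is invalid after llama_free
        float * logits = llama_get_logits(ctx);
        (void) logits; // sample the next token from these values here
        rc = 0;
    }

    // Cleanup, in reverse order of creation
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return rc;
}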

Related Pages

Principle:Ollama_Ollama_Llama_Cpp_Integration
