Implementation:Ollama Ollama Llama Model

Knowledge Sources	Ollama
Domains	LLM Inference, Model Loading
Last Updated	2025-02-15 00:00 GMT

Overview

Implements the llama_model class, including model loading from GGUF files, hyperparameter initialization, tensor creation, buffer allocation, and memory system initialization for all supported architectures.

Description

The load_hparams function reads hyperparameters from GGUF metadata for each architecture. load_tensors creates the model's ggml tensors, allocates backend buffers (offloading layers to the GPU according to n_gpu_layers), and loads the tensor data. init_memory creates the memory system appropriate to the model architecture: a KV cache, an ISWA cache, recurrent memory, or hybrid memory. The file also contains the per-architecture tensor layout definitions and the logic that assigns layers to devices when a model is split across multiple GPUs.
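The layer offloading described above can be illustrated with a small sketch. This is not the code from llama-model.cpp: the helper names first_gpu_layer and device_for_layer, the use of -1 for "CPU", and the cumulative split representation are assumptions made for illustration, and the sketch assumes n_gpu_layers is non-negative.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: index of the first repeating layer placed on a GPU.
// Layers [0, first) stay on the CPU; layers [first, n_layer) are offloaded.
// Assumes n_gpu_layers >= 0; values above n_layer offload every layer.
constexpr int first_gpu_layer(int n_layer, int n_gpu_layers) {
    return std::max(n_layer - n_gpu_layers, 0);
}

// Hypothetical helper: choose a device index for layer `il`.
// `splits` holds cumulative, normalized split points, for example
// {0.5f, 1.0f} for two equally weighted GPUs.
// Returns -1 (this sketch's marker for "CPU") when the layer is not offloaded.
inline int device_for_layer(int il, int n_layer, int n_gpu_layers,
                            const std::vector<float> & splits) {
    const int first = first_gpu_layer(n_layer, n_gpu_layers);
    if (il < first || il >= n_layer || splits.empty()) {
        return -1;
    }
    // Relative position of this layer among the offloaded layers, in [0, 1).
    const int   n_offloaded = n_layer - first;
    const float pos = float(il - first) / float(n_offloaded);
    // The first split point strictly greater than `pos` selects the device.
    const auto it  = std::upper_bound(splits.begin(), splits.end(), pos);
    const int  dev = int(it - splits.begin());
    // Clamp in case the last split point is below 1.0.
    return std::min(dev, int(splits.size()) - 1);
}
```

With 32 layers, n_gpu_layers set to 20, and two equally weighted devices, this sketch keeps layers 0 to 11 on the CPU and divides layers 12 to 31 between the two devices.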

Usage

This is one of the largest and most central files in the llama.cpp source vendored by Ollama. It is the single place where all model architectures are loaded and initialized, so every model that Ollama serves through llama.cpp is loaded through this file's logic.

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/src/llama-model.cpp
Lines: 1-8060

Signature

const char * llm_type_name(llm_type type);
static const char * llama_expert_gating_func_name(llama_expert_gating_func_type type);

// Key model operations (declared in llama-model.h, implemented here):
// llama_model::load_hparams(llama_model_loader & ml)
// llama_model::load_tensors(llama_model_loader & ml)
// llama_model::init_memory(const llama_context_params & cparams)
// llama_model::build_graph(const llm_graph_params & params)

Import

#include "llama-model.h"

I/O Contract

Inputs

Name	Type	Required	Description
ml	llama_model_loader &	Yes	Model loader with GGUF data
cparams	llama_context_params	Yes	Context parameters for memory init
params	llm_graph_params	Yes	Graph build parameters
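To make the role of the ml input more concrete, the following sketch shows how architecture-prefixed GGUF keys such as llama.block_count could be mapped onto hyperparameter fields. It is a simplified stand-in, not the real loader: metadata_t, sketch_hparams, get_key, and load_sketch_hparams are hypothetical names, and a std::map takes the place of llama_model_loader.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Stand-in for GGUF key/value metadata (assumption: only uint32 values).
using metadata_t = std::map<std::string, uint32_t>;

// Hypothetical subset of the hyperparameters a model needs.
struct sketch_hparams {
    uint32_t n_layer     = 0;
    uint32_t n_embd      = 0;
    uint32_t n_ctx_train = 0;
};

// Look up "<arch>.<suffix>"; returns std::nullopt when the key is absent.
inline std::optional<uint32_t> get_key(const metadata_t & md,
                                       const std::string & arch,
                                       const std::string & suffix) {
    const auto it = md.find(arch + "." + suffix);
    if (it == md.end()) {
        return std::nullopt;
    }
    return it->second;
}

// Fill `out` from the metadata; returns false if a required key is missing,
// in which case `out` is left unchanged.
inline bool load_sketch_hparams(const metadata_t & md,
                                const std::string & arch,
                                sketch_hparams & out) {
    const auto n_layer = get_key(md, arch, "block_count");
    const auto n_embd  = get_key(md, arch, "embedding_length");
    const auto n_ctx   = get_key(md, arch, "context_length");
    if (!n_layer || !n_embd || !n_ctx) {
        return false;
    }
    out.n_layer     = *n_layer;
    out.n_embd      = *n_embd;
    out.n_ctx_train = *n_ctx;
    return true;
}
```

Prefixing every key with the architecture name is what lets one loading routine serve many architectures: the same suffix resolves to a different key for each model family.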

Outputs

Name	Type	Description
hparams	llama_hparams	Loaded model hyperparameters
vocab	llama_vocab	Loaded vocabulary
layers	std::vector<llama_layer>	Per-layer tensor structure
memory	llama_memory_ptr	Initialized memory system
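The memory output is one of the four memory systems named in the Description. The sketch below shows one plausible way such a choice could be expressed. The memory_kind enum, the arch_traits struct, and the order of the checks are all assumptions for illustration; the actual init_memory logic may decide differently.

```cpp
// Hypothetical labels for the four memory systems named in the Description.
enum class memory_kind { kv_cache, iswa_cache, recurrent, hybrid };

// Hypothetical per-architecture traits. In the real code these would be
// derived from the model architecture and its hyperparameters.
struct arch_traits {
    bool has_recurrent_layers;  // e.g. state-space style layers
    bool has_attention_layers;  // standard attention layers
    bool uses_sliding_window;   // sliding-window attention on some layers
};

// Pick a memory system from the traits (assumed decision order).
constexpr memory_kind select_memory(const arch_traits & t) {
    if (t.has_recurrent_layers) {
        // Mixed recurrent and attention layers need both kinds of state.
        return t.has_attention_layers ? memory_kind::hybrid
                                      : memory_kind::recurrent;
    }
    // Attention-only models: a sliding window calls for the ISWA cache.
    return t.uses_sliding_window ? memory_kind::iswa_cache
                                 : memory_kind::kv_cache;
}
```

Keeping the decision in one function mirrors the idea that a single entry point chooses the memory system, so callers never need to know which architecture they are dealing with.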

Usage Examples

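// Model loading is triggered by the public API (declared in llama.h):
llama_model_params params = llama_model_default_params();
llama_model * model = llama_model_load_from_file("model.gguf", params);
// model is NULL if loading failed

// Internally this calls:
// model->load_hparams(ml);
// model->load_tensors(ml);
// After context creation: model->init_memory(cparams);

// Release the model when it is no longer needed:
llama_model_free(model);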

Related Pages

Principle:Ollama_Ollama_Model_Loading
