Implementation:Ollama Ollama Llama Model
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Model Loading |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the llama_model class, including model loading from GGUF files, hyperparameter initialization, tensor creation, buffer allocation, and memory system initialization for all supported architectures.
Description
The load_hparams function reads hyperparameters from GGUF metadata for each architecture. load_tensors creates the model's ggml tensors, allocates backend buffers (offloading layers to the GPU according to n_gpu_layers), and loads the tensor data. init_memory creates the memory system that matches the model architecture: a KV cache, an ISWA cache, recurrent memory, or hybrid memory. The file also contains the per-architecture tensor layout definitions and the logic that assigns layers to devices in multi-GPU setups.
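The layer offloading described above can be illustrated with a small, self-contained sketch. It is not the code from llama-model.cpp: the function name assign_layers, the gpu_weights parameter, and the use of -1 to mean "CPU" are assumptions made for illustration, and the real implementation also handles the output layer and works with ggml backend devices rather than integer indices. The sketch assumes that the last n_gpu_layers layers are offloaded and that they are distributed across GPUs in proportion to per-device weights.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: returns one entry per layer, -1 for CPU or a GPU index.
// Assumption: the LAST n_gpu_layers layers are offloaded; negative values are
// treated as 0 in this sketch.
std::vector<int> assign_layers(int n_layer, int n_gpu_layers,
                               const std::vector<float> & gpu_weights) {
    assert(n_layer >= 0);
    std::vector<int> device(n_layer, -1); // start with every layer on the CPU

    const float total = std::accumulate(gpu_weights.begin(), gpu_weights.end(), 0.0f);
    if (gpu_weights.empty() || total <= 0.0f) {
        return device; // no usable GPU: nothing is offloaded
    }

    // Cumulative, normalized split points, e.g. weights {1, 1} -> {0.5, 1.0}.
    std::vector<float> splits;
    float running = 0.0f;
    for (float w : gpu_weights) {
        running += w;
        splits.push_back(running / total);
    }

    // First offloaded layer; requesting more layers than exist offloads all of them.
    const int i_gpu_start = std::max(n_layer - std::max(n_gpu_layers, 0), 0);
    const int n_offloaded = n_layer - i_gpu_start;
    const int last_gpu    = static_cast<int>(splits.size()) - 1;

    for (int il = i_gpu_start; il < n_layer; ++il) {
        // Relative position of this layer among the offloaded layers, in [0, 1).
        const float pos = static_cast<float>(il - i_gpu_start) / n_offloaded;
        // The first split point strictly greater than pos selects the GPU.
        const int gpu = static_cast<int>(
            std::upper_bound(splits.begin(), splits.end(), pos) - splits.begin());
        device[il] = std::min(gpu, last_gpu);
    }
    return device;
}
```

With four layers, two of them offloaded, and two equally weighted GPUs, the first two layers stay on the CPU and the remaining two land on GPU 0 and GPU 1 respectively.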
Usage
This is one of the largest files in the llama.cpp source (roughly 8,000 lines in this snapshot) and the single place where all model architectures are loaded and initialized. Every model that Ollama serves is loaded through this file's logic.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/llama-model.cpp
- Lines: 1-8060
Signature
const char * llm_type_name(llm_type type);
static const char * llama_expert_gating_func_name(llama_expert_gating_func_type type);
// Key model operations (defined in llama-model.h, implemented here):
// llama_model::load_hparams(llama_model_loader & ml)
// llama_model::load_tensors(llama_model_loader & ml)
// llama_model::init_memory(const llama_context_params & cparams)
// llama_model::build_graph(const llm_graph_params & params)
Import
#include "llama-model.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ml | llama_model_loader & | Yes | Model loader with GGUF data |
| cparams | const llama_context_params & | Yes | Context parameters for memory init |
| params | const llm_graph_params & | Yes | Graph build parameters |
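To make the role of the ml input more concrete, the following sketch shows how hyperparameters could be read from architecture-prefixed GGUF metadata keys such as llama.block_count. The metadata alias, hparams_sketch, get_required, and load_hparams_sketch are hypothetical names introduced here; the real loader reads typed GGUF key/value pairs through llama_model_loader and fills many more fields. Only the key suffixes (context_length, embedding_length, block_count) come from the GGUF metadata convention.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for GGUF metadata; real files hold typed key/value pairs.
using metadata = std::map<std::string, uint32_t>;

// Hypothetical subset of the hyperparameters a model needs.
struct hparams_sketch {
    uint32_t n_ctx_train;
    uint32_t n_embd;
    uint32_t n_layer;
};

// Look up a key that must be present; a missing key is a load error.
uint32_t get_required(const metadata & kv, const std::string & key) {
    const auto it = kv.find(key);
    if (it == kv.end()) {
        throw std::runtime_error("key not found in model: " + key);
    }
    return it->second;
}

// Read the hyperparameters for one architecture, e.g. arch = "llama".
hparams_sketch load_hparams_sketch(const metadata & kv, const std::string & arch) {
    assert(!arch.empty());
    hparams_sketch hp;
    hp.n_ctx_train = get_required(kv, arch + ".context_length");
    hp.n_embd      = get_required(kv, arch + ".embedding_length");
    hp.n_layer     = get_required(kv, arch + ".block_count");
    return hp;
}
```

Prefixing every key with the architecture name is what lets one loading routine serve many architectures: the same lookup code runs for each model, and only the prefix changes.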
Outputs
| Name | Type | Description |
|---|---|---|
| hparams | llama_hparams | Loaded model hyperparameters |
| vocab | llama_vocab | Loaded vocabulary |
| layers | std::vector<llama_layer> | Per-layer tensor structure |
| memory | llama_memory_ptr | Initialized memory system |
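The memory output depends on the model architecture. The sketch below illustrates one plausible way to choose among the four memory systems named in the Description. The arch_traits struct, the memory_kind enum, and the order of the checks are assumptions for illustration, not the actual dispatch logic in llama-model.cpp.

```cpp
#include <cassert>

// Hypothetical tags for the four memory systems named in this page.
enum class memory_kind { kv_cache, kv_cache_iswa, recurrent, hybrid };

// Hypothetical per-architecture properties used to pick a memory system.
struct arch_traits {
    bool is_recurrent; // state-space / recurrent layers only
    bool is_hybrid;    // mixes attention layers with recurrent layers
    bool uses_swa;     // uses sliding-window attention
};

// Pick the memory system for an architecture.
// Assumption: hybrid and recurrent models take precedence over the
// sliding-window check, and everything else gets a plain KV cache.
memory_kind select_memory(const arch_traits & t) {
    if (t.is_hybrid)    return memory_kind::hybrid;
    if (t.is_recurrent) return memory_kind::recurrent;
    if (t.uses_swa)     return memory_kind::kv_cache_iswa;
    return memory_kind::kv_cache;
}

// Small usage example: each call checks one of the four outcomes.
void check_select_memory() {
    assert(select_memory({false, false, false}) == memory_kind::kv_cache);
    assert(select_memory({false, false, true})  == memory_kind::kv_cache_iswa);
    assert(select_memory({true,  false, false}) == memory_kind::recurrent);
    assert(select_memory({false, true,  true})  == memory_kind::hybrid);
}
```

Keeping the decision in one function means callers only see the resulting memory object and do not need to know which architecture produced it.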
Usage Examples
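// Model loading is triggered by the public API:
llama_model_params params = llama_model_default_params();
llama_model * model = llama_model_load_from_file("model.gguf", params);
if (model == nullptr) {
    // Loading failed (missing file, unsupported architecture, ...)
}
// Internally this calls:
// model->load_hparams(ml);
// model->load_tensors(ml);
// After context creation: model->init_memory(cparams);
// Release the model when it is no longer needed:
llama_model_free(model);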