Implementation:Ggml org Llama cpp Public API
| Knowledge Sources | |
|---|---|
| Domains | API, Core |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements the top-level public C API for llama.cpp, serving as the primary entry point that ties together model loading, context creation, inference, and resource management.
Description
This file delegates to internal subsystems (llama-model, llama-context, llama-vocab, etc.) while exposing a unified C interface. It implements device memory estimation for automatic GPU layer distribution, model parameter fitting (auto n_gpu_layers / n_ctx), sampler chain initialization, and the public lifecycle functions including `llama_model_load_from_file`, `llama_init_from_model`, and `llama_free`. It uses a layer-fraction system to distribute tensor groups across devices and provides utility functions for querying backend capabilities (mmap, mlock, GPU offload, RPC support).
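To make the idea of distributing layers across devices concrete, here is a minimal sketch that assigns a layer count to each device in proportion to its free memory. It is not the algorithm used in `src/llama.cpp`: the function name `split_layers_by_free_memory` and the proportional rule are assumptions chosen for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of llama.cpp): assign n_layers to devices in
// proportion to each device's free memory. Cumulative boundaries are used so
// that the per-device counts always sum to n_layers.
std::vector<int> split_layers_by_free_memory(int n_layers,
                                             const std::vector<std::uint64_t> & free_mem) {
    std::vector<int> counts(free_mem.size(), 0);
    const std::uint64_t total =
        std::accumulate(free_mem.begin(), free_mem.end(), std::uint64_t{0});
    if (n_layers <= 0 || total == 0) {
        return counts; // nothing to place, or no device memory reported
    }
    std::uint64_t cumulative = 0;
    int assigned = 0;
    for (std::size_t i = 0; i < free_mem.size(); ++i) {
        cumulative += free_mem[i];
        // Number of layers that devices 0..i hold together:
        // floor(n_layers * cumulative / total). Equals n_layers on the last device.
        const int boundary = static_cast<int>(
            static_cast<std::uint64_t>(n_layers) * cumulative / total);
        counts[i] = boundary - assigned;
        assigned  = boundary;
    }
    return counts;
}
```

The integer product `n_layers * cumulative` can overflow for very large byte counts, so a caller might pass memory sizes in MiB rather than bytes.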
Usage
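External consumers include `llama.h` and link against the library rather than using this file directly. The top-level functions declared in `llama.h` (backend lifecycle, model loading, capability queries) are implemented here; the remainder are implemented in the internal subsystems it delegates to.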
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/llama.cpp
- Lines: 1-1174
Signature
// Flash attention type name
const char * llama_flash_attn_type_name(enum llama_flash_attn_type flash_attn_type);
// Backend capability queries
bool llama_supports_mmap(void);
bool llama_supports_mlock(void);
bool llama_supports_gpu_offload(void);
bool llama_supports_rpc(void);
// Backend lifecycle
void llama_backend_init(void);
void llama_numa_init(enum ggml_numa_strategy numa);
void llama_backend_free(void);
// Model loading
struct llama_model * llama_model_load_from_file(
const char * path_model, struct llama_model_params params);
struct llama_model * llama_model_load_from_splits(
const char ** paths, size_t n_paths, struct llama_model_params params);
void llama_model_save_to_file(const struct llama_model * model, const char * path_model);
// Sampler chain
struct llama_sampler_chain_params llama_sampler_chain_default_params(void);
// System info
const char * llama_print_system_info(void);
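`llama_model_load_from_splits` takes a plain array of C strings, so a C++ caller needs to keep the path strings alive while passing pointers to them. The sketch below shows one way to prepare that array. The helper names `make_split_paths` and `as_c_array` are hypothetical, and the `-00001-of-0000N.gguf` naming pattern is an assumption about how the split files are named on disk.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: build "<prefix>-00001-of-0000N.gguf" style file names.
// The naming pattern is an assumption; adjust it to match the files on disk.
std::vector<std::string> make_split_paths(const std::string & prefix, int n_splits) {
    std::vector<std::string> paths;
    for (int i = 1; i <= n_splits; ++i) {
        char suffix[32];
        std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", i, n_splits);
        paths.push_back(prefix + suffix);
    }
    return paths;
}

// Hypothetical helper: expose the owned strings as a const char * array.
// The returned pointers are only valid while `paths` is alive and unmodified.
std::vector<const char *> as_c_array(const std::vector<std::string> & paths) {
    std::vector<const char *> out;
    out.reserve(paths.size());
    for (const std::string & p : paths) {
        out.push_back(p.c_str());
    }
    return out;
}

// With llama.h available, the result could be passed along as:
//   llama_model_load_from_splits(c_paths.data(), c_paths.size(), params);
```

Keeping ownership in a `std::vector<std::string>` and deriving the pointer array from it avoids manual allocation, at the cost of requiring both vectors to outlive the load call.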
Import
#include "llama.h"
#include "llama-impl.h"
#include "llama-chat.h"
#include "llama-context.h"
#include "llama-mmap.h"
#include "llama-vocab.h"
#include "llama-model-loader.h"
#include "llama-model-saver.h"
#include "llama-model.h"
#include "ggml.h"
#include "ggml-backend.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path_model | const char * | Yes | File system path to the GGUF model file |
| params | llama_model_params | Yes | Model loading parameters (n_gpu_layers, use_mmap, devices, etc.) |
| cparams | llama_context_params | No | Context parameters for auto-fitting (n_ctx, n_batch, etc.) |
Outputs
| Name | Type | Description |
|---|---|---|
| model | llama_model * | Loaded model handle, or nullptr on failure |
| supports_* | bool | Backend capability flags for mmap, mlock, GPU, RPC |
| system_info | const char * | Human-readable string describing system capabilities |
| sampler_params | llama_sampler_chain_params | Default sampler chain configuration |
Usage Examples
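#include "llama.h"
// Initialize the backend once per process
llama_backend_init();
llama_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);
// Load the model, offloading up to 35 layers to the GPU
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = 35;
llama_model * model = llama_model_load_from_file("model.gguf", model_params);
if (model == nullptr) {
    // Loading failed, so there is no model to free
    llama_backend_free();
    return 1; // assumes this fragment runs inside main()
}
// Check capabilities
if (llama_supports_mmap()) {
    // The backend supports mmap; whether this load used it depends on model_params.use_mmap
}
// Clean up: free the model before tearing down the backend
llama_model_free(model);
llama_backend_free();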
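Because the model handle is `nullptr` on failure and must be freed exactly once on success, C++ callers may prefer to tie cleanup to scope. The sketch below shows that pattern with `std::unique_ptr` and a custom deleter. To stay compilable without `llama.h`, it uses stand-ins: `fake_model`, `fake_model_load`, and `fake_model_free` are hypothetical placeholders for `llama_model`, `llama_model_load_from_file`, and `llama_model_free`.

```cpp
#include <memory>

// Stand-ins for illustration only, so the sketch compiles without llama.h.
struct fake_model { int id; };

static int g_live_models = 0; // counts handles that were loaded but not yet freed

fake_model * fake_model_load(bool succeed) {
    if (!succeed) {
        return nullptr; // mirrors "nullptr on failure"
    }
    ++g_live_models;
    return new fake_model{1};
}

void fake_model_free(fake_model * m) {
    if (m != nullptr) {
        --g_live_models;
        delete m;
    }
}

// Deleter that forwards to the C-style free function.
struct fake_model_deleter {
    void operator()(fake_model * m) const { fake_model_free(m); }
};
using fake_model_ptr = std::unique_ptr<fake_model, fake_model_deleter>;

// Returns false if loading failed; otherwise the model is freed on scope exit.
bool run_once(bool load_succeeds) {
    fake_model_ptr model(fake_model_load(load_succeeds));
    if (!model) {
        return false; // nothing was loaded, so nothing is freed
    }
    // ... use model.get() wherever a raw handle is expected ...
    return true; // the deleter runs here and frees the model
}
```

In real code the owning pointer would need to go out of scope before `llama_backend_free()` is called, so that the model is released while the backend is still initialized.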