Implementation:Ggml org Llama cpp Public API

Knowledge Sources	Ggml_org_Llama_cpp
Domains	API, Core
Last Updated	2026-02-15 00:00 GMT

Overview

Implements the top-level public C API for llama.cpp, serving as the primary entry point that ties together model loading, context creation, inference, and resource management.

Description

This file delegates to internal subsystems (llama-model, llama-context, llama-vocab, etc.) while exposing a unified C interface. It implements device memory estimation for automatic GPU layer distribution, model parameter fitting (auto n_gpu_layers / n_ctx), sampler chain initialization, and the public lifecycle functions including `llama_model_load_from_file`, `llama_init_from_model`, and `llama_free`. It uses a layer-fraction system to distribute tensor groups across devices and provides utility functions for querying backend capabilities (mmap, mlock, GPU offload, RPC support).
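The idea of distributing layers across devices by fraction can be illustrated with a minimal sketch. It is not the code in `src/llama.cpp`: the helper name `split_layers_by_free_memory`, its parameters, and the choice to split purely in proportion to free memory are assumptions made for illustration. The actual implementation works on tensor groups and memory estimates that this sketch does not model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, NOT part of llama.cpp: assign n_layers to devices in
// proportion to each device's free memory, using cumulative fractions.
static std::vector<int> split_layers_by_free_memory(
        int n_layers, const std::vector<uint64_t> & free_bytes) {
    assert(n_layers >= 0);
    std::vector<int> layers_per_device(free_bytes.size(), 0);

    uint64_t total = 0;
    for (uint64_t b : free_bytes) {
        total += b;
    }
    if (total == 0 || n_layers == 0) {
        return layers_per_device; // nothing to distribute
    }

    // Device i owns the layers in [begin, end), where end is the cumulative
    // memory fraction scaled to n_layers. Integer arithmetic keeps the
    // boundaries exact, and the last boundary always equals n_layers, so
    // every layer is assigned exactly once. (Assumes cumulative * n_layers
    // fits in 64 bits, which holds for realistic memory sizes.)
    uint64_t cumulative = 0;
    int begin = 0;
    for (size_t i = 0; i < free_bytes.size(); ++i) {
        cumulative += free_bytes[i];
        const int end = static_cast<int>(
            (cumulative * static_cast<uint64_t>(n_layers)) / total);
        layers_per_device[i] = end - begin;
        begin = end;
    }
    return layers_per_device;
}
```

With two hypothetical devices reporting 24 GiB and 8 GiB free, calling `split_layers_by_free_memory(32, {24ull << 30, 8ull << 30})` yields 24 layers for the first device and 8 for the second. Cumulative boundaries are used rather than rounding each share independently so that the per-device counts always sum to `n_layers`.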

Usage

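External consumers reach the library through the functions declared in `llama.h`. This file implements the top-level ones listed below (backend lifecycle, model loading and saving, capability queries, system info); other functions declared in `llama.h` are implemented in the internal subsystems this file delegates to.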

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: src/llama.cpp
Lines: 1-1174

Signature

// Flash attention type name
const char * llama_flash_attn_type_name(enum llama_flash_attn_type flash_attn_type);

// Backend capability queries
bool llama_supports_mmap(void);
bool llama_supports_mlock(void);
bool llama_supports_gpu_offload(void);
bool llama_supports_rpc(void);

// Backend lifecycle
void llama_backend_init(void);
void llama_numa_init(enum ggml_numa_strategy numa);
void llama_backend_free(void);

// Model loading
struct llama_model * llama_model_load_from_file(
    const char * path_model, struct llama_model_params params);
struct llama_model * llama_model_load_from_splits(
    const char ** paths, size_t n_paths, struct llama_model_params params);
void llama_model_save_to_file(const struct llama_model * model, const char * path_model);

// Sampler chain
struct llama_sampler_chain_params llama_sampler_chain_default_params(void);

// System info
const char * llama_print_system_info(void);

Import

#include "llama.h"
#include "llama-impl.h"
#include "llama-chat.h"
#include "llama-context.h"
#include "llama-mmap.h"
#include "llama-vocab.h"
#include "llama-model-loader.h"
#include "llama-model-saver.h"
#include "llama-model.h"
#include "ggml.h"
#include "ggml-backend.h"

I/O Contract

Inputs

Name	Type	Required	Description
path_model	const char *	Yes	File system path to the GGUF model file
params	llama_model_params	Yes	Model loading parameters (n_gpu_layers, use_mmap, devices, etc.)
cparams	llama_context_params	No	Context parameters for auto-fitting (n_ctx, n_batch, etc.)

Outputs

Name	Type	Description
model	llama_model *	Loaded model handle, or nullptr on failure
supports_*	bool	Backend capability flags for mmap, mlock, GPU, RPC
system_info	const char *	Human-readable string describing system capabilities
sampler_params	llama_sampler_chain_params	Default sampler chain configuration

Usage Examples

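// Initialize the backend once per process
llama_backend_init();
llama_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);

// Load model
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = 35;
llama_model * model = llama_model_load_from_file("model.gguf", model_params);
if (model == nullptr) {
    // Loading failed: release the backend and stop
    // (assumes this snippet runs inside main())
    llama_backend_free();
    return 1;
}

// Check capabilities
if (llama_supports_mmap()) {
    // mmap is available on this platform; whether the model was actually
    // memory-mapped depends on model_params.use_mmap
}

// Clean up
llama_model_free(model);
llama_backend_free();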

Related Pages

Principle:Ggml_org_Llama_cpp_PublicAPI
