Implementation:Ggml org Llama cpp Public API

From Leeroopedia
Revision as of 12:41, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ggml_org_Llama_cpp_Public_API.md)
Knowledge Sources
Domains API, Core
Last Updated 2026-02-15 00:00 GMT

Overview

Implements the top-level public C API for llama.cpp, serving as the primary entry point that ties together model loading, context creation, inference, and resource management.

Description

This file delegates to internal subsystems (llama-model, llama-context, llama-vocab, etc.) while exposing a unified C interface. It implements device memory estimation for automatic GPU layer distribution, model parameter fitting (auto n_gpu_layers / n_ctx), sampler chain initialization, and the public lifecycle functions including `llama_model_load_from_file`, `llama_init_from_model`, and `llama_free`. It uses a layer-fraction system to distribute tensor groups across devices and provides utility functions for querying backend capabilities (mmap, mlock, GPU offload, RPC support).
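
The idea behind fitting `n_gpu_layers` to the available device memory can be sketched as below. This is a simplified illustration, not the llama.cpp implementation: the `device_budget` struct and the `fit_layers` and `total_offloaded` helpers are hypothetical names, every layer is assumed to have the same size, and only whole layers are placed, whereas the real code works with finer-grained layer fractions and measured tensor sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one device; not a llama.cpp type.
struct device_budget {
    uint64_t free_bytes;     // memory currently available on the device
    uint64_t reserved_bytes; // headroom kept back for KV cache, compute buffers, etc.
};

// Assign whole layers to devices in order, filling each one until its usable
// budget is exhausted. Returns the number of layers placed on each device;
// layers that fit nowhere are left out (they would stay on the CPU).
std::vector<int> fit_layers(const std::vector<device_budget> & devices,
                            uint64_t bytes_per_layer, int n_layers) {
    assert(bytes_per_layer > 0 && n_layers >= 0);
    std::vector<int> per_device(devices.size(), 0);
    int remaining = n_layers;
    for (size_t i = 0; i < devices.size() && remaining > 0; ++i) {
        const device_budget & d = devices[i];
        if (d.free_bytes <= d.reserved_bytes) {
            continue; // nothing usable on this device
        }
        const uint64_t usable = d.free_bytes - d.reserved_bytes;
        const uint64_t fit    = usable / bytes_per_layer;
        const int n = fit > (uint64_t) remaining ? remaining : (int) fit;
        per_device[i] = n;
        remaining -= n;
    }
    return per_device;
}

// Sum of offloaded layers: the value one might then assign to n_gpu_layers.
int total_offloaded(const std::vector<int> & per_device) {
    int sum = 0;
    for (int n : per_device) {
        sum += n;
    }
    return sum;
}
```

With two devices offering 8 GiB and 4 GiB free, 1 GiB reserved on each, and 512 MiB per layer, this sketch places 14 and 6 layers respectively, so 20 of a 32-layer model would be offloaded.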

Usage

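Use this file as the main entry point for external consumers of the llama.cpp library. It implements the top-level functions declared in `llama.h` (backend lifecycle, model loading, capability queries); the remaining public functions are implemented in the internal subsystems it delegates to.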

Code Reference

Source Location

Signature

// Flash attention type name
const char * llama_flash_attn_type_name(enum llama_flash_attn_type flash_attn_type);

// Backend capability queries
bool llama_supports_mmap(void);
bool llama_supports_mlock(void);
bool llama_supports_gpu_offload(void);
bool llama_supports_rpc(void);

// Backend lifecycle
void llama_backend_init(void);
void llama_numa_init(enum ggml_numa_strategy numa);
void llama_backend_free(void);

// Model loading
struct llama_model * llama_model_load_from_file(
    const char * path_model, struct llama_model_params params);
struct llama_model * llama_model_load_from_splits(
    const char ** paths, size_t n_paths, struct llama_model_params params);
void llama_model_save_to_file(const struct llama_model * model, const char * path_model);

// Sampler chain
struct llama_sampler_chain_params llama_sampler_chain_default_params(void);

// System info
const char * llama_print_system_info(void);
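
`llama_model_load_from_splits` takes a plain array of C strings, so a C++ caller typically has to build that array from its own containers. The sketch below shows one way to do this. Both helpers (`make_split_paths`, `to_c_array`) are hypothetical, and the `<prefix>-00001-of-00003.gguf` naming pattern is an assumption about how the split files are named on disk; adjust it to match the actual files.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: build split file names of the assumed form
// "<prefix>-00001-of-00003.gguf", numbered from 1.
std::vector<std::string> make_split_paths(const std::string & prefix, int n_splits) {
    std::vector<std::string> paths;
    for (int i = 0; i < n_splits; ++i) {
        char buf[512];
        std::snprintf(buf, sizeof(buf), "%s-%05d-of-%05d.gguf",
                      prefix.c_str(), i + 1, n_splits);
        paths.emplace_back(buf);
    }
    return paths;
}

// Hypothetical helper: expose the strings as const char * pointers.
// The returned pointers are only valid while `paths` is alive and unmodified.
std::vector<const char *> to_c_array(const std::vector<std::string> & paths) {
    std::vector<const char *> out;
    out.reserve(paths.size());
    for (const std::string & p : paths) {
        out.push_back(p.c_str());
    }
    return out;
}
```

The result of `to_c_array` can then be passed as `paths` (via `.data()`) and `n_paths` (via `.size()`), provided the `std::vector<std::string>` outlives the call.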

Import

#include "llama.h"
#include "llama-impl.h"
#include "llama-chat.h"
#include "llama-context.h"
#include "llama-mmap.h"
#include "llama-vocab.h"
#include "llama-model-loader.h"
#include "llama-model-saver.h"
#include "llama-model.h"
#include "ggml.h"
#include "ggml-backend.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| path_model | const char * | Yes | File system path to the GGUF model file |
| params | llama_model_params | Yes | Model loading parameters (n_gpu_layers, use_mmap, devices, etc.) |
| cparams | llama_context_params | No | Context parameters for auto-fitting (n_ctx, n_batch, etc.) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| model | llama_model * | Loaded model handle, or nullptr on failure |
| supports_* | bool | Backend capability flags for mmap, mlock, GPU, RPC |
| system_info | const char * | Human-readable string describing system capabilities |
| sampler_params | llama_sampler_chain_params | Default sampler chain configuration |

Usage Examples

// Initialize backend
llama_backend_init();
llama_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);

// Load model
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = 35;
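llama_model * model = llama_model_load_from_file("model.gguf", model_params);
if (model == nullptr) {
    // Loading failed: release the backend and stop
    llama_backend_free();
    return 1;
}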

// Check capabilities
if (llama_supports_mmap()) {
    // the backend supports mmap; whether this model used it depends on model_params.use_mmap
}

// Clean up
llama_model_free(model);
llama_backend_free();
