Implementation:Ggml org Llama cpp Llama Model Load For Multimodal

Aspect	Detail
Implementation Name	Llama Model Load For Multimodal
Doc Type	Pattern Doc
Domain	Multimodal Inference
Purpose	Loading the text GGUF model as foundation for the multimodal pipeline
Related Workflow	Multimodal_Inference

Overview

Description

This pattern documents loading the text/language GGUF model using llama_model_load_from_file(), which is the standard entry point for model loading in llama.cpp. In the multimodal pipeline, this is the first step: the resulting llama_model * pointer is subsequently passed to mtmd_init_from_file() to establish the multimodal projector context.

Usage

The text model must be loaded before any multimodal context is created. The returned llama_model * is treated as the language backbone throughout the entire multimodal session. It is passed as a const pointer to the multimodal initialization, meaning the mtmd layer reads model metadata (embedding dimensions, vocabulary) but does not modify the model itself.

Code Reference

Aspect	Detail
Source Location	`include/llama.h:450-452`
Signature	`struct llama_model * llama_model_load_from_file(const char * path_model, struct llama_model_params params)`
Import	`#include "llama.h"`

The function loads a GGUF model file from disk and returns an opaque model handle. If the model file is split into multiple parts, the filename must follow the pattern <name>-%05d-of-%05d.gguf. For custom split naming, use llama_model_load_from_splits() instead.
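The split naming pattern above can be sketched with a small formatting helper. This is an illustration only: `build_split_path` is a hypothetical name, not a function from llama.h, and the sketch assumes split numbers are 1-based (so the first file ends in `-00001-of-0000N.gguf`).

```c
#include <stdio.h>

/* Hypothetical helper (not part of llama.h): writes the path of one split
 * into `out`, following the pattern <name>-%05d-of-%05d.gguf.
 * Assumption: split_no is 1-based.
 * Returns 0 on success, -1 on invalid arguments or if `out` is too small. */
static int build_split_path(char *out, size_t out_size,
                            const char *prefix, int split_no, int split_count) {
    if (out == NULL || prefix == NULL || split_no < 1 || split_no > split_count) {
        return -1;
    }
    int n = snprintf(out, out_size, "%s-%05d-of-%05d.gguf",
                     prefix, split_no, split_count);
    /* snprintf returns the length it wanted to write; reject truncation. */
    return (n >= 0 && (size_t) n < out_size) ? 0 : -1;
}
```

Calling `build_split_path(buf, sizeof buf, "models/llava", 1, 3)` fills `buf` with `models/llava-00001-of-00003.gguf`. Files that do not match this layout are the case where llama_model_load_from_splits() applies.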

I/O Contract

Direction	Name	Type	Description
Input	path_model	`const char *`	File path to the GGUF text model
Input	params	`struct llama_model_params`	Model loading parameters (GPU layers, mmap, mlock, split mode)
Output	(return)	`struct llama_model *`	Opaque model handle, or `NULL` on failure
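Because a `NULL` return does not by itself say why loading failed, a caller may want a cheap pre-flight check on `path_model`. The sketch below is one possible approach, not part of the llama.cpp API: `looks_like_gguf` is a hypothetical helper that only tests whether the file opens and begins with the 4-byte GGUF magic. It does not validate the rest of the file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical pre-flight check (not part of llama.h).
 * Returns  1 if the file opens and starts with the magic bytes "GGUF",
 *          0 if it opens but the first four bytes differ or are missing,
 *         -1 if the file cannot be opened. */
static int looks_like_gguf(const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        return -1;
    }
    char magic[4] = {0};
    size_t n = fread(magic, 1, sizeof magic, f);
    fclose(f);
    return (n == sizeof magic && memcmp(magic, "GGUF", 4) == 0) ? 1 : 0;
}
```

A result of -1 points at a wrong path or permissions, while 0 suggests the file is not a GGUF model at all (for example, an unconverted checkpoint).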

Usage Examples

Example 1: Basic multimodal model loading pattern

#include <stdio.h>
#include "llama.h"

int main(void) {
    // Initialize the backend
    llama_backend_init();

    // Configure model parameters
    struct llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = 35;  // offload 35 layers to GPU

    // Load the text model
    struct llama_model * model = llama_model_load_from_file(
        "models/llava-v1.6-vicuna-7b-Q4_K_M.gguf",
        model_params
    );

    if (model == NULL) {
        fprintf(stderr, "Failed to load text model\n");
        llama_backend_free();
        return 1;
    }

    // The model pointer is now ready to be passed to mtmd_init_from_file()
    // along with the mmproj GGUF path

    // Release the model and the backend once the multimodal session is finished
    llama_model_free(model);
    llama_backend_free();
    return 0;
}

Example 2: Loading with context creation for inference

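#include <stdio.h>
#include "llama.h"

int main(void) {
    struct llama_model_params model_params = llama_model_default_params();
    model_params.n_gpu_layers = 99;  // offload all layers

    struct llama_model * model = llama_model_load_from_file("model.gguf", model_params);
    if (model == NULL) {
        fprintf(stderr, "Failed to load model\n");
        return 1;
    }

    // Create a context for inference
    struct llama_context_params ctx_params = llama_context_default_params();
    ctx_params.n_ctx = 4096;
    ctx_params.n_batch = 512;

    struct llama_context * ctx = llama_init_from_model(model, ctx_params);
    if (ctx == NULL) {
        fprintf(stderr, "Failed to create context\n");
        llama_model_free(model);
        return 1;
    }

    // Both model and ctx are needed for the full multimodal pipeline;
    // release them in reverse order of creation when done
    llama_free(ctx);
    llama_model_free(model);
    return 0;
}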
