Implementation:Ggml org Llama cpp Mtmd Init From File

Aspect	Detail
Implementation Name	Mtmd Init From File
Doc Type	API Doc
Domain	Multimodal Inference
Purpose	Initialize the multimodal context from an mmproj GGUF file and a loaded text model
Related Workflow	Multimodal_Inference
Core	Yes

Overview

Description

This page documents the two functions that initialize the multimodal projector context: mtmd_context_params_default(), which returns the default configuration, and mtmd_init_from_file(), which creates an mtmd_context by loading the mmproj GGUF file and linking it to an already-loaded text model. Together they form the core initialization step of the multimodal pipeline in llama.cpp's libmtmd library.

Usage

These functions are called after the text model has been loaded. The typical pattern is:

Call mtmd_context_params_default() to get a default configuration
Optionally modify parameters (GPU usage, threading, media markers, token limits)
Call mtmd_init_from_file() with the mmproj path, text model pointer, and params

Code Reference

Aspect	Detail
Source Location (params)	`tools/mtmd/mtmd.cpp:104-119`
Source Location (init)	`tools/mtmd/mtmd.cpp:424-433`
Header	`tools/mtmd/mtmd.h:106-112`
Import	`#include "mtmd.h"`

Default parameters function:

mtmd_context_params mtmd_context_params_default() {
    mtmd_context_params params {
        /* use_gpu           */ true,
        /* print_timings     */ true,
        /* n_threads         */ 4,
        /* image_marker      */ MTMD_DEFAULT_IMAGE_MARKER,
        /* media_marker      */ mtmd_default_marker(),
        /* flash_attn_type   */ LLAMA_FLASH_ATTN_TYPE_AUTO,
        /* warmup            */ true,
        /* image_min_tokens  */ -1,
        /* image_max_tokens  */ -1,
        /* cb_eval           */ nullptr,
        /* cb_eval_user_data */ nullptr,
    };
    return params;
}

Initialization function:

mtmd_context * mtmd_init_from_file(const char * mmproj_fname,
        const struct llama_model * text_model,
        const struct mtmd_context_params ctx_params) {
    try {
        return new mtmd_context(mmproj_fname, text_model, ctx_params);
    } catch (const std::exception & e) {
        LOG_ERR("%s: error: %s\n", __func__, e.what());
        return nullptr;
    }
}
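
The wrapper above turns a throwing C++ constructor into a C-style nullable return, so no exception crosses the API boundary. The following is a minimal, self-contained sketch of that pattern only; demo_context, demo_init and demo_free are hypothetical stand-ins and not part of libmtmd.

```cpp
#include <cstdio>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for mtmd_context: the constructor throws on bad input.
struct demo_context {
    std::string path;

    explicit demo_context(const char * fname) {
        if (fname == nullptr || fname[0] == '\0') {
            throw std::runtime_error("empty file name");
        }
        path = fname;
    }
};

// Hypothetical factory with the same shape as mtmd_init_from_file():
// any std::exception is logged and reported to the caller as nullptr.
demo_context * demo_init(const char * fname) {
    try {
        return new demo_context(fname);
    } catch (const std::exception & e) {
        std::fprintf(stderr, "%s: error: %s\n", __func__, e.what());
        return nullptr;
    }
}

// Matching release function; deleting a null pointer is a no-op.
void demo_free(demo_context * ctx) {
    delete ctx;
}
```

Because failure is signalled only through the return value, callers should check for nullptr before using the handle, as the usage examples on this page do.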

The mtmd_context_params structure (from mtmd.h):

struct mtmd_context_params {
    bool use_gpu;
    bool print_timings;
    int n_threads;
    const char * image_marker;   // deprecated, use media_marker instead
    const char * media_marker;
    enum llama_flash_attn_type flash_attn_type;
    bool warmup;                 // whether to run a warmup encode pass after initialization
    int image_min_tokens;        // minimum number of tokens for image input (default: read from metadata)
    int image_max_tokens;        // maximum number of tokens for image input (default: read from metadata)
    ggml_backend_sched_eval_callback cb_eval;
    void * cb_eval_user_data;
};

I/O Contract

Direction	Name	Type	Description
Input	mmproj_fname	`const char *`	File path to the multimodal projector GGUF file
Input	text_model	`const struct llama_model *`	Pointer to the already-loaded text model (must remain valid for the lifetime of the mtmd_context)
Input	ctx_params	`struct mtmd_context_params`	Configuration parameters for the multimodal context
Output	(return)	`mtmd_context *`	Opaque multimodal context handle, or `nullptr` on failure

Parameter details:

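use_gpu (true): Whether to use GPU acceleration for encoding
print_timings (true): Whether to print timing information during encoding
n_threads (4): Number of CPU threads for encoding
image_marker (MTMD_DEFAULT_IMAGE_MARKER): Deprecated; use media_marker instead
media_marker ("<__media__>", as returned by mtmd_default_marker()): The marker string in prompts that will be replaced with media embeddings
flash_attn_type (LLAMA_FLASH_ATTN_TYPE_AUTO): Flash attention mode
warmup (true): Run a warmup encode pass after initialization to pre-compile compute graphs
image_min_tokens / image_max_tokens (-1): Token count limits for dynamic-resolution vision models; -1 means read from model metadata
cb_eval / cb_eval_user_data (nullptr): Optional ggml backend scheduler eval callback and the user data pointer passed to it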
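
To illustrate the -1 convention for image_min_tokens and image_max_tokens, here is a hedged sketch of how such a sentinel could be resolved against values stored in model metadata. The helpers resolve_token_limit and resolve_limits, and the metadata values passed to them, are hypothetical; the actual resolution happens inside libmtmd and may differ.

```cpp
// Hypothetical helper: a negative request means "use the metadata value".
int resolve_token_limit(int requested, int from_metadata) {
    return requested < 0 ? from_metadata : requested;
}

// Pair of resolved limits for one image input.
struct token_limits {
    int min_tokens;
    int max_tokens;
};

// Resolve both limits independently, so a caller can override only one of them.
token_limits resolve_limits(int req_min, int req_max, int meta_min, int meta_max) {
    token_limits out;
    out.min_tokens = resolve_token_limit(req_min, meta_min);
    out.max_tokens = resolve_token_limit(req_max, meta_max);
    return out;
}
```

Under this sketch, setting only image_max_tokens leaves the minimum at whatever the metadata specifies.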

Usage Examples

Example 1: Basic multimodal context initialization

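#include <cstdio>

#include "llama.h"
#include "mtmd.h"

int main() {
    // Load the text model first; it must outlive the mtmd_context
    struct llama_model_params model_params = llama_model_default_params();
    struct llama_model * model = llama_model_load_from_file("model.gguf", model_params);
    if (model == nullptr) {
        fprintf(stderr, "Failed to load text model\n");
        return 1;
    }

    // Get default parameters and initialize
    struct mtmd_context_params ctx_params = mtmd_context_params_default();
    mtmd_context * mtmd_ctx = mtmd_init_from_file("mmproj-f16.gguf", model, ctx_params);

    if (mtmd_ctx == nullptr) {
        fprintf(stderr, "Failed to initialize multimodal context\n");
        llama_model_free(model);
        return 1;
    }

    // Check supported modalities
    bool has_vision = mtmd_support_vision(mtmd_ctx);
    bool has_audio  = mtmd_support_audio(mtmd_ctx);
    printf("vision: %d, audio: %d\n", has_vision, has_audio);

    // ... use mtmd_ctx for tokenization and encoding ...

    // Cleanup: free the multimodal context before the text model
    mtmd_free(mtmd_ctx);
    llama_model_free(model);
    return 0;
}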

Example 2: Custom parameters with GPU disabled

// `model` is the already-loaded text model, as in Example 1
struct mtmd_context_params ctx_params = mtmd_context_params_default();
ctx_params.use_gpu = false;         // CPU-only encoding
ctx_params.n_threads = 8;           // use 8 threads
ctx_params.warmup = false;          // skip warmup for faster startup
ctx_params.image_max_tokens = 256;  // limit image tokens for dynamic-resolution models

mtmd_context * mtmd_ctx = mtmd_init_from_file("mmproj-f16.gguf", model, ctx_params);

Example 3: Using C++ RAII wrapper

#include "mtmd.h"

struct mtmd_context_params ctx_params = mtmd_context_params_default();

// The mtmd namespace provides unique_ptr wrappers with custom deleters
mtmd::context_ptr ctx(mtmd_init_from_file("mmproj-f16.gguf", model, ctx_params));
if (!ctx) {
    fprintf(stderr, "Failed to initialize multimodal context\n");
    return 1;
}
// ctx will be automatically freed when it goes out of scope
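
The RAII approach also helps with the lifetime requirement from the I/O contract: the text model must stay valid for as long as the context exists. The sketch below shows the general technique with std::unique_ptr and custom deleters. All demo_* names are hypothetical stand-ins rather than libmtmd or llama.cpp symbols, and the log exists only to make the destruction order observable.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for llama_model and mtmd_context.
struct demo_model   { };
struct demo_context { const demo_model * model; };

// Records the order in which the free functions run.
std::vector<std::string> & free_log() {
    static std::vector<std::string> log;
    return log;
}

// C-style free functions, analogous in shape to llama_model_free / mtmd_free.
void demo_model_free(demo_model * m)     { free_log().push_back("model");   delete m; }
void demo_context_free(demo_context * c) { free_log().push_back("context"); delete c; }

// Custom deleters that forward to the free functions.
struct demo_model_deleter   { void operator()(demo_model * m)   const { demo_model_free(m); } };
struct demo_context_deleter { void operator()(demo_context * c) const { demo_context_free(c); } };

using demo_model_ptr   = std::unique_ptr<demo_model, demo_model_deleter>;
using demo_context_ptr = std::unique_ptr<demo_context, demo_context_deleter>;

// Locals are destroyed in reverse declaration order, so declaring the model
// first guarantees that the context, which borrows the model, is freed first.
void run_scope() {
    demo_model_ptr   model(new demo_model());
    demo_context_ptr ctx(new demo_context{model.get()});
}
```

Declaring the owning pointer for the model before the one for the context is the design choice that matters here; reversing the two declarations would free the model while the context still refers to it.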
