Implementation:Ggml org Llama cpp Mtmd Init From File

From Leeroopedia
Aspect | Detail
Implementation Name | Mtmd Init From File
Doc Type | API Doc
Domain | Multimodal Inference
Purpose | Initialize the multimodal context from an mmproj GGUF file and a loaded text model
Related Workflow | Multimodal_Inference
Core | Yes

Overview

Description

This page documents the two functions that initialize the multimodal projector context: mtmd_context_params_default(), which returns a default configuration, and mtmd_init_from_file(), which creates the mtmd_context by loading the mmproj GGUF file and linking it to the already-loaded text model. Together they form the core initialization step of the multimodal pipeline in llama.cpp's libmtmd library.

Usage

These functions are called after the text model has been loaded. The typical pattern is:

  1. Call mtmd_context_params_default() to get a default configuration
  2. Optionally modify parameters (GPU usage, threading, media markers, token limits)
  3. Call mtmd_init_from_file() with the mmproj path, text model pointer, and params

Code Reference

Aspect | Detail
Source Location (params) | tools/mtmd/mtmd.cpp:104-119
Source Location (init) | tools/mtmd/mtmd.cpp:424-433
Header | tools/mtmd/mtmd.h:106-112
Import | #include "mtmd.h"

Default parameters function:

mtmd_context_params mtmd_context_params_default() {
    mtmd_context_params params {
        /* use_gpu           */ true,
        /* print_timings     */ true,
        /* n_threads         */ 4,
        /* image_marker      */ MTMD_DEFAULT_IMAGE_MARKER,
        /* media_marker      */ mtmd_default_marker(),
        /* flash_attn_type   */ LLAMA_FLASH_ATTN_TYPE_AUTO,
        /* warmup            */ true,
        /* image_min_tokens  */ -1,
        /* image_max_tokens  */ -1,
        /* cb_eval           */ nullptr,
        /* cb_eval_user_data */ nullptr,
    };
    return params;
}

Initialization function:

mtmd_context * mtmd_init_from_file(const char * mmproj_fname,
        const struct llama_model * text_model,
        const struct mtmd_context_params ctx_params) {
    try {
        return new mtmd_context(mmproj_fname, text_model, ctx_params);
    } catch (const std::exception & e) {
        LOG_ERR("%s: error: %s\n", __func__, e.what());
        return nullptr;
    }
}
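
The try/catch in mtmd_init_from_file() keeps C++ exceptions from crossing the C-style API boundary: a failing constructor becomes a logged error and a nullptr return. The sketch below reproduces that shape with stand-in types so it compiles without llama.cpp. demo_context, demo_init_from_file, and demo_free are hypothetical names used only for illustration, and the failure condition (an empty path) is an assumption, not the check the real constructor performs.

```cpp
#include <cstdio>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for mtmd_context: the constructor throws on bad input.
struct demo_context {
    std::string path;
    explicit demo_context(const char * fname) {
        if (fname == nullptr || fname[0] == '\0') {
            throw std::runtime_error("empty projector path");
        }
        path = fname;
    }
};

// Hypothetical factory with the same shape as mtmd_init_from_file():
// exceptions stop here, so the caller only ever sees a pointer or nullptr.
demo_context * demo_init_from_file(const char * fname) {
    try {
        return new demo_context(fname);
    } catch (const std::exception & e) {
        std::fprintf(stderr, "%s: error: %s\n", __func__, e.what());
        return nullptr;
    }
}

// Counterpart to the factory; deleting nullptr is a no-op.
void demo_free(demo_context * ctx) {
    delete ctx;
}
```

With this shape, callers written in C or in C++ built without exception handling can rely on a single nullptr check, which is the check the usage examples on this page perform.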

The mtmd_context_params structure (from mtmd.h):

struct mtmd_context_params {
    bool use_gpu;
    bool print_timings;
    int n_threads;
    const char * image_marker;   // deprecated, use media_marker instead
    const char * media_marker;
    enum llama_flash_attn_type flash_attn_type;
    bool warmup;                 // whether to run a warmup encode pass after initialization
    int image_min_tokens;        // minimum number of tokens for image input (default: read from metadata)
    int image_max_tokens;        // maximum number of tokens for image input (default: read from metadata)
    ggml_backend_sched_eval_callback cb_eval;
    void * cb_eval_user_data;
};

I/O Contract

Direction | Name | Type | Description
Input | mmproj_fname | const char * | File path to the multimodal projector GGUF file
Input | text_model | const struct llama_model * | Pointer to the already-loaded text model (must remain valid for the lifetime of the mtmd_context)
Input | ctx_params | struct mtmd_context_params | Configuration parameters for the multimodal context
Output | (return) | mtmd_context * | Opaque multimodal context handle, or nullptr on failure
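
The lifetime requirement on text_model means the context has to be released before the model it points to. One way to get that ordering without manual bookkeeping is to rely on C++ destroying local objects in reverse order of declaration. The sketch below shows this with stand-in types; demo_model, demo_mm_context, demo_log, and demo_lifetime are hypothetical names, and the real types would be llama_model and mtmd_context with their own free functions.

```cpp
#include <memory>
#include <string>
#include <vector>

// Records the order in which the stand-in objects are destroyed.
static std::vector<std::string> demo_log;

// Hypothetical stand-in for the loaded text model.
struct demo_model {
    ~demo_model() { demo_log.push_back("model freed"); }
};

// Hypothetical stand-in for the multimodal context: it borrows the model
// pointer and does not own it.
struct demo_mm_context {
    const demo_model * model;
    explicit demo_mm_context(const demo_model * m) : model(m) {}
    ~demo_mm_context() { demo_log.push_back("context freed"); }
};

void demo_lifetime() {
    // Declared first, so destroyed last.
    std::unique_ptr<demo_model> model(new demo_model());
    // Declared second, so destroyed first, while the model is still alive.
    std::unique_ptr<demo_mm_context> ctx(new demo_mm_context(model.get()));
}
```

Declaring the model owner before the context owner is enough to keep the borrowed pointer valid for the whole life of the context.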

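Parameter details (defaults in parentheses):

  • use_gpu (true): Whether to use GPU acceleration for encoding
  • print_timings (true): Whether to print timing information during encoding
  • n_threads (4): Number of CPU threads for encoding
  • image_marker (MTMD_DEFAULT_IMAGE_MARKER): Deprecated; use media_marker instead
  • media_marker ("<__media__>", as returned by mtmd_default_marker()): The marker string in prompts that will be replaced with media embeddings
  • flash_attn_type (LLAMA_FLASH_ATTN_TYPE_AUTO): Flash attention setting, of type enum llama_flash_attn_type
  • warmup (true): Run a warmup encode pass after initialization to pre-compile compute graphs
  • image_min_tokens / image_max_tokens (-1): Token count limits for dynamic-resolution vision models; -1 means read from model metadata
  • cb_eval / cb_eval_user_data (nullptr): Optional ggml_backend_sched_eval_callback and the user data pointer passed to it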
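
The -1 default for the image token limits acts as a sentinel meaning "no override". The sketch below illustrates that convention only; demo_resolve_token_limit is a hypothetical helper, not part of libmtmd, and the library's actual resolution logic is not shown on this page.

```cpp
// Hypothetical helper, not part of libmtmd: pick the effective token limit.
// A negative override (such as the -1 default) falls back to the value read
// from model metadata; any non-negative override is used as given.
int demo_resolve_token_limit(int override_value, int metadata_value) {
    return override_value < 0 ? metadata_value : override_value;
}
```

Under this assumption, leaving image_max_tokens at -1 keeps whatever the mmproj file specifies, while setting it to 256 would cap the count at 256 regardless of the metadata.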

Usage Examples

Example 1: Basic multimodal context initialization

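#include <cstdio>
#include "llama.h"
#include "mtmd.h"

// Load the text model first; it must stay valid for the lifetime of the mtmd_context
struct llama_model_params model_params = llama_model_default_params();
struct llama_model * model = llama_model_load_from_file("model.gguf", model_params);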

// Get default parameters and initialize
struct mtmd_context_params ctx_params = mtmd_context_params_default();
mtmd_context * mtmd_ctx = mtmd_init_from_file("mmproj-f16.gguf", model, ctx_params);

if (mtmd_ctx == nullptr) {
    fprintf(stderr, "Failed to initialize multimodal context\n");
    return 1;
}

// Check supported modalities
bool has_vision = mtmd_support_vision(mtmd_ctx);
bool has_audio  = mtmd_support_audio(mtmd_ctx);

// ... use mtmd_ctx for tokenization and encoding ...

// Cleanup: free the multimodal context before the text model it references
mtmd_free(mtmd_ctx);
llama_model_free(model);

Example 2: Custom parameters with GPU disabled

struct mtmd_context_params ctx_params = mtmd_context_params_default();
ctx_params.use_gpu = false;       // CPU-only encoding
ctx_params.n_threads = 8;         // use 8 threads
ctx_params.warmup = false;        // skip warmup for faster startup
ctx_params.image_max_tokens = 256; // limit image tokens for dynamic resolution models

mtmd_context * mtmd_ctx = mtmd_init_from_file("mmproj-f16.gguf", model, ctx_params);

Example 3: Using the C++ RAII wrapper

#include <cstdio>
#include "mtmd.h"

// `model` is the text model loaded as in Example 1

struct mtmd_context_params ctx_params = mtmd_context_params_default();

// The mtmd namespace provides unique_ptr wrappers with custom deleters
mtmd::context_ptr ctx(mtmd_init_from_file("mmproj-f16.gguf", model, ctx_params));
if (!ctx) {
    fprintf(stderr, "Failed to initialize multimodal context\n");
    return 1;
}
// ctx will be automatically freed when it goes out of scope
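
The mechanism behind a wrapper like mtmd::context_ptr is a std::unique_ptr with a custom deleter that calls the library's free function. The sketch below shows it with stand-in types so it builds without llama.cpp; fake_ctx, fake_free, fake_ctx_deleter, fake_ctx_ptr, and demo_scoped_use are hypothetical names, with fake_free playing the role of mtmd_free().

```cpp
#include <memory>

// Hypothetical stand-ins for mtmd_context and mtmd_free().
struct fake_ctx { int id; };
static int fake_free_calls = 0;
void fake_free(fake_ctx * ctx) {
    ++fake_free_calls;
    delete ctx;
}

// Deleter type: unique_ptr invokes operator() when a non-null owner is destroyed.
struct fake_ctx_deleter {
    void operator()(fake_ctx * ctx) const { fake_free(ctx); }
};
using fake_ctx_ptr = std::unique_ptr<fake_ctx, fake_ctx_deleter>;

// Returns how many times fake_free ran while one scoped owner lived and died.
int demo_scoped_use() {
    const int before = fake_free_calls;
    {
        fake_ctx_ptr ctx(new fake_ctx{42});
        // ... pass ctx.get() wherever a raw fake_ctx * is expected ...
    } // the deleter runs here, exactly once
    return fake_free_calls - before;
}
```

An empty owner, such as one built from a nullptr returned by a failed initialization, never invokes the deleter, so the `if (!ctx)` early return in Example 3 needs no extra cleanup.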
