Implementation:Ggml org Llama cpp Mtmd Init From File
| Aspect | Detail |
|---|---|
| Implementation Name | Mtmd Init From File |
| Doc Type | API Doc |
| Domain | Multimodal Inference |
| Purpose | Initialize the multimodal context from an mmproj GGUF file and a loaded text model |
| Related Workflow | Multimodal_Inference |
| Core | Yes |
Overview
Description
This page documents the two primary functions for initializing the multimodal projector context: mtmd_context_params_default(), which returns the default configuration, and mtmd_init_from_file(), which creates an mtmd_context by loading the mmproj GGUF file and linking it to an already-loaded text model. Together they form the core initialization step of the multimodal pipeline in llama.cpp's libmtmd library.
Usage
These functions are called after the text model has been loaded. The typical pattern is:
- Call mtmd_context_params_default() to get a default configuration
- Optionally modify parameters (GPU usage, threading, media markers, token limits)
- Call mtmd_init_from_file() with the mmproj path, text model pointer, and params
Code Reference
| Aspect | Detail |
|---|---|
| Source Location (params) | tools/mtmd/mtmd.cpp:104-119 |
| Source Location (init) | tools/mtmd/mtmd.cpp:424-433 |
| Header | tools/mtmd/mtmd.h:106-112 |
| Import | #include "mtmd.h" |
Default parameters function:
mtmd_context_params mtmd_context_params_default() {
    mtmd_context_params params {
        /* use_gpu */ true,
        /* print_timings */ true,
        /* n_threads */ 4,
        /* image_marker */ MTMD_DEFAULT_IMAGE_MARKER,
        /* media_marker */ mtmd_default_marker(),
        /* flash_attn_type */ LLAMA_FLASH_ATTN_TYPE_AUTO,
        /* warmup */ true,
        /* image_min_tokens */ -1,
        /* image_max_tokens */ -1,
        /* cb_eval */ nullptr,
        /* cb_eval_user_data */ nullptr,
    };
    return params;
}
Initialization function:
mtmd_context * mtmd_init_from_file(const char * mmproj_fname,
        const struct llama_model * text_model,
        const struct mtmd_context_params ctx_params) {
    try {
        return new mtmd_context(mmproj_fname, text_model, ctx_params);
    } catch (const std::exception & e) {
        LOG_ERR("%s: error: %s\n", __func__, e.what());
        return nullptr;
    }
}
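The wrapper above turns any exception thrown by the mtmd_context constructor into a nullptr return, so callers only need a null check. A minimal, self-contained sketch of the same pattern follows. The names demo_context, demo_init_from_file, and demo_free are hypothetical stand-ins invented for illustration; they are not part of libmtmd, and the real constructor does far more than validate a file name.

```cpp
#include <cstdio>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for mtmd_context: the constructor throws on bad input.
struct demo_context {
    std::string path;

    explicit demo_context(const char * fname) {
        if (fname == nullptr || fname[0] == '\0') {
            throw std::runtime_error("empty file name");
        }
        path = fname;
    }
};

// C-style factory: no exception escapes; failure is reported as nullptr.
demo_context * demo_init_from_file(const char * fname) {
    try {
        return new demo_context(fname);
    } catch (const std::exception & e) {
        std::fprintf(stderr, "%s: error: %s\n", __func__, e.what());
        return nullptr;
    }
}

// Matching C-style release function; deleting a null pointer is a no-op.
void demo_free(demo_context * ctx) {
    delete ctx;
}
```

Because the failure reason is only logged, a caller that needs to distinguish failure causes has to rely on the log output rather than on the return value.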
The mtmd_context_params structure (from mtmd.h):
struct mtmd_context_params {
    bool use_gpu;
    bool print_timings;
    int n_threads;
    const char * image_marker; // deprecated, use media_marker instead
    const char * media_marker;
    enum llama_flash_attn_type flash_attn_type;
    bool warmup; // whether to run a warmup encode pass after initialization
    int image_min_tokens; // minimum number of tokens for image input (default: read from metadata)
    int image_max_tokens; // maximum number of tokens for image input (default: read from metadata)
    ggml_backend_sched_eval_callback cb_eval;
    void * cb_eval_user_data;
};
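The comments on image_min_tokens and image_max_tokens describe a sentinel convention: a negative default means "take the value from the model metadata". The sketch below illustrates that convention only. The function resolve_token_limit is hypothetical, and how libmtmd actually resolves these fields internally is an assumption not confirmed by this page.

```cpp
// Hypothetical helper (not part of libmtmd): a negative user value is treated
// as "unset", so the value read from model metadata is used instead.
int resolve_token_limit(int user_value, int metadata_value) {
    return user_value < 0 ? metadata_value : user_value;
}
```

Under this convention, leaving the field at -1 keeps the model's own limit, while any non-negative value set by the caller takes precedence.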
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | mmproj_fname | const char * | File path to the multimodal projector GGUF file |
| Input | text_model | const struct llama_model * | Pointer to the already-loaded text model (must remain valid for the lifetime of the mtmd_context) |
| Input | ctx_params | struct mtmd_context_params | Configuration parameters for the multimodal context |
| Output | (return) | mtmd_context * | Opaque multimodal context handle, or nullptr on failure |
Parameter details:
- use_gpu (true): Whether to use GPU acceleration for encoding
- print_timings (true): Whether to print timing information during encoding
- n_threads (4): Number of CPU threads for encoding
- media_marker ("<__media__>"): The marker string in prompts that will be replaced with media embeddings
- warmup (true): Run a warmup encode pass after initialization to pre-compile compute graphs
- image_min_tokens / image_max_tokens (-1): Token count limits for dynamic-resolution vision models; -1 means read from model metadata
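Since the media marker is a placeholder that gets replaced by media embeddings, it can be useful to check how many markers a prompt contains before supplying media inputs. The following sketch counts marker occurrences with plain string search. The helper count_media_markers is hypothetical and not part of libmtmd, and the one-marker-per-media-item assumption should be checked against the tokenization API in use.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of libmtmd): counts non-overlapping
// occurrences of `marker` in `prompt`. An empty marker yields 0.
std::size_t count_media_markers(const std::string & prompt, const std::string & marker) {
    if (marker.empty()) {
        return 0;
    }
    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = prompt.find(marker, pos)) != std::string::npos) {
        ++count;
        pos += marker.size(); // continue searching after this occurrence
    }
    return count;
}
```

For instance, calling it with the prompt "Compare <__media__> with <__media__>." and the default marker reports two placeholders.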
Usage Examples
Example 1: Basic multimodal context initialization
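#include <cstdio>
#include "llama.h"
#include "mtmd.h"
// Load the text model first; it must outlive the mtmd_context
struct llama_model_params model_params = llama_model_default_params();
struct llama_model * model = llama_model_load_from_file("model.gguf", model_params);
if (model == nullptr) {
    fprintf(stderr, "Failed to load text model\n");
    return 1;
}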
// Get default parameters and initialize
struct mtmd_context_params ctx_params = mtmd_context_params_default();
mtmd_context * mtmd_ctx = mtmd_init_from_file("mmproj-f16.gguf", model, ctx_params);
if (mtmd_ctx == nullptr) {
    fprintf(stderr, "Failed to initialize multimodal context\n");
    return 1;
}
// Check supported modalities
bool has_vision = mtmd_support_vision(mtmd_ctx);
bool has_audio = mtmd_support_audio(mtmd_ctx);
// ... use mtmd_ctx for tokenization and encoding ...
// Cleanup: free the multimodal context before the text model it references
mtmd_free(mtmd_ctx);
llama_model_free(model);
Example 2: Custom parameters with GPU disabled
struct mtmd_context_params ctx_params = mtmd_context_params_default();
ctx_params.use_gpu = false; // CPU-only encoding
ctx_params.n_threads = 8; // use 8 threads
ctx_params.warmup = false; // skip warmup for faster startup
ctx_params.image_max_tokens = 256; // limit image tokens for dynamic resolution models
mtmd_context * mtmd_ctx = mtmd_init_from_file("mmproj-f16.gguf", model, ctx_params);
Example 3: Using C++ RAII wrapper
#include <cstdio>
#include "mtmd.h"
// `model` is the llama_model * loaded as in Example 1
struct mtmd_context_params ctx_params = mtmd_context_params_default();
// The mtmd namespace provides unique_ptr wrappers with custom deleters
mtmd::context_ptr ctx(mtmd_init_from_file("mmproj-f16.gguf", model, ctx_params));
if (!ctx) {
    fprintf(stderr, "Failed to initialize multimodal context\n");
    return 1;
}
// ctx will be automatically freed when it goes out of scope
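To make the "unique_ptr wrappers with custom deleters" idea concrete, here is a self-contained sketch of that idiom around a C-style create/destroy pair. Everything in it (demo_handle, demo_create, demo_destroy, demo_deleter, demo_handle_ptr, live_handles) is hypothetical and written for illustration; it is assumed, not confirmed here, that mtmd::context_ptr is built the same way around mtmd_free.

```cpp
#include <memory>

// Hypothetical C-style handle with a create/destroy pair
// (stand-ins for mtmd_context, mtmd_init_from_file, and mtmd_free).
struct demo_handle {
    int id;
};

static int g_live_handles = 0; // tracks outstanding handles for the demo

demo_handle * demo_create(int id) {
    ++g_live_handles;
    return new demo_handle{id};
}

void demo_destroy(demo_handle * h) {
    if (h != nullptr) {
        --g_live_handles;
        delete h;
    }
}

// Custom deleter: unique_ptr invokes this instead of `delete`.
struct demo_deleter {
    void operator()(demo_handle * h) const { demo_destroy(h); }
};

using demo_handle_ptr = std::unique_ptr<demo_handle, demo_deleter>;

int live_handles() {
    return g_live_handles;
}
```

A demo_handle_ptr constructed from demo_create(...) releases the handle when it leaves scope, and one constructed from nullptr never invokes the deleter, which is why the `if (!ctx)` check in Example 3 is safe.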