Implementation:Ggml org Llama cpp Mtmd Header
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, API |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Public C/C++ API header for the libmtmd multimodal library, defining the contract between libmtmd and all consuming code.
Description
Defines the `MTMD_API` export macro and opaque types (`mtmd_context`, `mtmd_bitmap`, `mtmd_image_tokens`, `mtmd_input_chunk`, `mtmd_input_chunks`). Declares the C API for context creation and destruction with `mtmd_context_params`, bitmap management (init, free, property getters), tokenization of interleaved text and media input, vision and audio encoding, output embedding retrieval, and chunk inspection. For C++ callers it also provides RAII deleter types (`mtmd_context_deleter`, `mtmd_bitmap_deleter`, etc.) intended for use with smart pointers, and a convenience namespace `mtmd` with `bitmap`, `bitmaps`, and `input_chunks` wrapper types.
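The deleter-based RAII pattern mentioned above can be sketched without libmtmd. The block below is only an illustration under stated assumptions: `demo_context`, `demo_init`, `demo_free`, `demo_context_deleter`, `demo_context_ptr`, and `demo_scoped_use` are hypothetical stand-ins, not names from `mtmd.h`. It shows how a deleter struct that forwards to a C-style free function lets `std::unique_ptr` release an opaque handle automatically.

```cpp
#include <memory>

// Hypothetical stand-in for an opaque C handle such as mtmd_context.
struct demo_context { int id; };

// Counts releases so the effect of the deleter can be observed.
static int g_demo_free_calls = 0;

// Stand-ins for a C-style create/free pair (not part of libmtmd).
demo_context * demo_init(int id) { return new demo_context{id}; }
void demo_free(demo_context * ctx) { ++g_demo_free_calls; delete ctx; }

// Deleter in the style of mtmd_context_deleter: forwards to the free function.
struct demo_context_deleter {
    void operator()(demo_context * ctx) const { demo_free(ctx); }
};

// Owning handle: demo_free runs when the pointer goes out of scope.
using demo_context_ptr = std::unique_ptr<demo_context, demo_context_deleter>;

// Creates a handle in an inner scope and returns how many frees that caused.
int demo_scoped_use() {
    const int before = g_demo_free_calls;
    {
        demo_context_ptr ctx(demo_init(7));
        (void) ctx->id; // use the handle while it is alive
    }
    return g_demo_free_calls - before;
}
```

With this shape, early returns and exceptions cannot leak the handle, because cleanup no longer depends on reaching an explicit free call.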
Usage
Include this header in any application, tool, or library that needs to interact with the multimodal subsystem. It is the primary public interface for all multimodal operations in llama.cpp, used by CLI tools, the server, and external applications.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/mtmd/mtmd.h
- Lines: 1-319
Signature
// Opaque types
typedef struct mtmd_context mtmd_context;
typedef struct mtmd_bitmap mtmd_bitmap;
typedef struct mtmd_image_tokens mtmd_image_tokens;
typedef struct mtmd_input_chunk mtmd_input_chunk;
typedef struct mtmd_input_chunks mtmd_input_chunks;
// Input text structure
struct mtmd_input_text {
    const char * text;
    bool add_special;
    bool parse_special;
};
// Chunk types
enum mtmd_input_chunk_type {
    MTMD_INPUT_CHUNK_TYPE_TEXT,
    MTMD_INPUT_CHUNK_TYPE_IMAGE,
    MTMD_INPUT_CHUNK_TYPE_AUDIO,
};
// Context creation / destruction
MTMD_API mtmd_context * mtmd_init_from_file(const char * mmproj_path,
    const struct llama_model * text_model, const struct mtmd_context_params params);
MTMD_API void mtmd_free(mtmd_context * ctx);
// Tokenization and encoding
MTMD_API int32_t mtmd_tokenize(mtmd_context * ctx,
    mtmd_input_chunks * output, const mtmd_input_text * text,
    const mtmd_bitmap ** bitmaps, size_t n_bitmaps);
MTMD_API int32_t mtmd_encode(mtmd_context * ctx,
    const mtmd_input_chunk * chunk);
Import
#include "ggml.h"
#include "llama.h"
// C standard headers
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| mmproj_path | const char* | Yes | Path to the multimodal projector GGUF file |
| text_model | const llama_model* | Yes | Pointer to the loaded text model |
| params | mtmd_context_params | Yes | Context configuration (n_threads, verbosity, image marker, etc.) |
| text | const mtmd_input_text* | Yes | Input text with media markers for tokenization |
| bitmaps | const mtmd_bitmap** | No | Array of loaded image/audio bitmaps, one per marker in the text |
| n_bitmaps | size_t | No | Number of entries in `bitmaps` |
Outputs
| Name | Type | Description |
|---|---|---|
| context | mtmd_context* | Initialized multimodal context ready for encoding operations; a null pointer if initialization fails |
| chunks | mtmd_input_chunks* | Caller-allocated container that `mtmd_tokenize` fills with the input split into text and media chunks |
| embeddings | float* | Encoded media embeddings, available after `mtmd_encode` |
| return code | int32_t | 0 on success, non-zero on error |
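The split into text and media chunks can be illustrated with a small, self-contained sketch. It does not call libmtmd and is not its tokenizer: `demo_chunk`, `demo_split`, and `demo_count_media` are hypothetical names, and the marker string is simply the one used in this page's usage example. The sketch cuts the input at each marker and emits one media placeholder per marker, which is why the number of bitmaps supplied by the caller is expected to match the number of markers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical chunk kinds; the real enum also distinguishes image and audio.
enum demo_chunk_type { DEMO_CHUNK_TEXT, DEMO_CHUNK_MEDIA };

struct demo_chunk {
    demo_chunk_type type;
    std::string     text; // empty for media chunks
};

// Splits `input` at every occurrence of `marker`, keeping the original order.
std::vector<demo_chunk> demo_split(const std::string & input, const std::string & marker) {
    std::vector<demo_chunk> chunks;
    if (marker.empty()) {
        // Nothing to split on: treat the whole input as a single text chunk.
        if (!input.empty()) {
            chunks.push_back({DEMO_CHUNK_TEXT, input});
        }
        return chunks;
    }
    std::size_t pos = 0;
    while (true) {
        const std::size_t hit = input.find(marker, pos);
        const std::size_t end = (hit == std::string::npos) ? input.size() : hit;
        if (end > pos) {
            chunks.push_back({DEMO_CHUNK_TEXT, input.substr(pos, end - pos)});
        }
        if (hit == std::string::npos) {
            break;
        }
        chunks.push_back({DEMO_CHUNK_MEDIA, ""});
        pos = hit + marker.size();
    }
    return chunks;
}

// Number of media chunks; a caller would compare this with its bitmap count.
std::size_t demo_count_media(const std::vector<demo_chunk> & chunks) {
    std::size_t n = 0;
    for (const demo_chunk & c : chunks) {
        if (c.type == DEMO_CHUNK_MEDIA) {
            ++n;
        }
    }
    return n;
}
```

For the prompt in the usage example, this yields one text chunk followed by one media chunk, so exactly one bitmap would be expected.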
Usage Examples
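#include "mtmd.h"
// Sketch: describe_image is a hypothetical wrapper; the caller is assumed to
// have loaded `text_model` and created `bmp` beforehand.
int32_t describe_image(const struct llama_model * text_model, const mtmd_bitmap * bmp) {
    // Initialize multimodal context
    struct mtmd_context_params params = mtmd_context_default_params();
    params.n_threads = 4;
    mtmd_context * ctx = mtmd_init_from_file("mmproj.gguf", text_model, params);
    if (ctx == NULL) {
        return 1; // projector could not be loaded
    }
    // Tokenize interleaved text + image input (one bitmap per marker)
    mtmd_input_text text = { "What is in this image? <__image__>", true, true };
    mtmd_input_chunks * chunks = mtmd_input_chunks_init();
    const mtmd_bitmap * bitmaps[] = { bmp };
    int32_t status = mtmd_tokenize(ctx, chunks, &text, bitmaps, 1);
    // Encode media chunks, stopping at the first failure
    for (size_t i = 0; status == 0 && i < mtmd_input_chunks_size(chunks); i++) {
        const mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
        if (mtmd_input_chunk_get_type(chunk) == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
            status = mtmd_encode(ctx, chunk);
        }
    }
    // Release the chunk list and the context
    mtmd_input_chunks_free(chunks);
    mtmd_free(ctx);
    return status;
}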