Implementation:Ggml org Llama cpp Mtmd Header

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Multimodal, API
Last Updated	2026-02-15 00:00 GMT

Overview

Public C/C++ API header for the libmtmd multimodal library, defining the contract between libmtmd and all consuming code.

Description

Defines the `MTMD_API` export macro and opaque types (`mtmd_context`, `mtmd_bitmap`, `mtmd_image_tokens`, `mtmd_input_chunk`, `mtmd_input_chunks`). Declares the C API for context creation/destruction with `mtmd_context_params`, bitmap management (init/free/get properties), tokenization of interleaved text+media input, vision/audio encoding, output embedding retrieval, and chunk inspection. Provides C++ wrappers with RAII smart pointer types (`mtmd_context_deleter`, `mtmd_bitmap_deleter`, etc.) and a convenience namespace `mtmd` with `bitmap`, `bitmaps`, and `input_chunks` types.
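
The RAII wrappers mentioned above follow the usual pattern of pairing a C `init`/`free` function with a deleter functor for `std::unique_ptr`. The sketch below illustrates that pattern only: it is self-contained and does not include `mtmd.h`, and every `demo_*` name is a hypothetical stand-in rather than part of libmtmd.

```cpp
#include <cassert>
#include <cstdio>
#include <memory>

// Hypothetical stand-in for an opaque C handle such as mtmd_bitmap.
struct demo_bitmap { int nx; int ny; };

static int g_live_bitmaps = 0; // handles created but not yet freed

// Stand-ins for a C-style init/free pair (illustrative names only).
demo_bitmap * demo_bitmap_init(int nx, int ny) {
    ++g_live_bitmaps;
    return new demo_bitmap{nx, ny};
}

void demo_bitmap_free(demo_bitmap * bmp) {
    assert(bmp != nullptr);
    --g_live_bitmaps;
    delete bmp;
}

// Deleter functor: lets std::unique_ptr release the handle via the C free function.
struct demo_bitmap_deleter {
    void operator()(demo_bitmap * bmp) const { demo_bitmap_free(bmp); }
};
using demo_bitmap_ptr = std::unique_ptr<demo_bitmap, demo_bitmap_deleter>;

// Creates a scoped owner and returns how many handles are still live afterwards.
int live_after_scope() {
    {
        demo_bitmap_ptr bmp(demo_bitmap_init(640, 480));
        std::printf("inside scope: %d live handle(s)\n", g_live_bitmaps);
    } // bmp goes out of scope here, so demo_bitmap_free runs automatically
    return g_live_bitmaps;
}
```

With this pattern, early returns and exceptions cannot leak the handle, which is presumably why the header ships deleter types alongside the C functions.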

Usage

Include this header in any application, tool, or library that needs to interact with the multimodal subsystem. It is the primary public interface for all multimodal operations in llama.cpp, used by CLI tools, the server, and external applications.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: tools/mtmd/mtmd.h
Lines: 1-319

Signature

// Opaque types
typedef struct mtmd_context      mtmd_context;
typedef struct mtmd_bitmap       mtmd_bitmap;
typedef struct mtmd_image_tokens mtmd_image_tokens;
typedef struct mtmd_input_chunk  mtmd_input_chunk;
typedef struct mtmd_input_chunks mtmd_input_chunks;

// Input text structure
struct mtmd_input_text {
    const char * text;
    bool add_special;
    bool parse_special;
};

// Chunk types
enum mtmd_input_chunk_type {
    MTMD_INPUT_CHUNK_TYPE_TEXT,
    MTMD_INPUT_CHUNK_TYPE_IMAGE,
    MTMD_INPUT_CHUNK_TYPE_AUDIO,
};

// Context creation / destruction
MTMD_API mtmd_context * mtmd_init_from_file(const char * mmproj_path,
    const struct llama_model * text_model, const struct mtmd_context_params params);
MTMD_API void mtmd_free(mtmd_context * ctx);

// Tokenization and encoding
MTMD_API int32_t mtmd_tokenize(mtmd_context * ctx,
    mtmd_input_chunks * output, const mtmd_input_text * text,
    const mtmd_bitmap ** bitmaps, size_t n_bitmaps);
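// Encodes one media chunk (image or audio); returns 0 on success
MTMD_API int32_t mtmd_encode_chunk(mtmd_context * ctx,
    const mtmd_input_chunk * chunk);
// Embeddings produced by the last encode pass
MTMD_API float * mtmd_get_output_embd(mtmd_context * ctx);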

Import

#include "ggml.h"
#include "llama.h"
// C standard headers
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

I/O Contract

Inputs

Name	Type	Required	Description
mmproj_path	const char*	Yes	Path to the multimodal projector GGUF file
text_model	const llama_model*	Yes	Pointer to the loaded text model
params	mtmd_context_params	Yes	Context configuration (n_threads, verbosity, image marker, etc.)
text	const mtmd_input_text*	Yes	Input text with media markers for tokenization
bitmaps	const mtmd_bitmap**	No	Array of loaded image/audio bitmaps, one per marker in the text, in order
n_bitmaps	size_t	Yes	Number of entries in bitmaps (0 for text-only input)
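
To make the relationship between markers and bitmaps concrete, the following is a minimal sketch of splitting a prompt at each marker and checking that the marker count matches the bitmap count. It is an illustration of the idea, not libmtmd's implementation: the `demo_*` types and the `demo_split` function are hypothetical, and the real chunk types are opaque.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical chunk model for illustration only.
enum class demo_chunk_type { text, media };

struct demo_chunk {
    demo_chunk_type type;
    std::string     text;        // filled for text chunks
    std::size_t     media_index; // index into the bitmap array for media chunks
};

// Splits `prompt` at every occurrence of `marker`, producing text and media chunks.
// Returns false when the marker is empty or the marker count differs from `n_bitmaps`.
bool demo_split(const std::string & prompt, const std::string & marker,
                std::size_t n_bitmaps, std::vector<demo_chunk> & out) {
    out.clear();
    if (marker.empty()) {
        return false; // an empty marker would match at every position
    }
    std::size_t pos     = 0;
    std::size_t n_media = 0;
    while (true) {
        const std::size_t hit = prompt.find(marker, pos);
        const std::size_t len = (hit == std::string::npos) ? std::string::npos : hit - pos;
        const std::string piece = prompt.substr(pos, len);
        if (!piece.empty()) {
            out.push_back({demo_chunk_type::text, piece, 0});
        }
        if (hit == std::string::npos) {
            break; // no more markers
        }
        out.push_back({demo_chunk_type::media, "", n_media++});
        pos = hit + marker.size();
    }
    assert(n_media <= out.size()); // every marker produced exactly one media chunk
    return n_media == n_bitmaps;
}
```

For the prompt `"What is in this image? <__image__>"` with one bitmap, this sketch yields a text chunk followed by a media chunk; passing a different bitmap count makes it report a mismatch.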

Outputs

Name	Type	Description
mtmd_context*	pointer	Initialized multimodal context ready for encoding operations
mtmd_input_chunks*	pointer	Tokenized input split into text and media chunks
embeddings	float*	Encoded media embeddings, read with mtmd_get_output_embd after a successful encode
return code	int32_t	0 on success, non-zero on error
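
Reading the embeddings output requires knowing how large the `float*` buffer is. The sketch below assumes the buffer is a contiguous, row-major array of `n_tokens` rows with `n_embd` floats each; that layout, and the `demo_*` helper names, are assumptions made for illustration and should be checked against the comments in the header.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Size in bytes of a flat embedding buffer holding n_tokens rows of n_embd floats.
std::size_t demo_embd_bytes(std::size_t n_tokens, std::size_t n_embd) {
    return n_tokens * n_embd * sizeof(float);
}

// Copies the embedding row of token `i_token` out of the flat buffer.
std::vector<float> demo_embd_row(const float * embd, std::size_t n_tokens,
                                 std::size_t n_embd, std::size_t i_token) {
    assert(embd != nullptr && i_token < n_tokens);
    const float * row = embd + i_token * n_embd; // start of the requested row
    return std::vector<float>(row, row + n_embd);
}
```

Copying rows out before the next encode call is a conservative choice here, since the table above describes the buffer as holding the result of the most recent encode.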

Usage Examples

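#include "llama.h"
#include "mtmd.h"

// Assumes `text_model` (const llama_model *) and `bmp` (mtmd_bitmap *) were loaded earlier.

// Initialize multimodal context
struct mtmd_context_params params = mtmd_context_params_default();
params.n_threads = 4;
mtmd_context * ctx = mtmd_init_from_file("mmproj.gguf", text_model, params);
if (ctx == NULL) {
    return 1; // projector file could not be loaded
}

// Tokenize interleaved text + image input; pass one bitmap per marker, in order
mtmd_input_text text = { "What is in this image? <__image__>", true, true };
const mtmd_bitmap * bitmaps[] = { bmp };
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
if (mtmd_tokenize(ctx, chunks, &text, bitmaps, 1) != 0) {
    // non-zero means tokenization failed (e.g. marker/bitmap count mismatch)
    mtmd_input_chunks_free(chunks);
    mtmd_free(ctx);
    return 1;
}

// Encode media chunks
for (size_t i = 0; i < mtmd_input_chunks_size(chunks); i++) {
    const mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
    if (mtmd_input_chunk_get_type(chunk) == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
        if (mtmd_encode_chunk(ctx, chunk) == 0) {
            float * embd = mtmd_get_output_embd(ctx); // embeddings from the last encode pass
            // ... hand embd to the text model here ...
        }
    }
}

// Release resources in reverse order of creation
mtmd_input_chunks_free(chunks);
mtmd_free(ctx);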

Related Pages

Principle:Ggml_org_Llama_cpp_Multimodal
