Implementation:Ollama Ollama Mtmd API
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, API |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Public header for libmtmd, the multimodal support library for llama.cpp, declaring the C API and C++ convenience wrappers for vision and audio processing.
Description
Defines the complete public interface for multimodal functionality. Declares opaque types (mtmd_context, mtmd_bitmap, mtmd_image_tokens, mtmd_input_chunk, mtmd_input_chunks), the mtmd_input_text struct, the mtmd_context_params configuration struct, and the mtmd_input_chunk_type enum (text/image/audio). The C API covers context lifecycle (mtmd_init_from_file, mtmd_free), bitmap management (create, set ID, get data), tokenization (mtmd_tokenize), encoding (mtmd_encode), output retrieval (mtmd_get_output_embd), and model capability queries (vision/audio support, M-RoPE usage, audio bitrate). The C++ section provides RAII smart pointer wrappers and an mtmd namespace with bitmap, bitmaps, and input_chunks helper types.
Usage
Included by any code that needs to process images or audio alongside text, including mtmd-helper.cpp, mtmd-cli.cpp, and Ollama's Go CGo bridge.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/mtmd.h
- Lines: 1-315
Signature
enum mtmd_input_chunk_type {
    MTMD_INPUT_CHUNK_TYPE_TEXT,
    MTMD_INPUT_CHUNK_TYPE_IMAGE,
    MTMD_INPUT_CHUNK_TYPE_AUDIO,
};

struct mtmd_context_params {
    bool use_gpu;
    bool print_timings;
    int n_threads;
    const char * media_marker;
    enum llama_flash_attn_type flash_attn_type;
    bool warmup;
    int image_min_tokens;
    int image_max_tokens;
};

MTMD_API mtmd_context * mtmd_init_from_file(const char * mmproj_fname,
                                            const struct llama_model * text_model,
                                            const struct mtmd_context_params ctx_params);
MTMD_API void mtmd_free(mtmd_context * ctx);

MTMD_API int32_t mtmd_tokenize(mtmd_context * ctx, mtmd_input_chunks * chunks,
                               const mtmd_input_text * text, const mtmd_bitmap ** bitmaps, size_t n_bitmaps);
MTMD_API int32_t mtmd_encode(mtmd_context * ctx, const mtmd_input_chunk * chunk);
MTMD_API float * mtmd_get_output_embd(mtmd_context * ctx);

MTMD_API bool mtmd_support_vision(mtmd_context * ctx);
MTMD_API bool mtmd_support_audio(mtmd_context * ctx);
MTMD_API bool mtmd_decode_use_mrope(mtmd_context * ctx);
Import
#include "mtmd.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| mmproj_fname | const char * | Yes | Path to multimodal projector GGUF file |
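| text_model | const struct llama_model * | Yes | Loaded text model (LLM) used for tokenization |
| ctx_params | struct mtmd_context_params | Yes | Configuration for GPU use, thread count, media marker, and image token limits |
| text | const mtmd_input_text * | Yes | Text prompt containing media markers |
| bitmaps | const mtmd_bitmap ** | Yes | Array of image/audio bitmaps |
| n_bitmaps | size_t | Yes | Number of entries in bitmaps |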
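
The text and bitmaps inputs are linked: each media marker in the prompt is expected to correspond to one entry in bitmaps. A minimal sketch of a pre-flight check that counts markers before calling mtmd_tokenize is shown below. The helper name count_media_markers is hypothetical and is not part of mtmd.h; the marker string is passed in by the caller rather than assumed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of mtmd.h): count non-overlapping
 * occurrences of `marker` in `prompt`. Returns 0 when either argument
 * is NULL or the marker is an empty string. */
static size_t count_media_markers(const char *prompt, const char *marker) {
    if (prompt == NULL || marker == NULL || marker[0] == '\0') {
        return 0;
    }
    size_t count = 0;
    const size_t marker_len = strlen(marker);
    const char *pos = prompt;
    while ((pos = strstr(pos, marker)) != NULL) {
        count++;
        pos += marker_len; /* continue searching after this match */
    }
    return count;
}
```

A caller could compare the result against n_bitmaps and report a mismatch before any tokenization work is done.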
Outputs
| Name | Type | Description |
|---|---|---|
| ctx | mtmd_context * | Initialized multimodal context |
| chunks | mtmd_input_chunks * | Tokenized text/media chunk sequence |
| embd | float * | Encoded media embedding vector |
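
Besides the outputs listed above, mtmd_tokenize returns an int32_t status. The sketch below maps that status to a readable message. It assumes the convention described in the upstream llama.cpp header comments (0 for success, 1 when the bitmap count does not match the marker count, 2 for an image preprocessing error); verify this against the vendored mtmd.h before relying on it. The function name tokenize_status_str is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper (not part of mtmd.h): describe the status code
 * returned by mtmd_tokenize. The code meanings are an assumption taken
 * from upstream header comments and may differ in other versions. */
static const char *tokenize_status_str(int32_t status) {
    switch (status) {
        case 0:  return "success";
        case 1:  return "bitmap count does not match marker count";
        case 2:  return "image preprocessing error";
        default: return "unknown error";
    }
}
```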
Usage Examples
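// C API usage. Assumes `model` is an already loaded llama_model * and
// `rgb_data` points to the pixel data of a 224x224 image.
struct mtmd_context_params params = mtmd_context_params_default();
mtmd_context * ctx = mtmd_init_from_file("mmproj.gguf", model, params);
// In real code, check ctx != NULL before continuing.

mtmd_bitmap * bmp = mtmd_bitmap_init(224, 224, rgb_data);
mtmd_bitmap_set_id(bmp, "img_001");

// mtmd_input_text is a plain struct, so it is filled in directly.
mtmd_input_text text;
text.text          = "Describe: <__media__>";
text.add_special   = true;
text.parse_special = true;

mtmd_input_chunks * chunks = mtmd_input_chunks_init();
if (mtmd_tokenize(ctx, chunks, &text, (const mtmd_bitmap **)&bmp, 1) == 0) {
    for (size_t i = 0; i < mtmd_input_chunks_size(chunks); i++) {
        const mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
        if (mtmd_input_chunk_get_type(chunk) == MTMD_INPUT_CHUNK_TYPE_IMAGE) {
            mtmd_encode(ctx, chunk);
            float * embd = mtmd_get_output_embd(ctx); // embeddings for this chunk
            (void) embd; // pass to the text model's decode step here
        }
    }
}

mtmd_input_chunks_free(chunks);
mtmd_bitmap_free(bmp);
mtmd_free(ctx);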
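
The example above takes its pixel buffer as given. A minimal sketch of how such a buffer could be prepared follows, assuming mtmd_bitmap_init expects tightly packed 8-bit RGB data (R, G, B per pixel, nx * ny * 3 bytes in total), as the upstream header comments describe. Both helper names are hypothetical and are not part of mtmd.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: bytes needed for an nx-by-ny image with
 * 3 bytes (R, G, B) per pixel and no row padding. */
static size_t rgb_buffer_size(uint32_t nx, uint32_t ny) {
    return (size_t)nx * (size_t)ny * 3u;
}

/* Hypothetical helper: allocate a packed RGB buffer filled with a single
 * color. Returns NULL for a zero-sized image or on allocation failure.
 * The caller releases the buffer with free(). */
static unsigned char *make_solid_rgb(uint32_t nx, uint32_t ny,
                                     unsigned char r, unsigned char g,
                                     unsigned char b) {
    const size_t n = rgb_buffer_size(nx, ny);
    if (n == 0) {
        return NULL;
    }
    unsigned char *buf = (unsigned char *)malloc(n);
    if (buf == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; i += 3) {
        buf[i]     = r;
        buf[i + 1] = g;
        buf[i + 2] = b;
    }
    return buf;
}
```

Under that layout assumption, a buffer from make_solid_rgb(224, 224, ...) could serve as the data argument for a 224x224 bitmap; whether the bitmap copies the data or keeps a reference to it should be confirmed in the header before freeing the buffer.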