Implementation:Ggml org Llama cpp Mtmd Helper
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Utilities |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Public helper library providing convenience functions for loading media files and evaluating multimodal input chunks through the llama.cpp inference pipeline.
Description
The implementation file embeds `stb_image.h` (image loading) and `miniaudio.h` (audio decoding) as single-header implementations. `mtmd_helper_bitmap_init_from_file` and `mtmd_helper_bitmap_init_from_buf` auto-detect the media type and load images via stb or audio via miniaudio, resampling audio to the model's sample rate. `mtmd_helper_eval_chunks` iterates over text, image, and audio chunks: it runs `llama_decode` on text tokens, and for media chunks it runs `mtmd_encode` and then decodes the resulting embeddings, splitting the work into batches and tracking positions as it goes. Models that require non-causal attention while decoding image embeddings are handled as a special case.
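The auto-detection step described above can be pictured as a magic-byte check on the start of the buffer. The following is a minimal sketch under stated assumptions, not the library's actual code: `looks_like_audio` is a hypothetical name, and the set of signatures checked (WAV, MP3, FLAC) is an assumption based on formats miniaudio commonly decodes. A buffer that matches none of them would be handed to the image loader instead.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of mtmd-helper.h): guess whether a buffer
// holds audio by inspecting its leading magic bytes.
bool looks_like_audio(const unsigned char * buf, size_t len) {
    if (buf == nullptr || len < 12) {
        return false; // too short to hold a RIFF/WAVE header
    }
    // WAV: "RIFF" at offset 0 and "WAVE" at offset 8
    const bool is_wav  = std::memcmp(buf, "RIFF", 4) == 0 &&
                         std::memcmp(buf + 8, "WAVE", 4) == 0;
    // MP3: an ID3v2 tag, or an MPEG frame sync (11 leading set bits)
    const bool is_mp3  = std::memcmp(buf, "ID3", 3) == 0 ||
                         (buf[0] == 0xFF && (buf[1] & 0xE0) == 0xE0);
    // FLAC: "fLaC" stream marker
    const bool is_flac = std::memcmp(buf, "fLaC", 4) == 0;
    return is_wav || is_mp3 || is_flac;
}
```

Checking only a few leading bytes keeps detection cheap and independent of the file extension, which matters when the data arrives as an in-memory buffer rather than a path.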
Usage
Use this library when building applications that need to load media files and feed multimodal input to llama.cpp models. It simplifies the complex orchestration of encoding and decoding interleaved text, image, and audio sequences.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/mtmd/mtmd-helper.cpp
- Lines: 1-521
Signature
// Media file loading (both return nullptr on failure)
mtmd_bitmap * mtmd_helper_bitmap_init_from_file(mtmd_context * ctx,
                                                const char * path);
mtmd_bitmap * mtmd_helper_bitmap_init_from_buf(mtmd_context * ctx,
                                               const unsigned char * buf,
                                               size_t len);
// Multimodal chunk evaluation (returns 0 on success)
int32_t mtmd_helper_eval_chunks(mtmd_context * ctx_mtmd,
                                llama_context * ctx_llama,
                                const mtmd_input_chunks * chunks,
                                llama_pos pos0,
                                llama_seq_id seq_id,
                                int32_t n_batch,
                                bool logits_last,
                                llama_pos * new_n_past);
Import
#include "mtmd.h"
#include "mtmd-helper.h"
#include "llama.h"
// Included by mtmd-helper.cpp itself; callers do not need these:
#include "stb/stb_image.h"
#include "miniaudio/miniaudio.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | const char* | Yes (for file loading) | Path to an image or audio file to load |
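| buf / len | const unsigned char* / size_t | Yes (for buffer loading) | Raw file contents and their length in bytes |
| ctx_mtmd | mtmd_context* | Yes | Initialized multimodal context; also passed to the loading functions |
| ctx_llama | llama_context* | Yes | Initialized llama context for decoding |
| chunks | mtmd_input_chunks* | Yes | Tokenized multimodal input chunks to evaluate |
| pos0 | llama_pos | Yes | Starting position in the KV cache |
| seq_id | llama_seq_id | Yes | Sequence ID for the KV cache |
| n_batch | int32_t | Yes | Maximum batch size for decoding |
| logits_last | bool | Yes | Whether to request logits for the final token of the last chunk |
| new_n_past | llama_pos* | Yes | Out-parameter in the upstream `mtmd-helper.h` declaration; receives the position after the last evaluated chunk |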
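To make the roles of `pos0`, `n_batch`, and `logits_last` concrete, here is a minimal sketch of how a long run of tokens could be split into decode calls. It is an illustration under assumptions, not the helper's implementation: `BatchPlan` and `plan_batches` are hypothetical names, and the sketch assumes each token occupies exactly one position (which may not hold for models whose positional scheme differs from the token count).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical description of one decode call; not an mtmd or llama.cpp type.
struct BatchPlan {
    int32_t offset;      // index of the first token in this batch
    int32_t n_tokens;    // number of tokens in this batch
    int32_t pos_start;   // position assigned to the first token
    bool    want_logits; // request logits for the batch's final token
};

// Split n_tokens into batches of at most n_batch tokens, starting at pos0.
std::vector<BatchPlan> plan_batches(int32_t n_tokens, int32_t n_batch,
                                    int32_t pos0, bool logits_last) {
    std::vector<BatchPlan> plan;
    if (n_tokens <= 0 || n_batch <= 0) {
        return plan; // nothing to schedule
    }
    for (int32_t off = 0; off < n_tokens; off += n_batch) {
        const int32_t n = std::min(n_batch, n_tokens - off);
        const bool is_last = (off + n == n_tokens);
        // Only the final batch asks for logits, and only if requested.
        plan.push_back({off, n, pos0 + off, logits_last && is_last});
    }
    return plan;
}
```

For example, 1100 tokens with `n_batch = 512` and `pos0 = 10` yield three batches of 512, 512, and 76 tokens starting at positions 10, 522, and 1034.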
Outputs
| Name | Type | Description |
|---|---|---|
| bitmap | mtmd_bitmap* | Loaded image or audio data ready for multimodal tokenization; `nullptr` on failure |
| return code | int32_t | 0 on success; otherwise the non-zero error forwarded from the failing encode or decode call |
Usage Examples
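#include "mtmd.h"
#include "mtmd-helper.h"
// ctx_mtmd and ctx_llama are assumed to be initialized already.
// Load an image file; the loader returns nullptr on failure
mtmd_bitmap * bmp = mtmd_helper_bitmap_init_from_file(ctx_mtmd, "photo.jpg");
// Tokenize text with an embedded image marker
mtmd_input_text text = { "Describe this image: <__image__>", true, true };
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
const mtmd_bitmap * bitmaps[] = { bmp };
mtmd_tokenize(ctx_mtmd, chunks, &text, bitmaps, 1);
// Evaluate all chunks through the model; the next position is written to new_n_past
llama_pos new_n_past = 0;
int32_t ret = mtmd_helper_eval_chunks(
    ctx_mtmd, ctx_llama, chunks, 0, 0, 512, true, &new_n_past);
// ret == 0 indicates success
// Release resources once evaluation is done
mtmd_input_chunks_free(chunks);
mtmd_bitmap_free(bmp);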