Implementation:Ggml org Llama cpp Mtmd Helper

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Multimodal, Utilities
Last Updated	2026-02-15 00:00 GMT

Overview

Public helper library providing convenience functions for loading media files and evaluating multimodal input chunks through the llama.cpp inference pipeline.

Description

Embeds `stb_image.h` (image loading) and `miniaudio.h` (audio decoding) as single-header implementations. Provides `mtmd_helper_bitmap_init_from_file` and `mtmd_helper_bitmap_init_from_buf`, which auto-detect the media type and load images via stb or audio via miniaudio, resampling audio to the model's sample rate. Implements `mtmd_helper_eval_chunks`, which iterates over text, image, and audio chunks: text tokens go straight to `llama_decode`, while media chunks are first encoded with `mtmd_encode` and the resulting embeddings are then decoded. Decoding is split into batches of at most `n_batch` tokens, and positions are tracked across chunks. Includes special handling for models that require non-causal attention while decoding image embeddings.
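As a rough illustration of the auto-detection step, the media type can be inferred from the leading magic bytes of the buffer. The sketch below is a standalone approximation, not the helper's actual code: the function name `looks_like_audio` is hypothetical, and the set of signatures checked (WAV, MP3, FLAC) is an assumption about which audio containers matter. Anything not recognized as audio would be handed to the image loader.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of the mtmd API): returns true when the buffer
// starts with a WAV, MP3, or FLAC signature.
bool looks_like_audio(const unsigned char * buf, size_t len) {
    if (buf == nullptr || len < 12) {
        return false; // too short to hold a full RIFF/WAVE header
    }
    // WAV: "RIFF" at offset 0 and "WAVE" at offset 8
    const bool is_wav = std::memcmp(buf, "RIFF", 4) == 0 &&
                        std::memcmp(buf + 8, "WAVE", 4) == 0;
    // MP3: an ID3 tag, or an MPEG frame sync (11 set bits); simplified check
    const bool is_mp3 = std::memcmp(buf, "ID3", 3) == 0 ||
                        (buf[0] == 0xFF && (buf[1] & 0xE0) == 0xE0);
    // FLAC: "fLaC" stream marker
    const bool is_flac = std::memcmp(buf, "fLaC", 4) == 0;
    return is_wav || is_mp3 || is_flac;
}
```

A JPEG header (`FF D8 ...`) does not match the simplified MP3 sync check, because `0xD8 & 0xE0` is `0xC0`, so common image formats fall through to the image path under this scheme.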

Usage

Use this library when building applications that need to load media files and feed multimodal input to llama.cpp models. It simplifies the complex orchestration of encoding and decoding interleaved text, image, and audio sequences.
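Part of that orchestration is splitting each chunk into decode calls of at most `n_batch` tokens while keeping positions contiguous. The following is a minimal standalone sketch of that bookkeeping only; `batch_plan` and `plan_batches` are hypothetical names, not llama.cpp types, and the sketch assumes one position per token, which does not hold for models using M-RoPE.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical description of one decode call; not an mtmd or llama.cpp type.
struct batch_plan {
    int32_t offset;      // index of the first token of this batch within the chunk
    int32_t n_tokens;    // number of tokens in this batch
    int32_t pos_start;   // position assigned to the first token of this batch
    bool    want_logits; // request logits for the final token of this batch
};

// Split n_tokens into batches of at most n_batch tokens, starting at position n_past.
// Logits are requested only on the very last batch, and only if logits_last is set.
std::vector<batch_plan> plan_batches(int32_t n_tokens, int32_t n_past,
                                     int32_t n_batch, bool logits_last) {
    std::vector<batch_plan> plan;
    if (n_tokens <= 0 || n_batch <= 0) {
        return plan; // nothing to decode, or invalid batch size
    }
    for (int32_t i = 0; i < n_tokens; i += n_batch) {
        const int32_t n       = std::min(n_batch, n_tokens - i);
        const bool    is_last = (i + n == n_tokens);
        plan.push_back({ i, n, n_past + i, logits_last && is_last });
    }
    return plan;
}
```

For example, `plan_batches(10, 5, 4, true)` yields three batches of 4, 4, and 2 tokens starting at positions 5, 9, and 13, with logits requested only for the last one.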

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: tools/mtmd/mtmd-helper.cpp
Lines: 1-521

Signature

// Media file loading (both return nullptr on failure)
mtmd_bitmap * mtmd_helper_bitmap_init_from_file(mtmd_context * ctx,
    const char * path);
mtmd_bitmap * mtmd_helper_bitmap_init_from_buf(mtmd_context * ctx,
    const unsigned char * buf, size_t len);

// Multimodal chunk evaluation (returns 0 on success)
int32_t mtmd_helper_eval_chunks(mtmd_context * ctx_mtmd,
    struct llama_context * ctx_llama,
    const mtmd_input_chunks * chunks,
    llama_pos pos0,
    llama_seq_id seq_id,
    int32_t n_batch,
    bool logits_last,
    llama_pos * new_n_past);

Import

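// Headers needed by callers
#include "mtmd.h"
#include "mtmd-helper.h"
#include "llama.h"
// Embedded by mtmd-helper.cpp itself; callers do not need to include these
#include "stb/stb_image.h"
#include "miniaudio/miniaudio.h"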

I/O Contract

Inputs

Name	Type	Required	Description
path	const char*	Yes (for file loading)	Path to an image or audio file to load
buf / len	const unsigned char* / size_t	Yes (for buffer loading)	Raw file data buffer and its length in bytes
ctx_mtmd	mtmd_context*	Yes	Initialized multimodal context
ctx_llama	llama_context*	Yes	Initialized llama context for decoding
chunks	const mtmd_input_chunks*	Yes	Tokenized multimodal input chunks to evaluate
pos0	llama_pos	Yes	Starting position in the KV cache
seq_id	llama_seq_id	Yes	Sequence ID for the KV cache
n_batch	int32_t	Yes	Maximum batch size for decoding
logits_last	bool	Yes	Whether to compute logits for the last token of the last chunk
new_n_past	llama_pos*	Yes	Output pointer (last argument of mtmd_helper_eval_chunks) that receives the position after the last evaluated chunk

Outputs

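Name	Type	Description
mtmd_bitmap*	pointer	Loaded image or audio data ready for multimodal tokenization; nullptr on failure
return code	int32_t	0 on success; otherwise the non-zero error code forwarded from the failing mtmd_encode or llama_decode call (the updated position is reported through the new_n_past output argument, not the return value)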

Usage Examples

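#include <string>
#include "mtmd.h"
#include "mtmd-helper.h"

// Assumes ctx_mtmd and ctx_llama are already initialized.
// Load an image file (returns nullptr on failure)
mtmd_bitmap * bmp = mtmd_helper_bitmap_init_from_file(ctx_mtmd, "photo.jpg");
if (!bmp) { /* handle load failure */ }

// Tokenize text with an embedded media marker (assumes the context uses the default marker)
std::string prompt = std::string("Describe this image: ") + mtmd_default_marker();
mtmd_input_text text = { prompt.c_str(), true, true }; // add_special, parse_special
const mtmd_bitmap * bitmaps[] = { bmp };
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
if (mtmd_tokenize(ctx_mtmd, chunks, &text, bitmaps, 1) != 0) { /* handle tokenize failure */ }

// Evaluate all chunks; returns 0 on success and writes the next position to new_n_past
llama_pos new_n_past = 0;
int32_t ret = mtmd_helper_eval_chunks(
    ctx_mtmd, ctx_llama, chunks, 0, 0, 512, true, &new_n_past);
if (ret != 0) { /* handle eval failure */ }

// Release resources
mtmd_input_chunks_free(chunks);
mtmd_bitmap_free(bmp);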

Related Pages

Principle:Ggml_org_Llama_cpp_Multimodal
