Implementation:Ollama Ollama Mtmd Helper
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, InferenceHelper |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
High-level helper library that simplifies multimodal input handling for applications using the mtmd API, including bitmap loading, token counting, and chunk evaluation.
Description
Provides convenience functions that wrap the lower-level mtmd and llama APIs. Bitmaps can be loaded from files or memory buffers: images are decoded with stb_image, and audio (WAV, MP3, FLAC) is decoded with miniaudio after the format is detected automatically from magic bytes. The key function, mtmd_helper_eval_chunks, processes a sequence of text and image/audio chunks: text chunks go directly to llama_decode(), while media chunks are first encoded with mtmd_encode() and the resulting embeddings are then passed to llama_decode(), with M-RoPE positions tracked along the way. The decode_embd_batch helper struct builds the embedding batches and supports three position layouts: normal, M-RoPE 2D (images), and M-RoPE 1D (audio).
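To make the magic-byte detection concrete, here is a minimal sketch of how a buffer could be classified as WAV, MP3, or FLAC from its leading bytes. The enum and function names (audio_kind, sniff_audio_format) are hypothetical and are not part of the mtmd API; the checks in the actual source may differ in detail.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical enum for this sketch; not part of the mtmd API.
enum class audio_kind { none, wav, mp3, flac };

// Classify a buffer by its leading magic bytes.
static audio_kind sniff_audio_format(const unsigned char * buf, size_t len) {
    if (buf == nullptr || len < 12) {
        return audio_kind::none; // too short to hold a RIFF/WAVE header
    }
    // WAV: "RIFF", a 4-byte chunk size, then "WAVE"
    if (std::memcmp(buf, "RIFF", 4) == 0 && std::memcmp(buf + 8, "WAVE", 4) == 0) {
        return audio_kind::wav;
    }
    // FLAC: stream marker "fLaC"
    if (std::memcmp(buf, "fLaC", 4) == 0) {
        return audio_kind::flac;
    }
    // MP3: an ID3v2 tag, or an MPEG frame sync (11 set bits)
    if (std::memcmp(buf, "ID3", 3) == 0 ||
        (buf[0] == 0xFF && (buf[1] & 0xE0) == 0xE0)) {
        return audio_kind::mp3;
    }
    return audio_kind::none;
}
```

Anything that falls through to audio_kind::none would then be handed to the image decoder in this sketch.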
Usage
Used by application code to load media files and evaluate mixed text/media sequences in a single call, abstracting away the complexity of interleaving text and media encoding/decoding.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/mtmd-helper.cpp
- Lines: 1-521
Signature
size_t mtmd_helper_get_n_tokens(const mtmd_input_chunks * chunks);
llama_pos mtmd_helper_get_n_pos(const mtmd_input_chunks * chunks);
int32_t mtmd_helper_eval_chunks(mtmd_context * ctx, struct llama_context * lctx, const mtmd_input_chunks * chunks, llama_pos n_past, llama_seq_id seq_id, int32_t n_batch, bool logits_last, llama_pos * new_n_past);
void mtmd_helper_log_set(ggml_log_callback log_callback, void * user_data);
struct decode_embd_batch {
    llama_batch batch;
    decode_embd_batch(float * embd, int32_t n_tokens, int n_pos_per_embd, int n_mmproj_embd);
    void set_position_normal(llama_pos pos_0, llama_seq_id seq_id);
    void set_position_mrope_2d(llama_pos pos_0, int nx, int ny, llama_seq_id seq_id);
    void set_position_mrope_1d(llama_pos pos_0, llama_seq_id seq_id);
    llama_batch get_view(int offset, int n_tokens);
};
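As a rough illustration of what a 2D M-RoPE layout such as the one set by set_position_mrope_2d might look like, the sketch below fills a position buffer for an nx-by-ny grid of image embeddings. It assumes four position values per embedding, stored as four consecutive planes; that layout and the function name mrope_2d_positions are assumptions made for this example, not a copy of the source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: builds positions for an nx-by-ny grid of image embeddings.
// Assumes 4 position values per embedding, stored as 4 consecutive planes of
// n = nx * ny entries each: [temporal | row | column | unused].
static std::vector<int32_t> mrope_2d_positions(int32_t pos_0, int nx, int ny) {
    const int n = nx * ny;
    std::vector<int32_t> pos(4 * static_cast<size_t>(n), 0);
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
            const int i = y * nx + x;
            pos[i]         = pos_0;     // same temporal position for the whole image
            pos[i + n]     = pos_0 + y; // row offset
            pos[i + 2 * n] = pos_0 + x; // column offset
            pos[i + 3 * n] = 0;         // fourth component left unused
        }
    }
    return pos;
}
```

A 1D layout for audio would, under the same assumptions, advance a single index per embedding instead of a row/column pair.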
Import
#include "mtmd-helper.h"
#include "mtmd.h"
#include "llama.h"
I/O Contract
Inputs
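| Name | Type | Required | Description |
|---|---|---|---|
| ctx | mtmd_context * | Yes | Multimodal context used to encode media chunks |
| lctx | llama_context * | Yes | LLM context used for decoding |
| chunks | mtmd_input_chunks * | Yes | Tokenized input chunks from mtmd_tokenize |
| n_past | llama_pos | Yes | Starting position of the first chunk; also the base for M-RoPE position tracking |
| seq_id | llama_seq_id | Yes | Sequence ID for batched inference |
| n_batch | int32_t | Yes | Maximum number of tokens or embeddings per decode call |
| logits_last | bool | Yes | Whether to compute logits for the last token of the final chunk |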
Outputs
| Name | Type | Description |
|---|---|---|
| n_tokens | size_t | Total number of tokens across all chunks |
| n_pos | llama_pos | Total positional extent of all chunks |
| return code | int | 0 on success, non-zero on error |
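The two totals above can differ: under M-RoPE, a media chunk may advance the position counter by fewer steps than the number of embeddings it contains. The sketch below shows the summation with a hypothetical chunk_info struct standing in for the opaque chunk type; the field values in the usage note are illustrative, not measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one chunk; the real chunk type is opaque.
struct chunk_info {
    size_t  n_tokens; // tokens or embeddings contained in the chunk
    int32_t n_pos;    // how far the chunk advances the position counter
};

// Sum of token counts across all chunks.
static size_t total_tokens(const std::vector<chunk_info> & chunks) {
    size_t n = 0;
    for (const chunk_info & c : chunks) {
        n += c.n_tokens;
    }
    return n;
}

// Sum of positional extents across all chunks.
static int32_t total_pos(const std::vector<chunk_info> & chunks) {
    int32_t n = 0;
    for (const chunk_info & c : chunks) {
        n += c.n_pos;
    }
    return n;
}
```

For example, a text chunk of 5 tokens, an image chunk of 256 embeddings that counts as a single position, and another text chunk of 3 tokens would sum to 264 tokens but only 9 positions.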
Usage Examples
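// Load an image; the helper returns nullptr on failure.
// Assumes the helper variant that takes the mtmd_context first; check mtmd-helper.h in your tree.
mtmd_bitmap * bmp = mtmd_helper_bitmap_init_from_file(ctx, "photo.jpg");
const mtmd_bitmap * bitmaps[] = { bmp };
// Tokenize the prompt and the image into chunks (text_input is an mtmd_input_text)
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
mtmd_tokenize(ctx, chunks, &text_input, bitmaps, 1);
// Evaluate all chunks (text + image) in one call
llama_pos new_n_past = 0;
int32_t result = mtmd_helper_eval_chunks(ctx, lctx, chunks, n_past, seq_id, n_batch,
                                         /* logits_last */ true, &new_n_past);
// On success (result == 0), continue generation from new_n_past.
// Release resources with mtmd_input_chunks_free(chunks) and mtmd_bitmap_free(bmp).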
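
The n_batch argument in the call above caps how much is decoded at once. A plausible way to honor that cap, and the kind of slicing a method like get_view(offset, n_tokens) suggests, is to walk the embeddings in windows of at most n_batch entries. The function below is a hypothetical sketch of that windowing only; it does not call into llama or mtmd.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch only: split n_tokens entries into (offset, count) windows of at most
// n_batch entries each. Each pair could then be used to build one batch view.
static std::vector<std::pair<int32_t, int32_t>> split_into_views(int32_t n_tokens,
                                                                 int32_t n_batch) {
    std::vector<std::pair<int32_t, int32_t>> views;
    if (n_batch <= 0) {
        return views; // nothing sensible to do with a non-positive batch size
    }
    for (int32_t offset = 0; offset < n_tokens; offset += n_batch) {
        views.emplace_back(offset, std::min(n_batch, n_tokens - offset));
    }
    return views;
}
```

With 10 embeddings and n_batch set to 4, this yields three windows: two of 4 entries and a final one of 2.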