Implementation:Ollama Ollama Mtmd Helper

Knowledge Sources	Ollama
Domains	Multimodal, InferenceHelper
Last Updated	2025-02-15 00:00 GMT

Overview

High-level helper library that simplifies multimodal input handling for applications using the mtmd API, including bitmap loading, token counting, and chunk evaluation.

Description

Provides convenience functions wrapping lower-level mtmd and llama APIs. Includes bitmap loading from files and buffers using stb_image for images and miniaudio for audio formats (WAV/MP3/FLAC with automatic detection via magic bytes). The key mtmd_helper_eval_chunks function processes a sequence of text and image/audio chunks by dispatching text chunks to llama_decode() and media chunks through mtmd_encode() then llama_decode() with proper embedding handling and M-RoPE position tracking. The decode_embd_batch helper struct manages embedding batch construction with support for normal, M-RoPE 2D (images), and M-RoPE 1D (audio) position layouts.
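The magic-byte detection mentioned above can be illustrated with a small standalone sketch. It is not taken from mtmd-helper.cpp: the names MediaKind and sniff_media_kind are hypothetical, and the checks rely only on the well-known file signatures for FLAC ("fLaC"), WAV ("RIFF" ... "WAVE"), and MP3 (an "ID3" tag or an 11-bit frame sync).

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical classification result; not part of the mtmd API.
enum class MediaKind { Wav, Mp3, Flac, Other };

// Hypothetical helper: classify a buffer by its leading bytes only.
MediaKind sniff_media_kind(const unsigned char * buf, std::size_t len) {
    // FLAC streams begin with the ASCII marker "fLaC".
    if (len >= 4 && std::memcmp(buf, "fLaC", 4) == 0) {
        return MediaKind::Flac;
    }
    // WAV is a RIFF container: "RIFF", a 4-byte size field, then "WAVE".
    if (len >= 12 && std::memcmp(buf, "RIFF", 4) == 0 &&
        std::memcmp(buf + 8, "WAVE", 4) == 0) {
        return MediaKind::Wav;
    }
    // MP3 files often start with an ID3v2 tag...
    if (len >= 3 && std::memcmp(buf, "ID3", 3) == 0) {
        return MediaKind::Mp3;
    }
    // ...or directly with a frame header whose first 11 bits are all set.
    if (len >= 2 && buf[0] == 0xFF && (buf[1] & 0xE0) == 0xE0) {
        return MediaKind::Mp3;
    }
    // Anything else (for example JPEG or PNG) would go to the image decoder.
    return MediaKind::Other;
}
```

A caller would route MediaKind::Other buffers to the image path and everything else to the audio path. The frame-sync test is deliberately loose, so a real implementation may want stricter validation.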

Usage

Used by application code to load media files and evaluate mixed text/media sequences in a single call, abstracting away the complexity of interleaving text and media encoding/decoding.
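The interleaving that the helper hides can be pictured with a toy model. The sketch below uses stand-in types (Chunk, EvalPlan, and plan_eval are hypothetical and do not exist in mtmd) and records which calls a helper of this kind would issue per chunk, plus how the running position advances. The n_pos field is kept separate from n_tokens on the assumption that a media chunk may occupy a different number of positions than it has embeddings.

```cpp
#include <string>
#include <vector>

// Stand-in types for illustration; the real chunk type is an opaque mtmd struct.
enum class ChunkType { Text, Image, Audio };

struct Chunk {
    ChunkType type;
    int n_tokens; // tokens (text) or embeddings (media) in the chunk
    int n_pos;    // positions the chunk occupies; may differ from n_tokens
};

struct EvalPlan {
    std::vector<std::string> steps; // ordered calls a helper would make
    int new_n_past;                 // position after the last chunk
};

// Hypothetical planner: text chunks are decoded directly, media chunks are
// encoded first and then decoded as embeddings.
EvalPlan plan_eval(const std::vector<Chunk> & chunks, int n_past) {
    EvalPlan plan{{}, n_past};
    for (const Chunk & c : chunks) {
        if (c.type != ChunkType::Text) {
            plan.steps.push_back("encode");
        }
        plan.steps.push_back("decode");
        plan.new_n_past += c.n_pos;
    }
    return plan;
}
```

In application code the single helper call replaces this loop, so the caller only has to track the position returned after the last chunk.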

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/mtmd-helper.cpp
Lines: 1-521

Signature

size_t mtmd_helper_get_n_tokens(const mtmd_input_chunks * chunks);
llama_pos mtmd_helper_get_n_pos(const mtmd_input_chunks * chunks);
void mtmd_helper_log_set(ggml_log_callback log_callback, void * user_data);

struct decode_embd_batch {
    llama_batch batch;
    decode_embd_batch(float * embd, int32_t n_tokens, int n_pos_per_embd, int n_mmproj_embd);
    void set_position_normal(llama_pos pos_0, llama_seq_id seq_id);
    void set_position_mrope_2d(llama_pos pos_0, int nx, int ny, llama_seq_id seq_id);
    void set_position_mrope_1d(llama_pos pos_0, llama_seq_id seq_id);
    llama_batch get_view(int offset, int n_tokens);
};
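To make the three position layouts more concrete, here is a minimal sketch of how the 2D and 1D M-RoPE variants might fill a position buffer. It assumes four position values per embedding, stored one dimension after another (all first-dimension values, then all second-dimension values, and so on), with the last dimension unused. The exact assignment of row and column to dimensions is an assumption and is not quoted from mtmd-helper.cpp; pos_t stands in for llama_pos, and both function names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

using pos_t = std::int32_t; // stand-in for llama_pos

// Hypothetical: positions for an nx-by-ny grid of image embeddings.
// Layout: [dim0 for all tokens][dim1 ...][dim2 ...][dim3 ...].
std::vector<pos_t> mrope_2d_positions(pos_t pos_0, int nx, int ny) {
    const int n = nx * ny;
    std::vector<pos_t> pos(4 * n, 0);
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
            const int i = y * nx + x;
            pos[i        ] = pos_0;     // shared across the whole image
            pos[i + n    ] = pos_0 + y; // row index
            pos[i + n * 2] = pos_0 + x; // column index
            pos[i + n * 3] = 0;         // unused dimension
        }
    }
    return pos;
}

// Hypothetical: positions for a linear run of audio embeddings, where the
// first three dimensions all advance together.
std::vector<pos_t> mrope_1d_positions(pos_t pos_0, int n_tokens) {
    std::vector<pos_t> pos(4 * n_tokens, 0);
    for (int i = 0; i < n_tokens; i++) {
        pos[i               ] = pos_0 + i;
        pos[i + n_tokens    ] = pos_0 + i;
        pos[i + n_tokens * 2] = pos_0 + i;
        pos[i + n_tokens * 3] = 0; // unused dimension
    }
    return pos;
}
```

The normal (non-M-RoPE) layout needs only one value per embedding, pos_0 + i, which is why the constructor takes n_pos_per_embd as a parameter.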

Import

#include "mtmd-helper.h"
#include "mtmd.h"
#include "llama.h"

I/O Contract

Inputs

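Name	Type	Required	Description
ctx	mtmd_context *	Yes	Multimodal context used to encode media chunks
lctx	llama_context *	Yes	LLM context for text decoding
chunks	mtmd_input_chunks *	Yes	Tokenized input chunks from mtmd_tokenize
n_past	llama_pos	Yes	Starting position of the first chunk; base for M-RoPE position tracking
seq_id	llama_seq_id	Yes	Sequence ID for batched inference
n_batch	int32_t	Yes	Maximum number of tokens per decode call
logits_last	bool	Yes	Whether to compute logits for the last token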
Outputs

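Name	Type	Description
n_tokens	size_t	Total number of tokens across all chunks (returned by mtmd_helper_get_n_tokens)
n_pos	llama_pos	Total positional extent of all chunks (returned by mtmd_helper_get_n_pos)
return code	int	0 on success, negative on error (returned by mtmd_helper_eval_chunks)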

Usage Examples

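// Load an image from disk
mtmd_bitmap * bmp = mtmd_helper_bitmap_init_from_file("photo.jpg");

// Tokenize the prompt and image into chunks
// (text_input is a const mtmd_input_text * prepared by the caller)
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
const mtmd_bitmap * bitmaps[] = { bmp };
mtmd_tokenize(ctx, chunks, text_input, bitmaps, 1);

// Evaluate all chunks (text + image) in one call;
// new_n_past receives the position after the last chunk
llama_pos new_n_past = n_past;
int result = mtmd_helper_eval_chunks(ctx, lctx, chunks, n_past, seq_id, n_batch,
                                     /* logits_last */ true, &new_n_past);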
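The n_batch argument bounds how many tokens or embeddings go into one decode call, which is what a method like get_view(offset, n_tokens) is for. The sketch below shows one way such a split could be computed. BatchView and split_into_views are hypothetical names used only for illustration, and the loop is a guess at the general pattern rather than a copy of the helper's code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical description of one slice of a larger embedding batch.
struct BatchView {
    std::int32_t offset;   // index of the first embedding in the slice
    std::int32_t n_tokens; // number of embeddings in the slice
};

// Split n_tokens embeddings into consecutive slices of at most n_batch each.
std::vector<BatchView> split_into_views(std::int32_t n_tokens, std::int32_t n_batch) {
    std::vector<BatchView> views;
    if (n_batch <= 0) {
        return views; // nothing sensible to do with a non-positive batch size
    }
    for (std::int32_t offset = 0; offset < n_tokens; offset += n_batch) {
        views.push_back({offset, std::min(n_batch, n_tokens - offset)});
    }
    return views;
}
```

Each slice would then be decoded in order, so a large image whose embedding count exceeds n_batch is still evaluated completely.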

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
