Implementation:Ggml org Llama cpp Mtmd Tokenize And Encode

Aspect	Detail
Implementation Name	Mtmd Tokenize And Encode
Doc Type	API Doc
Domain	Multimodal Inference
Purpose	Tokenizing mixed text+media prompts, encoding media chunks, and evaluating the full sequence
Related Workflow	Multimodal_Inference

Overview

Description

This page documents the three functions that form the final stage of the multimodal inference pipeline:

mtmd_tokenize(): Splits a mixed text+media prompt into an ordered list of text and media chunks
mtmd_encode_chunk(): Encodes a single media chunk (image or audio) through the projector to produce embeddings
mtmd_helper_eval_chunks(): Orchestrates the full evaluation of all chunks through the language model, handling text decoding, media encoding, and batching automatically

Usage

These functions are called after the bitmaps have been prepared. mtmd_tokenize() always runs first; evaluation then follows either the manual path or the helper. The typical flow is:

Create an empty mtmd_input_chunks container
Call mtmd_tokenize() with the text prompt and bitmap array
Either manually iterate over chunks calling mtmd_encode_chunk() and llama_decode(), or use mtmd_helper_eval_chunks() for the automated pipeline

Code Reference

Aspect	Detail
Header (tokenize/encode)	`tools/mtmd/mtmd.h:158-228`
Source (tokenize)	`tools/mtmd/mtmd.cpp:802-809`
Source (encode_chunk)	`tools/mtmd/mtmd.cpp:811-838`
Header (helper eval)	`tools/mtmd/mtmd-helper.h:56-63`
Source (helper eval)	`tools/mtmd/mtmd-helper.cpp:379-406`
Import	`#include "mtmd.h"` and `#include "mtmd-helper.h"`

mtmd_tokenize() signature (from mtmd.h):

// Tokenize an input text prompt and a list of bitmaps (images/audio).
// The prompt must have the input media marker (default: "<__media__>") in it.
// Number of bitmaps must equal the number of markers in the prompt.
// This function is thread-safe (shared ctx).
// Return values:
//   0 on success
//   1 on number of bitmaps not matching the number of markers
//   2 on image preprocessing error
MTMD_API int32_t mtmd_tokenize(mtmd_context * ctx,
                               mtmd_input_chunks * output,
                               const mtmd_input_text * text,
                               const mtmd_bitmap ** bitmaps,
                               size_t n_bitmaps);

mtmd_tokenize() source (mtmd.cpp:802-809):

int32_t mtmd_tokenize(mtmd_context * ctx,
            mtmd_input_chunks * output,
            const mtmd_input_text * text,
            const mtmd_bitmap ** bitmaps,
            size_t n_bitmaps) {
    mtmd_tokenizer tokenizer(ctx, text, bitmaps, n_bitmaps);
    return tokenizer.tokenize(output);
}
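
As a rough illustration of what splitting on the media marker means, the sketch below divides a prompt into ordered text and media parts. It is a standalone approximation, not the library's tokenizer: `Part`, `split_on_marker()` and `count_media()` are hypothetical names, and the real function also converts text to tokens and preprocesses each bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a chunk: either a text span or a media slot.
struct Part {
    bool        is_media;  // true when this part corresponds to a marker
    std::string text;      // text content (empty for media parts)
};

// Split `prompt` on every occurrence of `marker`, preserving order.
// Empty text spans (e.g. a marker at the very start) are skipped.
std::vector<Part> split_on_marker(const std::string & prompt,
                                  const std::string & marker = "<__media__>") {
    assert(!marker.empty());
    std::vector<Part> parts;
    size_t pos = 0;
    while (true) {
        size_t hit = prompt.find(marker, pos);
        size_t len = (hit == std::string::npos) ? std::string::npos : hit - pos;
        std::string span = prompt.substr(pos, len);
        if (!span.empty()) parts.push_back({false, span});
        if (hit == std::string::npos) break;
        parts.push_back({true, ""});
        pos = hit + marker.size();
    }
    return parts;
}

// Number of media parts; this is the count that has to equal n_bitmaps.
size_t count_media(const std::vector<Part> & parts) {
    size_t n = 0;
    for (const auto & p : parts) if (p.is_media) n++;
    return n;
}
```

Comparing `count_media()` against the number of bitmaps before calling mtmd_tokenize() is one way to catch a return value of 1 early.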

mtmd_encode_chunk() signature (source at mtmd.cpp:811-838):

// Returns 0 on success
MTMD_API int32_t mtmd_encode_chunk(mtmd_context * ctx,
                                   const mtmd_input_chunk * chunk);

The function dispatches based on chunk type:

Text chunks: No-op (returns 0 with a warning)
Image chunks: Encodes via the vision CLIP model
Audio chunks: Encodes via the audio CLIP model using clip_image_batch_encode()
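
Because a text chunk is a no-op that only logs a warning, a caller iterating manually may prefer to filter text chunks out before calling mtmd_encode_chunk(). The sketch below shows that guard with a hypothetical `ChunkKind` enum standing in for the library's chunk type; it does not call the real API. In real code the comparison would be made against the value returned by mtmd_input_chunk_get_type().

```cpp
// Hypothetical mirror of the three chunk kinds described above.
enum class ChunkKind { Text, Image, Audio };

// True when a chunk of this kind carries media that must be encoded.
// Text chunks return false so the caller never triggers the no-op warning.
constexpr bool needs_encoding(ChunkKind kind) {
    return kind != ChunkKind::Text;
}
```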

mtmd_helper_eval_chunks() signature (from mtmd-helper.h):

// Automatically:
// 1. Run llama_decode() on text chunks
// 2. Run mtmd_encode_chunk() on media chunks, then mtmd_get_output_embd()
//    and then llama_decode()
// Returns 0 on success. NOT thread-safe.
MTMD_API int32_t mtmd_helper_eval_chunks(mtmd_context * ctx,
                                         struct llama_context * lctx,
                                         const mtmd_input_chunks * chunks,
                                         llama_pos n_past,
                                         llama_seq_id seq_id,
                                         int32_t n_batch,
                                         bool logits_last,
                                         llama_pos * new_n_past);

mtmd_helper_eval_chunks() source (mtmd-helper.cpp:379-406):

int32_t mtmd_helper_eval_chunks(mtmd_context * ctx,
                                struct llama_context * lctx,
                                const mtmd_input_chunks * chunks,
                                llama_pos n_past,
                                llama_seq_id seq_id,
                                int32_t n_batch,
                                bool logits_last,
                                llama_pos * new_n_past) {
    size_t n_chunks = mtmd_input_chunks_size(chunks);
    if (n_chunks == 0) {
        LOG_WRN("no chunks to eval\n");
        return 0;
    }
    for (size_t i = 0; i < n_chunks; i++) {
        bool chunk_logits_last = (i == n_chunks - 1) && logits_last;
        auto chunk = mtmd_input_chunks_get(chunks, i);
        int32_t res = mtmd_helper_eval_chunk_single(
            ctx, lctx, chunk, n_past, seq_id, n_batch, chunk_logits_last, &n_past);
        if (res != 0) {
            LOG_ERR("failed to eval chunk %zu\n", i);
            return res;
        }
        *new_n_past = n_past;
    }
    return 0;
}
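
The position threading and the `logits_last` handling in the loop above can be simulated without a model. The sketch below is a standalone mock: `EvalStep` and `simulate_eval()` are hypothetical names, and it assumes, for illustration only, that each chunk advances the position by its token count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of what one simulated chunk evaluation did.
struct EvalStep {
    int32_t start_pos;    // n_past when the chunk started
    int32_t n_tokens;     // tokens consumed by the chunk
    bool    want_logits;  // whether logits were requested for this chunk
};

// Simulate the loop: each chunk starts at the running n_past, advances it,
// and only the final chunk may request logits.
std::vector<EvalStep> simulate_eval(const std::vector<int32_t> & chunk_sizes,
                                    int32_t n_past,
                                    bool logits_last,
                                    int32_t * new_n_past) {
    assert(new_n_past != nullptr);
    std::vector<EvalStep> steps;
    for (size_t i = 0; i < chunk_sizes.size(); i++) {
        bool chunk_logits_last = (i == chunk_sizes.size() - 1) && logits_last;
        steps.push_back({n_past, chunk_sizes[i], chunk_logits_last});
        n_past += chunk_sizes[i];
        *new_n_past = n_past;  // written once per chunk, as in the source
    }
    return steps;
}
```

As in the source, the output position is written only inside the loop, so an empty chunk list leaves it untouched. Initializing the variable before the call avoids reading an indeterminate value in that case.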

I/O Contract

mtmd_tokenize():

Direction	Name	Type	Description
Input	ctx	`mtmd_context *`	Multimodal context
Input	output	`mtmd_input_chunks *`	Pre-allocated empty chunks container (from `mtmd_input_chunks_init()`)
Input	text	`const mtmd_input_text *`	Input text with media markers and tokenization flags
Input	bitmaps	`const mtmd_bitmap **`	Array of bitmap pointers (one per media marker)
Input	n_bitmaps	`size_t`	Number of bitmaps (must match marker count)
Output	(return)	`int32_t`	0 = success, 1 = bitmap/marker count mismatch, 2 = preprocessing error
Output	output (mutated)	`mtmd_input_chunks *`	Populated with ordered text and media chunks
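
For logging, the documented return codes can be mapped to messages. The helper below is a hypothetical convenience, not part of the library; the codes 0, 1 and 2 come from the header comment, and any other value is treated as unknown.

```cpp
#include <cstdint>

// Hypothetical helper: describe an mtmd_tokenize() return code.
constexpr const char * tokenize_status_str(int32_t code) {
    switch (code) {
        case 0:  return "success";
        case 1:  return "bitmap count does not match marker count";
        case 2:  return "image preprocessing error";
        default: return "unknown error";
    }
}
```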

mtmd_encode_chunk():

Direction	Name	Type	Description
Input	ctx	`mtmd_context *`	Multimodal context (contains encoder weights)
Input	chunk	`const mtmd_input_chunk *`	A single image or audio chunk from tokenization
Output	(return)	`int32_t`	0 = success, 1 = encoding failure
Output	(side effect)	internal	Encoded embeddings stored in context, retrievable via `mtmd_get_output_embd()`

mtmd_helper_eval_chunks():

Direction	Name	Type	Description
Input	ctx	`mtmd_context *`	Multimodal context
Input	lctx	`struct llama_context *`	Language model context for `llama_decode()`
Input	chunks	`const mtmd_input_chunks *`	Tokenized chunks from `mtmd_tokenize()`
Input	n_past	`llama_pos`	Starting position in KV cache
Input	seq_id	`llama_seq_id`	Sequence ID for KV cache
Input	n_batch	`int32_t`	Batch size for decoding
Input	logits_last	`bool`	If true, request logits for the final token of the last chunk only; earlier chunks never request logits
Output	(return)	`int32_t`	0 = success, non-zero = error from encode or decode
Output	new_n_past	`llama_pos *`	Updated position after all chunks have been evaluated

Usage Examples

Example 1: Complete multimodal inference pipeline

#include <cstdio>  // fprintf

#include "mtmd.h"
#include "mtmd-helper.h"

// Assume mtmd_ctx, llama model, and llama context are already initialized
// Assume bitmap has been loaded from file

// Prepare input text with media marker
mtmd_input_text input_text;
input_text.text = "Describe this image: <__media__>\nBe specific.";
input_text.add_special = true;
input_text.parse_special = true;

// Prepare bitmap array
const mtmd_bitmap * bitmaps[] = { bitmap_ptr };

// Tokenize
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
int32_t res = mtmd_tokenize(mtmd_ctx, chunks, &input_text, bitmaps, 1);
if (res != 0) {
    fprintf(stderr, "Tokenization failed: %d\n", res);
    mtmd_input_chunks_free(chunks);
    return 1;
}

// Evaluate all chunks through the model
llama_pos n_past = 0;
res = mtmd_helper_eval_chunks(mtmd_ctx, lctx, chunks, 0, 0, 512, true, &n_past);
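if (res != 0) {
    fprintf(stderr, "Evaluation failed: %d\n", res);
    // Do not fall through to sampling after a failed evaluation
    mtmd_input_chunks_free(chunks);
    return 1;
}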

// Now sample from logits at position n_past-1
// ... standard llama.cpp sampling loop ...

mtmd_input_chunks_free(chunks);

Example 2: Manual chunk-by-chunk processing

// After tokenization, manually process each chunk
size_t n_chunks = mtmd_input_chunks_size(chunks);
for (size_t i = 0; i < n_chunks; i++) {
    const mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
    enum mtmd_input_chunk_type type = mtmd_input_chunk_get_type(chunk);

    if (type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
        size_t n_tokens;
        const llama_token * tokens = mtmd_input_chunk_get_tokens_text(chunk, &n_tokens);
        // Feed tokens to llama_decode()...
    } else {
        // Encode the media chunk
        int32_t res = mtmd_encode_chunk(mtmd_ctx, chunk);
        if (res != 0) {
            fprintf(stderr, "Encoding failed\n");
            break;
        }
        // Retrieve embeddings
        float * embd = mtmd_get_output_embd(mtmd_ctx);
        size_t n_tokens = mtmd_input_chunk_get_n_tokens(chunk);
        // Feed embeddings to llama_decode()...
    }
}
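
When feeding embeddings manually, a media chunk may hold more rows than fit in one decode batch. The sketch below shows only the slicing arithmetic over a flat buffer and does not call llama_decode(). `EmbdSlice` and `slice_embeddings()` are hypothetical names, `n_embd` is assumed to be the embedding width the caller obtains from the model, and positions are assumed to advance by one per row, which may not hold for every model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view over one batch of embedding rows.
struct EmbdSlice {
    const float * data;       // first float of the slice
    int32_t       n_tokens;   // number of embedding rows in the slice
    int32_t       pos_start;  // position assigned to the first row
};

// Split a row-major [n_tokens x n_embd] buffer into slices of at most n_batch rows.
std::vector<EmbdSlice> slice_embeddings(const float * embd,
                                        int32_t n_tokens,
                                        int32_t n_embd,
                                        int32_t n_batch,
                                        int32_t n_past) {
    assert(n_batch > 0 && n_embd > 0);
    std::vector<EmbdSlice> slices;
    for (int32_t off = 0; off < n_tokens; off += n_batch) {
        int32_t remaining = n_tokens - off;
        int32_t n = remaining < n_batch ? remaining : n_batch;
        slices.push_back({embd + (size_t) off * n_embd, n, n_past + off});
    }
    return slices;
}
```

Each slice would then be decoded in turn, with the running position advanced by the slice's row count.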
